Exclude embedded images from the Proofing Copy RTF

greyelf commented 10 years ago

Embedded images appear as a large unreadable blobs of base64 text in the Proofing Copy RTF file. Because of this I believe it makes more sense to exclude the images from the generated RTF, which will improve file loading and spell checking of the contents.

mthuurne commented 10 years ago

Would it make sense to exclude other passages as well, such as style sheets and scripts? Maybe only include passsages for which isStoryText() returns True?

greyelf commented 10 years ago

Maybe, though scripts may contain string messages that need spell checking.

The main reason I suggested removing the base64 image is because they appeared as large continuous blocks of meaning less text.

mthuurne commented 10 years ago

I agree that omitting large blocks of base64 encoded data is more urgent, but that doesn't mean the fine tuning has to end there. Script passages tend to have a very low ratio of prose to non-prose and variable names etc. can trigger false positives when spell checking, so I think it would be better not to include them in the proofing copy.

greyelf commented 10 years ago

I was not implying that "fine tuning has to end" as you put it.

My point was that while we can say without doubt that the images will contain no text (that needs checking) and maybe with some certainty the same about the stylesheets, we can't say that about the scripts because we don't know exactly what the Author is going to do.

By removing the scripts we limit the Authors' choices, and they could still use the Ignore/New word feature that most spell-checkers have to handle false-positives.

mthuurne commented 10 years ago

A straightforward way to implement passage filtering would be to include only passages for which isStoryText returns True. In today's code, this would exclude the following items:

the StoryIncludes and StorySettings passages
passages tagged script, stylesheet, annotation, Twine.image, Twine.private, Twine.system

It seems Twine.private and Twine.system are not actually used, but if they were, they should be excluded. Twine.image is already excluded by your patch. StoryIncludes and StorySettings won't include human-readable text, so they're also safe to exclude.

For annotations, scripts and style sheets, I think we can choose from the follow criteria:

include if passage can contain human-readable text; all three would have to be included then
include if passage can contain text that is visible to the end-user (player); scripts and style sheets would have to be included then
include if passage is likely to contain text that is visible to the end-user (player); scripts are debatable in this case, style sheets and annotations would be excluded

Personally I think that script passages shouldn't use large text fragments: that's what the <<display>> macro is for. But there could be short strings that end up displayed as part of the story. So it's a trade-off between the chance of not detecting a bad string vs the time lost on false positives.

My experience with static code checkers tells me that if there are too many false positives, the entire tool is ignored by developers, so limiting false positives is important.

There is of course the option to make it configurable, but I don't like to use that as a way to avoid making design decisions. Configuration should only be used if different users actually want irreconcilable things.

mthuurne commented 10 years ago

When it comes to human-readable text in JavaScript code, it would all be in string literals, correct? So we could include only the string literals in the proofing copy. That would reduce the false positives and the visual clutter while still allowing spell checking of strings in script passages.

greyelf commented 10 years ago

Yes, it would most likely be in string literals.

tweecode / twine

Exclude embedded images from the Proofing Copy RTF #152