standardebooks / tools

The Standard Ebooks toolset for producing our ebook files.

Improve t-074 and its test #689

Closed vr8hub closed 6 months ago

vr8hub commented 6 months ago

This changes the t-074 test to look for two consecutive single characters surrounded by dashes. As detailed in our discussion on the other PR, matching a single character between dashes produces more false positives than requiring two characters misses (e.g., this will allow us to get rid of five ignore files that had to be created today).

I also got rid of the length limit, on the presumption that at least two characters is pretty likely to be a valid hit wherever it's found. (A run on the corpus turned up no false positives.)
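To illustrate the trade-off, here is a minimal sketch of the two matching strategies. The patterns and function names are hypothetical and are not the regex used by the actual t-074 check; the sketch also assumes plain ASCII hyphens and ignores any XHTML markup.

```python
import re

# Hypothetical patterns for illustration only, not the real t-074 regex.
# [^\W\d_] matches a single letter (a word character that is not a digit or underscore).

# Old idea: one single letter between hyphens, e.g. the "-o-" in "man-o-war".
ONE_SINGLE = re.compile(r"-[^\W\d_]-")

# New idea: two consecutive single letters, each surrounded by hyphens,
# e.g. the "-o-r-" in "w-o-r-d". Note there is no overall length limit.
TWO_SINGLES = re.compile(r"-[^\W\d_]-[^\W\d_]-")


def has_one_single(text: str) -> bool:
    """Return True if any single letter sits between two hyphens."""
    return ONE_SINGLE.search(text) is not None


def has_two_singles(text: str) -> bool:
    """Return True if two consecutive hyphen-delimited single letters appear."""
    return TWO_SINGLES.search(text) is not None
```

Under these assumptions, an ordinary compound such as `man-o-war` trips the one-character pattern but not the two-character one, while `w-o-r-d` and `O-w-w-w-w` are caught by both.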

(I had already done all this before I tried to pull the PR and saw you had made further changes to the test. Since I had covered more exclusions and had valid entries as well, I went ahead and left my test as it was. It covers everything yours did and more.)

I think a search for a "word" consisting entirely of single characters and dashes is still something lint can try to catch, but I need to do some more testing before I propose anything. One of the problems is that what's found could be either an unitalicized sound (O-w-w-w-w) or a spelled-out word (w-o-r-d) without grapheme/phoneme tags on the letters, so the message will need to be ambiguous.
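As a rough sketch of what such a search might look like (nothing here is a proposed implementation): the pattern, the function name, the message text, and the minimum of three letters are all assumptions made for illustration, and the sketch works on plain text with ASCII hyphens rather than on XHTML.

```python
import re

# Hypothetical pattern: a whole "word" made only of single letters joined by hyphens.
# The lookarounds stop it from matching inside a longer hyphenated word,
# and {2,} (an assumption) requires at least three letters.
SPELLED_OUT = re.compile(r"(?<![\w-])[^\W\d_](?:-[^\W\d_]){2,}(?![\w-])")

# The match alone cannot tell a sound from a spelled-out word,
# so this hypothetical message names both possibilities.
MESSAGE = "Possible unitalicized sound or spelled-out word without grapheme/phoneme markup"


def find_candidates(text: str) -> list[tuple[str, str]]:
    """Return (match, message) pairs for each candidate found in plain text."""
    return [(match.group(0), MESSAGE) for match in SPELLED_OUT.finditer(text)]
```

With these assumptions, `find_candidates("He cried O-w-w-w-w and spelled w-o-r-d on a man-o-war.")` reports `O-w-w-w-w` and `w-o-r-d` but skips `man-o-war`.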

acabal commented 6 months ago

Great, thanks!