Trimming substrings is unsound

simoncozens commented 1 year ago

Random thought: you go through the wordlist and remove words which are substrings of another word. I guess the thinking is “rats has all the letters of rat, so we don’t need to test rat by itself.” This is probably true for Latin but unsound in general. The easiest way to see why is to imagine your input is Arabic. “rat” has a final t but “rats” has a medial t; the t is doing different things in the two cases, so it’s not correct to use a superstring to “include” a test for a substring. The same goes for anything that does contextual substitution based on letter position, including Latin handwriting fonts…
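To make the "rat" / "rats" case concrete, here is a minimal sketch. `trim_substrings` is my guess at what the trimming step amounts to, not the project's actual code, and `toy_shape` is a hypothetical stand-in for a real shaper that simply tags each letter with its position in the word.

```python
def trim_substrings(words):
    """Drop every word that is a substring of some other word (assumed behaviour)."""
    unique = set(words)
    return sorted(
        w for w in unique
        if not any(w != other and w in other for other in unique)
    )


def toy_shape(word):
    """Hypothetical shaper: one glyph name per letter, chosen by position."""
    glyphs = []
    for i, ch in enumerate(word):
        if len(word) == 1:
            form = "isol"
        elif i == 0:
            form = "init"
        elif i == len(word) - 1:
            form = "fina"
        else:
            form = "medi"
        glyphs.append(f"{ch}.{form}")
    return glyphs


words = ["rat", "rats"]
kept = trim_substrings(words)  # only "rats" survives

# Glyphs exercised by the full list versus the trimmed list.
glyphs_before = {g for w in words for g in toy_shape(w)}
glyphs_after = {g for w in kept for g in toy_shape(w)}
lost = glyphs_before - glyphs_after  # the final form of "t" is never tested
print(kept, lost)
```

With a positional font, the trimmed list never reaches the final "t", even though every character of "rat" appears in "rats".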

I don’t know how much difference this makes in practice given a big enough word list, but I’m not convinced there’s a logical basis for doing it.

m4rc1e commented 1 year ago

Fair point but doesn't Arabic have unicodes for each positional form though?

I think I'll do some coverage tests (check how many glyph IDs HarfBuzz has seen) to see what the damage is.
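A coverage test along those lines could look roughly like the sketch below. It assumes the third-party `uharfbuzz` bindings for the real shaper. The helper names (`make_hb_shaper`, `glyph_coverage`, `coverage_loss`) are made up for illustration and are not part of diffenator2.

```python
def make_hb_shaper(font_path):
    """Return a function mapping text to the glyph IDs HarfBuzz emits for it.

    Assumes uharfbuzz is installed; it is imported lazily so the coverage
    helpers below also work with any other shaping callable.
    """
    import uharfbuzz as hb

    font = hb.Font(hb.Face(hb.Blob.from_file_path(font_path)))

    def shape(text):
        buf = hb.Buffer()
        buf.add_str(text)
        buf.guess_segment_properties()
        hb.shape(font, buf, {})
        # In uharfbuzz, glyph_infos[i].codepoint holds the glyph ID after shaping.
        return [info.codepoint for info in buf.glyph_infos]

    return shape


def glyph_coverage(words, shape):
    """Set of glyph IDs produced when every word is shaped."""
    seen = set()
    for word in words:
        seen.update(shape(word))
    return seen


def coverage_loss(full_words, trimmed_words, shape):
    """Glyph IDs reached by the full list but not by the trimmed list."""
    return glyph_coverage(full_words, shape) - glyph_coverage(trimmed_words, shape)
```

Usage would be something like `coverage_loss(all_words, trimmed, make_hb_shaper("Font.ttf"))`, where the font path is a placeholder. An empty result means trimming cost nothing for that font and word list.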

simoncozens commented 1 year ago

> Fair point but doesn't Arabic have unicodes for each positional form though?

Yes but no. You will not find text encoded in the "presentation forms"; normally, for each positional form, the Unicode character is the same and the shaper changes the glyph to the positional form. ہہہ is U+06C1 U+06C1 U+06C1.
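This is easy to check with Python's standard `unicodedata` module. The sketch below lists the code points of ہہہ and shows that the four presentation-form characters for the same letter (U+FBA6 to U+FBA9) all fold back to U+06C1 under NFKC normalization. The variable names are just for illustration.

```python
import unicodedata

# Three copies of the same character; the shaper picks the positional glyphs.
text = "\u06C1\u06C1\u06C1"  # ہہہ
codepoints = [f"U+{ord(ch):04X}" for ch in text]

# Legacy presentation forms of the same letter (isolated, final, initial, medial).
presentation_forms = ["\uFBA6", "\uFBA7", "\uFBA8", "\uFBA9"]
folded = [unicodedata.normalize("NFKC", ch) for ch in presentation_forms]

print(codepoints)
for form, base in zip(presentation_forms, folded):
    print(f"U+{ord(form):04X} {unicodedata.name(form)} -> U+{ord(base):04X}")
```

So a substring check on the encoded text sees three identical characters, while the font may render three different glyphs.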

Even for Latin, you might have a handwriting font which provides a "final form" for "t" but not for "s".

m4rc1e commented 1 year ago

Instead of removing character substrings, I may try removal based on HarfBuzz glyph ID sequences.
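One way that idea might be sketched: shape every word first, then drop a word only when its glyph ID sequence appears as a contiguous run inside another word's sequence. The function names are hypothetical, and `shape` is any callable returning glyph IDs (for example a HarfBuzz wrapper). The two lambdas are toy shapers, not real fonts.

```python
def is_contiguous_run(needle, haystack):
    """True if the tuple `needle` occurs as a contiguous slice of `haystack`."""
    n = len(needle)
    return any(haystack[i:i + n] == needle for i in range(len(haystack) - n + 1))


def trim_by_gid_sequence(words, shape):
    """Keep one word per distinct glyph sequence, minus sequences covered by others."""
    # Words that shape identically are redundant; keep the first alphabetically.
    by_sequence = {}
    for word in sorted(set(words)):
        by_sequence.setdefault(tuple(shape(word)), word)

    kept = []
    for gids, word in by_sequence.items():
        covered = any(
            other != gids and is_contiguous_run(gids, other)
            for other in by_sequence
        )
        if not covered:
            kept.append(word)
    return sorted(kept)


# Toy shapers: one with no contextual forms, one with a distinct final form.
plain = lambda w: [ord(c) for c in w]
positional = lambda w: [ord(c) + (1000 if i == len(w) - 1 else 0) for i, c in enumerate(w)]

print(trim_by_gid_sequence(["rat", "rats"], plain))       # "rat" is covered by "rats"
print(trim_by_gid_sequence(["rat", "rats"], positional))  # both words are kept
```

This still ignores anything beyond glyph IDs, such as positioning differences, so it is a weaker claim than "the shaped output is identical in context".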

googlefonts / diffenator2

Trimming substrings is unsound #3