otsaloma / gaupol

Editor for text-based subtitle files
https://otsaloma.io/gaupol/
GNU General Public License v3.0

Spellcheck does not recognise words containing hyphens #184

Open sbraz opened 3 years ago

sbraz commented 3 years ago

Hi, I'm not sure how to work around this issue, but I see that the spell checker's tokenizer splits words on hyphens: https://github.com/otsaloma/gaupol/blob/30a2ed875f489142e975361d5f19f6821deb97f4/aeidon/spell.py#L255
This can break spellchecking for the following French subtitle:

1
00:00:00,000 --> 00:00:03,000
twin-set talkie-walkie

Although both words are present in the dictionary, they won't be recognised, because neither twin nor talkie is listed on its own:

$ grep -P '^(twin|talkie)' /usr/share/myspell/dicts/fr_FR.dic
talkies-walkies/D'Q' po:nom is:mas is:pl
talkie-walkie/L'D'Q' po:nom is:mas is:sg
twin-set/S.() po:nom is:mas
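For illustration, here's a quick check with pyenchant (I'm not sure this matches the backend spell.py actually uses, so treat it as a sketch):

import enchant

# Assumes the fr_FR MySpell/Hunspell dictionary shown above is installed.
checker = enchant.Dict("fr_FR")
print(checker.check("talkie-walkie"))  # True: the hyphenated entry exists
print(checker.check("talkie"))         # False: the fragment is not listed on its own
print(checker.check("walkie"))         # False as well

So once the tokenizer has split on the hyphen, only the fragments ever reach the checker and both get flagged.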

I think splitting on hyphens is generally a good idea, but maybe we could do something to expand the selection to the full word when the checker returns a mistake. It doesn't seem very straightforward; do you think it's worth it?

Off-topic question: does the name "aeidon" mean anything?

otsaloma commented 3 years ago

It's not fixable in the tokenizer, but the SpellChecker class does already have some special cases. Currently one character of leading and trailing context is used. Technically that could be extended to cover this, but I'm not really convinced this is worth it. I can't really think of many examples in languages I know, only some weird expressions like hanky-panky or topsy-turvy.

https://github.com/otsaloma/gaupol/blob/master/aeidon/spell.py#L61
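To spell out what I mean, the idea is roughly this (a simplified sketch, not the actual code; check_word stands in for the backend call):

def check_with_context(check_word, word, leading="", trailing=""):
    # Simplified sketch: if the bare token is flagged, retry with one
    # character of leading/trailing context before reporting an error.
    if check_word(word):
        return True
    if leading and check_word(leading + word):
        return True
    if trailing and check_word(word + trailing):
        return True
    return False

Covering twin-set or talkie-walkie would mean extending that context from a single character to a whole hyphen-joined neighbour.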

Off-topic question: does the name "aeidon" mean anything?

I had to pick a name when separating that user-interface-independent module from the codebase. Gaupol doesn't mean anything either, so I continued in the same style. I also wanted the lengths to match so that I could do a search-and-replace across the codebase without having to manually fix hanging indents.

sbraz commented 3 years ago

I can't really think of many examples in languages I know, only some weird expressions like hanky-panky or topsy-turvy.

Here are some ugly one-liners that count hyphenated dictionary words that would not be recognised:

$ grep -P '^\p{L}+-\p{L}+\b' /usr/share/myspell/fr_FR.dic -o | tr '-' '\n' | aspell --list -l fr | sort -u | while read i; do grep -P -- "-$i\b|^$i-" /usr/share/myspell/fr_FR.dic;done | sort -u | wc -l
1315
$ grep -P '^\p{L}+-\p{L}+\b' /usr/share/myspell/en_GB.dic -o | tr '-' '\n' | aspell --list -l en | sort -u | while read i; do grep -P -- "-$i\b|^$i-" /usr/share/myspell/en_GB.dic;done | sort -u | wc -l
892

There are probably some false positives, but the numbers still aren't negligible IMO. Do you think taking those into account would affect performance a lot?

If we do, we also need to take into account words like vol-au-vent, for which we'd need to add context in both directions.
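If it helps, here's a rough Python equivalent of those one-liners (untested sketch; it uses pyenchant instead of aspell, so the counts will probably differ a bit):

import re
import enchant

def count_unrecognised_compounds(dic_path, lang):
    # Count hyphenated .dic entries with at least one part that the
    # spell checker rejects on its own (rough approximation).
    checker = enchant.Dict(lang)
    hits = set()
    with open(dic_path, encoding="utf-8", errors="replace") as f:
        for line in f:
            fields = line.split()
            if not fields:
                continue
            word = fields[0].split("/")[0]
            if not re.fullmatch(r"\w+(?:-\w+)+", word):
                continue
            if any(not checker.check(part) for part in word.split("-")):
                hits.add(word)
    return len(hits)

print(count_unrecognised_compounds("/usr/share/myspell/fr_FR.dic", "fr_FR"))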

otsaloma commented 3 years ago

I can't run those greps; my Debian doesn't seem to have myspell, only hunspell files, and those probably have a different format.

I think maybe we could make the function signature

def check(self, word, extended_word="", leading_context="", trailing_context=""):

That extended_word would then extend the token in both directions, at least across dashes. It's doable, of course. I don't see a performance issue there; it's just a question of how much to complicate the code for special cases.
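Something along these lines (names are illustrative, just sketching the idea, not actual code):

def check(self, word, extended_word="", leading_context="", trailing_context=""):
    # Sketch: accept the token if either the bare word or the hyphen-extended
    # form (e.g. "talkie" -> "talkie-walkie") passes the backend check.
    if self.checker.check(word):
        return True
    if extended_word and extended_word != word and self.checker.check(extended_word):
        return True
    # ... the existing one-character leading/trailing context handling
    # would stay here, unchanged ...
    return False

The caller would build extended_word by walking outwards from the token across dashes in both directions, which would also cover the vol-au-vent case, for example:

def extend_across_hyphens(text, start, end):
    # Illustrative helper: grow text[start:end] over "-"-joined letters
    # on both sides, so "au" in "vol-au-vent" yields the whole compound.
    while start >= 2 and text[start-1] == "-" and text[start-2].isalpha():
        start -= 2
        while start > 0 and text[start-1].isalpha():
            start -= 1
    while end + 1 < len(text) and text[end] == "-" and text[end+1].isalpha():
        end += 2
        while end < len(text) and text[end].isalpha():
            end += 1
    return text[start:end]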