sbraz opened this issue 3 years ago
It's not fixable in the tokenizer, but in the SpellChecker class we do already have some special cases. Currently there's one character of leading and trailing context that's used. Technically that could be extended to cover this, but I'm not really convinced this is worth it. I can't really think of many examples in languages I know, only some weird expressions like hanky-panky or topsy-turvy.
https://github.com/otsaloma/gaupol/blob/master/aeidon/spell.py#L61
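For illustration, extending that one character of context to cover hyphenated halves could look roughly like this (hypothetical code, not the actual aeidon.SpellChecker linked above; the set stands in for the real dictionary backend):

```python
# Hypothetical sketch, not the actual aeidon.SpellChecker: it only
# shows how one character of leading/trailing context could be used
# to accept a token that is really half of a hyphenated dictionary
# entry. The set below stands in for the real dictionary backend.

dictionary = {"talkie-walkie", "vol-au-vent", "chauve-souris"}

def check_with_context(token, leading, trailing):
    """Return True if token should be accepted as correctly spelled.

    leading and trailing are the single characters adjacent to the
    token in the original text ("" at string boundaries).
    """
    if token in dictionary:
        return True
    # A hyphen right next to the token hints that it may be a fragment
    # of a hyphenated entry such as "talkie-walkie".
    if leading == "-" or trailing == "-":
        return any(token in entry.split("-") for entry in dictionary)
    return False

print(check_with_context("talkie", "", "-"))  # True
print(check_with_context("walkie", "-", ""))  # True
print(check_with_context("walkie", "", ""))   # False
```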
Off-topic question: does the name "aeidon" mean anything?
I had to pick a name when separating that user-interface-independent module from the codebase. Gaupol doesn't mean anything either, so I continued in the same style, and I also wanted the length to match so that I could do a search-and-replace across the codebase without needing to manually fix hanging indents.
> I can't really think of many examples in languages I know, only some weird expressions like hanky-panky or topsy-turvy.
Here are some ugly one-liners that seem to show words that would not be recognised:
```
$ grep -P '^\p{L}+-\p{L}+\b' /usr/share/myspell/fr_FR.dic -o | tr '-' '\n' | aspell --list -l fr | sort -u | while read i; do grep -P -- "-$i\b|^$i-" /usr/share/myspell/fr_FR.dic; done | sort -u | wc -l
1315
$ grep -P '^\p{L}+-\p{L}+\b' /usr/share/myspell/en_GB.dic -o | tr '-' '\n' | aspell --list -l en | sort -u | while read i; do grep -P -- "-$i\b|^$i-" /usr/share/myspell/en_GB.dic; done | sort -u | wc -l
892
```
There are probably some false positives but it's still not negligible IMO. Do you think it would affect performance a lot to take those into account?
If we do, we also need to take into account words like vol-au-vent, for which we'd need to add context in both directions.
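To make the "both directions" point concrete, here is a tiny illustration (hypothetical values, not gaupol output) of what a checker sees for each half of vol-au-vent when it only gets one character of context on each side:

```python
# Hypothetical values, not gaupol output: with one character of
# context on each side, the middle token of "vol-au-vent" only sees
# a hyphen to its left and to its right, so the full entry can only
# be rebuilt by extending the word in both directions.

tokens = [
    # (token, leading_context, trailing_context)
    ("vol",  "",  "-"),
    ("au",   "-", "-"),   # needs extension both ways
    ("vent", "-", ""),
]
for token, leading, trailing in tokens:
    print(f"{token!r}: leading={leading!r} trailing={trailing!r}")
```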
I can't run those greps; my Debian doesn't seem to have myspell, only hunspell files, and they probably have a different format.
I think maybe we could make the function signature

```python
def check(self, word, extended_word="", leading_context="", trailing_context=""):
```

and extended_word would then extend both ways, at least across dashes. It's doable of course. I don't see a performance issue there, just a question of how much to complicate the code for special cases.
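A rough sketch of how that fallback could behave (hypothetical code, not the real aeidon.SpellChecker; the word set stands in for the actual dictionary backend):

```python
# Hypothetical sketch of the proposed signature, not the real
# aeidon.SpellChecker: if the plain token fails, fall back to the
# hyphen-extended form ("walkie" -> "talkie-walkie") before
# reporting a mistake.

class SpellChecker:

    def __init__(self, words):
        self.words = set(words)

    def check(self, word, extended_word="", leading_context="", trailing_context=""):
        # Plain token first, as now.
        if word in self.words:
            return True
        # Fall back to the word extended across dashes in both
        # directions, if the caller was able to build one.
        if extended_word and extended_word in self.words:
            return True
        # leading_context/trailing_context stay available for the
        # existing single-character special cases.
        return False

checker = SpellChecker({"talkie-walkie", "vol-au-vent"})
print(checker.check("walkie", extended_word="talkie-walkie"))  # True
print(checker.check("walkie"))                                 # False
```

In this reading the extended form is only consulted when the plain token fails, so the common case would stay as cheap as it is now.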
Hi, I'm not sure how to work around this issue, but I see that the spell checker tokenizer splits words on hyphens: https://github.com/otsaloma/gaupol/blob/30a2ed875f489142e975361d5f19f6821deb97f4/aeidon/spell.py#L255

This can break spell checking for the following French subtitle:
Although both words are present in the dictionary, they won't be recognised because neither "twin" nor "talkie" is listed:

I think in general splitting on hyphens is a good idea, but maybe we could do something to expand the selection to the full word when the checker returns a mistake. It doesn't seem very straightforward; do you suppose it's worth it?
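A minimal sketch of that "expand the selection on a mistake" idea (hypothetical code, not gaupol's implementation; the word set and regexes below are stand-ins for the real tokenizer and dictionary backend):

```python
# Hypothetical sketch, not gaupol code: split on hyphens as before,
# but when a half is rejected, re-check the full hyphen-joined word
# and only flag the span if that fails too.

import re

WORDS = {"il", "faut", "un", "talkie-walkie"}

def is_correct(word):
    return word.lower() in WORDS

def misspelled_spans(text):
    """Yield (start, end) spans to highlight as spelling mistakes."""
    # Full hyphen-joined words, e.g. "talkie-walkie".
    full_words = {(m.start(), m.end()): m.group(0)
                  for m in re.finditer(r"[^\W\d_]+(?:-[^\W\d_]+)*", text)}
    # Hyphen-split tokens, as the current tokenizer produces them.
    for m in re.finditer(r"[^\W\d_]+", text):
        if is_correct(m.group(0)):
            continue
        # Token failed: accept it if the full word containing it passes.
        ok = any(start <= m.start() and m.end() <= end and is_correct(word)
                 for (start, end), word in full_words.items())
        if not ok:
            yield m.start(), m.end()

text = "Il faut un talkie-walkie."
print([text[a:b] for a, b in misspelled_spans(text)])  # [] -- nothing flagged
```

With this approach the tokenizer keeps splitting on hyphens and the extra lookup only happens for tokens that already failed, so the common case stays cheap.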