minad / jinx

🪄 Enchanted Spell Checker
GNU General Public License v3.0

Trailing dot included in word when using multiple languages #34

Closed (yalfg closed this issue 1 year ago)

yalfg commented 1 year ago

Hi,

I work mostly in English but sometimes edit texts in French. When I set jinx-languages to ("en_US.UTF-8" "fr_FR.UTF-8") to support both, I see odd behavior when editing US English texts: in every sentence the trailing dot "." is included in the last word, which is then flagged as an error. For example, with "This is an example." jinx reports a misspelling for "example." and suggests "example" as a fix. When I leave jinx-languages at its default, en_US.UTF-8, all is well (and this is what I use for now).

The issue is easy to reproduce from emacs -Q. I only tried French as the extra language, so it remains to be confirmed whether the issue is specific to this locale or related to enabling multiple languages in general.

Thanks!

minad commented 1 year ago

There is no obvious fix for this issue, since to detect words we obtain the word characters from both dictionaries. If they are incompatible, we have to repair this manually. I suggest you adjust jinx--syntax-table manually so that it does not include the dot as a word character. You can do this in jinx-mode-hook.
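To make the mechanism concrete, here is a minimal Python sketch (not jinx's actual code) of a tokenizer that takes the union of the extra word characters from all active dictionaries. The `WORD_CHARS` table and the `tokenize` helper are hypothetical names; the French character list is the WORDCHARS value quoted elsewhere in this thread.

```python
import re

# Assumed extra word characters per dictionary (illustrative values).
WORD_CHARS = {
    "en_US": "",
    "fr_FR": "-’'1234567890.",
}

LETTER = r"[^\W\d_]"  # any Unicode letter


def tokenize(text, languages):
    """Split text into tokens made of letters plus the union of the
    extra word characters of every active dictionary."""
    extra = "".join(WORD_CHARS[lang] for lang in languages)
    if extra:
        unit = f"(?:{LETTER}|[{re.escape(extra)}])"
    else:
        unit = LETTER
    return re.findall(f"{unit}+", text)


english_only = tokenize("This is an example.", ["en_US"])
mixed = tokenize("This is an example.", ["en_US", "fr_FR"])
print(english_only[-1])  # last token with English only
print(mixed[-1])         # last token with English and French
```

With English alone the last token is the bare word, while adding French makes the dot part of the token, which the English dictionary then rejects.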

EDIT: We should probably maintain syntax tables per dictionary and then split words again depending on those dictionaries before checking them.
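The per-dictionary idea could look roughly like the Python sketch below. It is only an illustration under stated assumptions, not how jinx is implemented: `WORD_CHARS`, `DICTIONARIES`, `split_for`, and `is_misspelled` are hypothetical, and the word lists are toy data. A coarse token is split again with each dictionary's own word characters and accepted if some dictionary knows every resulting piece.

```python
import re

# Hypothetical toy data: extra word characters and known words per dictionary.
WORD_CHARS = {"en_US": "", "fr_FR": "-’'1234567890."}
DICTIONARIES = {
    "en_US": {"this", "is", "an", "example"},
    "fr_FR": {"exemple", "aujourd'hui"},
}

LETTER = r"[^\W\d_]"  # any Unicode letter


def split_for(lang, token):
    """Re-split a coarse token using only the word characters of `lang`."""
    extra = WORD_CHARS[lang]
    unit = f"(?:{LETTER}|[{re.escape(extra)}])" if extra else LETTER
    return re.findall(f"{unit}+", token)


def is_misspelled(token):
    """Accept the token if, for at least one dictionary, every piece
    obtained with that dictionary's word characters is a known word."""
    for lang, words in DICTIONARIES.items():
        pieces = split_for(lang, token)
        if pieces and all(piece.lower() in words for piece in pieces):
            return False
    return True
```

Here `is_misspelled("example.")` is false because the English split drops the dot, while `is_misspelled("aujourd'hui")` is false because the French split keeps the apostrophe inside the word.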

yalfg commented 1 year ago

Thanks for reopening ;) I noticed this:

❱ enchant-lsmod-2 -word-chars en_US

❱ enchant-lsmod-2 -word-chars fr_FR
-’'1234567890.

This may be the reason for the issue: the dot is in the list for fr_FR.

I really don't understand the output for French (and I'm French!). I don't know where it comes from or why it is set this way, but it would seem best to make it empty, as for US English. I'll try to better understand where it comes from and whether it can be configured. IMHO none of these characters should be considered word characters.

minad commented 1 year ago

I pushed a simple fix for now. We can refine this if more issues come up in various other language combinations.

yalfg commented 1 year ago

Thanks! FYI, this comes straight from the Hunspell French dictionary's affix file, /usr/share/hunspell/fr.aff, which includes:

WORDCHARS -’'1234567890.
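For anyone who wants to check other dictionaries the same way, a small Python sketch that pulls the WORDCHARS value out of affix file content might look like this. The `read_wordchars` helper is hypothetical, and the sample text is a made-up two-line fragment rather than the real fr.aff.

```python
def read_wordchars(aff_text):
    """Return the value of the WORDCHARS directive in Hunspell affix
    file content, or an empty string if the directive is absent."""
    for line in aff_text.splitlines():
        parts = line.split(None, 1)  # directive name, then the rest
        if len(parts) == 2 and parts[0] == "WORDCHARS":
            return parts[1].strip()
    return ""


# Made-up fragment for illustration; a real file has many more directives.
sample = "SET UTF-8\nWORDCHARS -’'1234567890.\n"
chars = read_wordchars(sample)
print("." in chars)
```

To run it on a real file, read the file into a string first, using whatever encoding its SET directive declares (UTF-8 is an assumption here).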

I guess Hunspell itself is fine with it. The challenge here would be on deciding what to use in a mixed language mode, where different languages have different settings. But yes, ignoring the dot will be effective here.
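One possible policy for the mixed-language case, sketched in Python: merge the word characters of all dictionaries but drop an explicit ignore list that contains the dot. This is an assumption about what "ignoring the dot" could mean, not a description of the fix that was pushed; `combine_word_chars` and its `ignored` parameter are hypothetical.

```python
def combine_word_chars(per_language, ignored="."):
    """Merge the extra word characters of several dictionaries,
    keeping first-seen order and skipping any ignored characters."""
    combined = []
    for chars in per_language.values():
        for ch in chars:
            if ch not in ignored and ch not in combined:
                combined.append(ch)
    return "".join(combined)


merged = combine_word_chars({"en_US": "", "fr_FR": "-’'1234567890."})
print(merged)
```

The apostrophes and hyphen that French needs inside words survive, and only the dot is dropped, so sentence-final words are no longer flagged.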

minad commented 1 year ago

> The challenge here would be on deciding what to use in a mixed language mode, where different languages have different settings.

Yes, that's the problem.