Normalisation is inappropriate

silnrsi / oxttools

Tools for creating language support oxt extensions for LibreOffice

MIT License

Normalisation is inappropriate #6

Open Richard57 opened 5 years ago

Richard57 commented 5 years ago

Normalising the text of the Hunspell dictionary and affix files is inappropriate.

LibreOffice does not normalise text on input and, unlike, say, Wikipedia, does not normalise it upon saving. As typing unnormalised text may be the most natural input method, it makes sense for a spell-checker to use ICONV in the affix file to bring words to a canonical form. This canonicalisation would be destroyed by normalising the affix file.
Some morphological alternations may most simply be handled by using non-NFC forms in the lexicon.

This normalisation is performed by function zipnfcfile() in script makeoxt.
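
To illustrate the first point, here is a minimal Python sketch of what NFC normalisation does to an ICONV entry. The entry itself is hypothetical (it is not taken from any real affix file), and the sketch only uses the standard `unicodedata` module; it does not call anything from makeoxt.

```python
import unicodedata

# Hypothetical ICONV entry: map precomposed U+00E9 to "e" + U+0301 (combining acute).
iconv_line = "ICONV \u00e9 e\u0301"

# Apply the same kind of normalisation that is being objected to.
normalised = unicodedata.normalize("NFC", iconv_line)

# Split the entry into its keyword, pattern and replacement fields.
_, pattern, replacement = normalised.split(" ")

# After NFC, both fields hold the same precomposed character,
# so the entry no longer converts anything.
print(pattern == replacement)
```

Once the pattern and the replacement are identical, the entry is a no-op, which is the sense in which the canonicalisation is destroyed.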

n8marti commented 1 year ago

I concur with this assertion. The target language may use NFC or NFD characters, and makeoxt should be agnostic and hands-off about this. In my case the NFC normalization breaks LO's ability to correctly identify words that use NFD characters, because my AFF file does use ICONV, as @Richard57 suggests, and my DIC file uses NFD characters.
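
A rough Python sketch of this failure mode follows. The word and the ICONV entry are made up for illustration (they are not from the sg-CF files), and `apply_iconv` is a hypothetical, much simplified stand-in for Hunspell's input conversion, assumed here to be plain substring replacement.

```python
import unicodedata


def apply_iconv(word, table):
    """Simplified stand-in for ICONV: replace each pattern with its replacement."""
    for pattern, replacement in table:
        word = word.replace(pattern, replacement)
    return word


# Hypothetical ICONV entry: precomposed U+00EA -> "e" + U+0302 (combining circumflex).
iconv_table = [("\u00ea", "e\u0302")]

# Hypothetical lexicon stored in decomposed (NFD) form.
dic_nfd = {"ye\u0302"}

# The same lexicon after being NFC-normalised at packaging time.
dic_after_nfc = {unicodedata.normalize("NFC", w) for w in dic_nfd}

typed = "y\u00ea"  # the user types the precomposed character
converted = apply_iconv(typed, iconv_table)

print(converted in dic_nfd)        # lookup against the untouched lexicon
print(converted in dic_after_nfc)  # lookup against the NFC-normalised lexicon
```

Under these assumptions the converted input matches the original NFD lexicon but not the NFC-normalised copy, so a correctly spelled word would be flagged.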

DavidLRowe commented 1 year ago

@n8marti I'm currently looking at this. I'd love to have some simple test data, say your AFF file and a DIC file with six words that include NFD characters. I can then make a test file using the words from the DIC file.

n8marti commented 1 year ago

sg-CF.aff.txt sg-CF.dic.txt

I had to rename the files b/c GitHub doesn't like the non-txt extensions. I've made some other changes to these files since I last built my OXT extension, but I think they will still exhibit the problem if you build it with makeoxt.

DavidLRowe commented 1 year ago

Commit 461379c attempts to address this issue:

A -n None parameter is added to bypass the default NFC normalization.
The zipnfcfile function is replaced by zipnormfile, which includes an optional parameter (norm=None) to bypass normalization.
The affix file is never normalized, even for the default NFC case. (As @Richard57 and @n8marti pointed out, that was destroying the ICONV information.)
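
The behaviour described in the list above could be sketched roughly as follows. This is not the code from the commit: `add_text_to_zip`, `build_archive`, `read_member` and the archive paths are hypothetical names chosen for illustration, and only the standard `zipfile`, `io` and `unicodedata` modules are used.

```python
import io
import unicodedata
import zipfile


def add_text_to_zip(zf, arcname, text, norm="NFC"):
    """Write text into an open ZipFile, normalising only when norm is not None."""
    if norm is not None:
        text = unicodedata.normalize(norm, text)
    zf.writestr(arcname, text.encode("utf-8"))


def build_archive(dic_text, aff_text, norm="NFC"):
    """Build an in-memory zip holding a dictionary file and an affix file."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        # The lexicon follows the requested normalisation (or none at all).
        add_text_to_zip(zf, "dictionaries/xx.dic", dic_text, norm=norm)
        # The affix file is always written untouched so ICONV entries survive.
        add_text_to_zip(zf, "dictionaries/xx.aff", aff_text, norm=None)
    return buf.getvalue()


def read_member(data, name):
    """Return one archive member decoded as UTF-8 text."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        return zf.read(name).decode("utf-8")


# Hypothetical inputs: an NFD lexicon entry and an ICONV entry mapping NFC to NFD.
dic_text = "ye\u0302\n"
aff_text = "ICONV \u00ea e\u0302\n"

default_zip = build_archive(dic_text, aff_text)             # default NFC behaviour
bypass_zip = build_archive(dic_text, aff_text, norm=None)   # analogous to -n None

print(read_member(bypass_zip, "dictionaries/xx.dic") == dic_text)
```

In this sketch, passing `norm=None` leaves the lexicon byte-for-byte as supplied, while the affix file is left alone in both cases.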

In addition, some changes were made to the documentation. (I hope it's okay to have included sg-CF in an example. Thanks, @n8marti, for the sample files.)

I have not yet built the Windows executable, but this should work on Linux. Any feedback welcomed.

DavidLRowe commented 1 year ago

makeoxt.exe available in zip file at https://github.com/silnrsi/oxttools/releases/tag/v0.6

n8marti commented 1 year ago

Great. This (Linux version) works for me now, thanks.