giellalt / lang-sje

Finite state and Constraint Grammar based analysers and proofing tools, and language resources for the Pite Sami language
https://giellalt.uit.no
Creative Commons Attribution 4.0 International

generating (desc) a wordform returns two wordforms for each character with a diacritic #1

Closed jeutzsch closed 2 days ago

jeutzsch commented 2 weeks ago

When generating a word form using $HLOOKUP $GTHOME/langs/lang-sje/src/fst/generator-gt-desc.hfstol, any character with a diacritic triggers two outputs: one with the single, precomposed Unicode character, and one with the decomposed sequence (base character + combining diacritic). For example:

gähppe+A+Sg+Nom
gähppe+A+Sg+Nom gähppe (precomposed output)
gähppe+A+Sg+Nom gähppe (base + combining output)

If there are two characters with diacritics, there are four outputs:

härrá+N+Sg+Nom
härrá+N+Sg+Nom härrá
härrá+N+Sg+Nom härrá
härrá+N+Sg+Nom härrá
härrá+N+Sg+Nom härrá
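The doubling can be illustrated outside the FST. The following is only a sketch in Python using the standard `unicodedata` module, not the generator's actual logic: the helper name `nf_variants` is hypothetical, and it assumes each character with a canonical decomposition can independently appear either precomposed or decomposed, which gives 2^n spellings for n such characters.

```python
import itertools
import unicodedata


def nf_variants(word: str) -> list[str]:
    """Return every mix of precomposed and decomposed spellings of `word`.

    Hypothetical helper for illustration; it is not part of giella-core or HFST.
    """
    options = []
    # Start from the fully precomposed (NFC) form so the input encoding does not matter.
    for ch in unicodedata.normalize("NFC", word):
        decomposed = unicodedata.normalize("NFD", ch)
        # A character without a canonical decomposition has only one spelling.
        options.append([ch] if decomposed == ch else [ch, decomposed])
    return ["".join(combo) for combo in itertools.product(*options)]


# One character with a diacritic: two spellings.
print(len(nf_variants("g\u00e4hppe")))
# Two characters with diacritics: four spellings.
print(len(nf_variants("h\u00e4rr\u00e1")))
```

All variants render identically on screen, which is why the outputs in the report above look like exact duplicates.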

I've tried modifying spellrelax.regex, but that didn't change anything.

snomos commented 2 weeks ago

@flammie could you have a look? Also @Trondtr?

flammie commented 2 days ago

Sorry, I forgot to push this earlier. The Unicode Normalisation Form filter is generated automatically nowadays, and we simply applied the relax filters to both desc automata. I think it makes sense not to have it on the generator, so I've removed it in giella-core.
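For anyone still running a generator built before that change, one possible stopgap is to normalise the generated forms to NFC and drop the duplicates downstream. This is a sketch only: `dedupe_nfc` is a hypothetical helper, it assumes the surface forms have already been extracted from the lookup output into a list of strings, and it assumes precomposed (NFC) output is what you want.

```python
import unicodedata


def dedupe_nfc(forms):
    """Normalise each form to NFC and keep the first occurrence of each result.

    Hypothetical helper for illustration; it is not part of giella-core or HFST.
    """
    seen = set()
    unique = []
    for form in forms:
        composed = unicodedata.normalize("NFC", form)
        if composed not in seen:
            seen.add(composed)
            unique.append(composed)
    return unique


# The precomposed and decomposed spellings collapse into a single form.
raw = ["g\u00e4hppe", "ga\u0308hppe"]
print(dedupe_nfc(raw))
```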