giellalt / bugzilla-dummy


problems with diacritic flags in hfst (Bugzilla Bug 1859) #1696

Closed albbas closed 10 years ago

albbas commented 10 years ago

This issue was created automatically with bugzilla2github

Bugzilla Bug 1859

Date: 2014-04-30T05:39:27+02:00
From: Lene Antonsen <>
To: Sjur Nørstebø Moshagen <>
CC: lene.antonsen, thomas.omma, tommi.pirinen, trond.trosterud

Last updated: 2014-06-11T12:44:23+02:00

albbas commented 10 years ago

Comment 9399

Date: 2014-04-30 05:39:27 +0200 From: Lene Antonsen <>

Created attachment 177 lexc-file

hfst gives an extra path that xfst does not:

crk$ dcrk
atim+N+AN+Obv
atim+N+AN+Obv atimwa

mistatim+N+AN+Obv
mistatim+N+AN+Obv mistatimwa

^C
crk$ hdcrk
atim+N+AN+Obv
atim+N+AN+Obv atima 0,000000 <=== this one should not be there
atim+N+AN+Obv atimwa 0,000000

mistatim+N+AN+Obv
mistatim+N+AN+Obv mistatima 0,000000 <=== this one should not be there
mistatim+N+AN+Obv mistatimwa 0,000000

crk$ alias hdcrk
alias hdcrk='$HLOOKUP $GTHOME/langs/crk/src/generator-gt-desc.hfst'
crk$ alias dcrk
alias dcrk='$LOOKUP $GTHOME/langs/crk/src/generator-gt-desc.xfst'

Both generators were compiled at the same time:

crk$ ll src/*fst
-rw-r--r-- 1 lan000 1907360568 13096 29 apr 16:13 src/generator-gt-desc.xfst
-rw-r--r-- 1 lan000 1907360568 78608 29 apr 16:13 src/analyser-gt-desc.hfst

I have not checked in the changes, but I can do so if necessary. Instead, I have put the relevant path into a file, which is attached. I hope I have included all the necessary parts.

Attached file: bznouns.lexc (application/octet-stream, 8487 bytes) Description: lexc-file
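The comparison in this comment can be automated. The following is a minimal sketch, not part of the original report: it assumes the output of the two lookup tools has been captured as text, and the helper name `parse_lookup` is hypothetical. The sample strings are the forms quoted above.

```python
def parse_lookup(text):
    """Collect generated forms per input from lookup-style output.

    Each useful line has the shape 'INPUT FORM [WEIGHT ...]'. The weight
    column (printed by hfst only) and any trailing annotation are ignored.
    """
    forms = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # blank line or the echoed input
        forms.setdefault(fields[0], set()).add(fields[1])
    return forms


# Sample data copied from the report above.
xfst_out = """\
atim+N+AN+Obv atimwa
mistatim+N+AN+Obv mistatimwa
"""
hfst_out = """\
atim+N+AN+Obv atima 0,000000
atim+N+AN+Obv atimwa 0,000000
mistatim+N+AN+Obv mistatima 0,000000
mistatim+N+AN+Obv mistatimwa 0,000000
"""

xfst_forms = parse_lookup(xfst_out)
hfst_forms = parse_lookup(hfst_out)

# Forms that hfst generates but xfst does not, per input string.
extra = {
    key: sorted(hfst_forms[key] - xfst_forms.get(key, set()))
    for key in hfst_forms
}
for key, forms in extra.items():
    print(key, forms)
```

For the data in this report, the only hfst-only forms are atima and mistatima.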

albbas commented 10 years ago

Comment 9404

Date: 2014-04-30 13:30:38 +0200 From: Lene Antonsen <>

crk$ hfst-info
HFST packaging: hfst 3.6.1
HFST version: 3.6.1
HFST long version: 300060001
HFST configuration revision: $Revision: 3721 $
OpenFst supported
SFST supported
foma supported
Unicode support: no (hfst)

crk$ xfst -v
xfst-2.13.2 (libcfsm-2.18.2) (svn 31774)

albbas commented 10 years ago

Comment 9410

Date: 2014-05-01 05:08:46 +0200 From: Lene Antonsen <>

I checked all the yaml tests for nouns in crk. xfst and hfst do not give the same result; hfst consistently comes out worse in that it has more fails (= more generated forms?).

  1. xfst: Total passes: 86, Total fails: 76, Total: 162
     hfst: Total passes: 86, Total fails: 87, Total: 173
  2. xfst: Total passes: 68, Total fails: 99, Total: 167
     hfst: Total passes: 68, Total fails: 116, Total: 184
  3. xfst: Total passes: 24, Total fails: 186, Total: 210 <=== here hfst comes out better
     hfst: Total passes: 26, Total fails: 204, Total: 230 <====
  4. xfst: Total passes: 68, Total fails: 103, Total: 171
     hfst: Total passes: 68, Total fails: 115, Total: 183
  5. xfst: Total passes: 0, Total fails: 32, Total: 32
     hfst: Total passes: 0, Total fails: 32, Total: 32
  6. xfst: Total passes: 118, Total fails: 4, Total: 122
     hfst: Total passes: 118, Total fails: 4, Total: 122
  7. xfst: Total passes: 54, Total fails: 136, Total: 190
     hfst: Total passes: 54, Total fails: 148, Total: 202
  8. xfst: Total passes: 36, Total fails: 164, Total: 200
     hfst: Total passes: 36, Total fails: 181, Total: 217
  9. xfst: Total passes: 110, Total fails: 20, Total: 130
     hfst: Total passes: 110, Total fails: 20, Total: 130
  10. xfst: Total passes: 64, Total fails: 110, Total: 174
      hfst: Total passes: 70, Total fails: 111, Total: 181
  11. xfst: Total passes: 72, Total fails: 95, Total: 167
      hfst: Total passes: 72, Total fails: 113, Total: 185
  12. xfst: Total passes: 54, Total fails: 128, Total: 182
      hfst: Total passes: 54, Total fails: 134, Total: 188
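To make the size of the gap easier to see, the totals listed above can be compared test by test. This is only an illustrative sketch: the (passes, fails) pairs are copied from the list, and the variable names are made up for the example.

```python
# (passes, fails) for each of the twelve yaml tests, copied from the list above.
xfst = [(86, 76), (68, 99), (24, 186), (68, 103), (0, 32), (118, 4),
        (54, 136), (36, 164), (110, 20), (64, 110), (72, 95), (54, 128)]
hfst = [(86, 87), (68, 116), (26, 204), (68, 115), (0, 32), (118, 4),
        (54, 148), (36, 181), (110, 20), (70, 111), (72, 113), (54, 134)]

# Per-test difference (hfst minus xfst) in passes and fails.
for number, ((xp, xf), (hp, hf)) in enumerate(zip(xfst, hfst), start=1):
    print(f"{number:2d}  passes {hp - xp:+d}  fails {hf - xf:+d}")

extra_fails = sum(hf - xf for (_, xf), (_, hf) in zip(xfst, hfst))
extra_passes = sum(hp - xp for (xp, _), (hp, _) in zip(xfst, hfst))
print("additional fails with hfst:", extra_fails)
print("additional passes with hfst:", extra_passes)
```

With these figures hfst reports 112 more fails and 8 more passes than xfst in total. Only tests 5, 6 and 9 are identical in both tools.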
albbas commented 10 years ago

Comment 9411

Date: 2014-05-01 21:40:55 +0200 From: Trond Trosterud <>

The attached file works fine in xfst. To repeat:

xfst -e "read lexc < bznouns.lexc"
down atim+N+AN+Obv

etc., and I get the same results as Lene.

But when doing the same in hfst (note the different syntax), I run into trouble:

hfst-xfst
read lexc bznouns.lexc
down atim+N+AN+Obv

Instead of getting the expected double forms I get ???

And with random-upper and random-lower I get:

hfst[1]: random-upper
atim+N+AN@0@+Obv
mistatim+N+AN@0@+Obv
hfst[1]: random-lower
atim@0@@0@wa
mistatim@0@@0@wa

Now, this may be due to my lack of familiarity with hfst-xfst.

If Sjur or others with more knowledge of hfst can reproduce Lene's results, please report.

If not, I suggest that Lene attaches her version of the nouns.lexc file, so that we can put it in the appropriate directory and test there.

albbas commented 10 years ago

Comment 9412

Date: 2014-05-01 22:18:21 +0200 From: Trond Trosterud <>

Now I was able to repeat the test with Lene's source code.

Here is what I did. For xfst, I read the file bznouns.lexc (the file attached to this bug), inverted it, and saved it as ix:

xfst -e "read lexc < bznouns.lexc"
invert net
save ix

Since hfst reads its lexc files "upside down", I did not invert here, but did the following:

hfst-xfst
read lexc bznouns.lexc
save h

Then I generated both forms in both transducers:

$ echo mistatim+N+AN+Obv | hfst-lookup -q h
mistatim+N+AN+Obv mistatimwa 0.000000

$ echo mistatim+N+AN+Obv | lookup -q ix
mistatim+N+AN+Obv mistatimwa

$ echo atim+N+AN+Obv | hfst-lookup -q h
atim+N+AN+Obv atimwa 0.000000

$ echo atim+N+AN+Obv | lookup -q ix
atim+N+AN+Obv atimwa

So, the mysterious thing here is that I am not able to reproduce Lene's results. On the contrary, I get the two transducers to behave identically.

albbas commented 10 years ago

Comment 9413

Date: 2014-05-02 06:48:34 +0200 From: Sjur Nørstebø Moshagen <>

(In reply to comment #4)

So, the mysterious thing here is that I am not able to reproduce Lene's results. On the contrary, I get the two transducers to behave identically.

This fits well with my suspicion that the source of the difference is the interpretation and handling of rule conflicts in twolc. Your test setup involved only lexc, and thus you get identical behaviour.

We know that some types of twolc conflicts are not flagged or marked at all by Xerox, but are flagged by hfst, and that such conflicts are resolved (or not) in different ways by the two. The best approach to this problem is probably to take a thorough look at the twolc output (cd src/phonology/; make clean; make V=1), and work on the rules until all conflicts are resolved manually.

albbas commented 10 years ago

Comment 9432

Date: 2014-05-27 15:37:45 +0200 From: Lene Antonsen <>

This issue has become relevant again after the latest yaml fix, because the yaml tests now warn about overgeneration, and there is a lot of it with hfst.

I commented out the whole twolc file except for the very first rule, which only applies to verbs (I have a dummy from the verb file):

"h glottal stop for initial vowel stems in Conjunctive" !! @RULENAME@
%^EGLOT:h <=> _ %>:0 Vow: ;

make clean
make

crk$ hdcrk
amisk+N+AN+Pl
amisk+N+AN+Pl amisk 0,000000 <=====
amisk+N+AN+Pl amiskak 0,000000

^C
crk$ dcrk
amisk+N+AN+Pl
amisk+N+AN+Pl amiskak

Could it be the suffix marker that hfst treats differently from xfst?

My tool versions are listed further up in this bug.

albbas commented 10 years ago

Comment 9433

Date: 2014-05-28 09:13:28 +0200 From: Sjur Nørstebø Moshagen <>

This is definitely caused by differences in conflict handling in twolc parsing between Hfst and Xerox. But I am not so sure that Xerox is to blame anymore. Here are the conflicts as detected by HFST:

There is a =>-rule conflict between "Suffix vowel deletion in vowel final stems SUBCASE: Vx=i" and "i:0 after w/y ".
There is a =>-rule conflict between "Suffix vowel deletion in vowel final stems SUBCASE: Vx=o" and "o:0 in possessive prefix".
There is a =>-rule conflict between "Double consonant deletion SUBCASE: Cx=s" and "Diminutives rule change ending to os with k-final stems 1".
There is a =>-rule conflict between "locative alternations o" and "Diminutives rule change ending to os with k-final stems 2".
There is a =>-rule conflict between "Suffix vowel deletion in vowel final stems SUBCASE: Vx=i" and "i:0 after w/y " and "Diminutives rule change ending delete i with nouns ending in kwa".

And here are the conflicts as detected by Xerox - each conflict is prefixed with the corresponding conflict in HFST:

2 - >>> Resolving a => conflict with respect to 'o:0' between "Suffix vowel deletion in vowel final stems" and "o:0 in possessive prefix"
1/5 - >>> Resolving a => conflict with respect to 'i:0' between "Suffix vowel deletion in vowel final stems" and "i:0 after w/y "
5 - >>> Resolving a => conflict with respect to 'i:0' between "Suffix vowel deletion in vowel final stems" and "Diminutives rule change ending delete i with nouns ending in kwa"
4 - >>> Resolving a => conflict with respect to 'i:o' between "locative alternations o" and "Diminutives rule change ending to os with k-final stems 2"
1/5 - >>> Resolving a => conflict with respect to 'i:0' between "i:0 after w/y " and "Diminutives rule change ending delete i with nouns ending in kwa"
0 - >>> Resolving a => conflict with respect to 'w:0 | y:0' between "w/y:0 in front of suffixes" and "Double consonant deletion"
3 - >>> Resolving a => conflict with respect to 's:0' between "Double consonant deletion" and "Diminutives rule change ending to os with k-final stems 1"

As can be seen above, three conflicts in Xerox are treated as two conflicts in Hfst, and one conflict is not detected at all (prefixed with 0/zero).
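The mapping between the two conflict lists can be cross-checked mechanically. The sketch below is an illustration under two assumptions that are not stated in the logs: that the "SUBCASE: ..." suffix in the HFST messages can be stripped to obtain the rule name used by Xerox, and that a three-way HFST conflict can be read as a conflict between every pair of the rules it names. The rule names are copied from the two listings above.

```python
from itertools import combinations


def base(name):
    """Rule name without the HFST 'SUBCASE: ...' suffix or stray spaces."""
    return name.split(" SUBCASE:")[0].strip()


def pairs(conflicts):
    """Expand each conflict into the set of unordered rule pairs it involves."""
    result = set()
    for rules in conflicts:
        names = [base(rule) for rule in rules]
        result.update(frozenset(pair) for pair in combinations(names, 2))
    return result


SVD = "Suffix vowel deletion in vowel final stems"
DCD = "Double consonant deletion"
DIM1 = "Diminutives rule change ending to os with k-final stems 1"
DIM2 = "Diminutives rule change ending to os with k-final stems 2"
KWA = "Diminutives rule change ending delete i with nouns ending in kwa"

hfst_conflicts = [
    (SVD + " SUBCASE: Vx=i", "i:0 after w/y "),
    (SVD + " SUBCASE: Vx=o", "o:0 in possessive prefix"),
    (DCD + " SUBCASE: Cx=s", DIM1),
    ("locative alternations o", DIM2),
    (SVD + " SUBCASE: Vx=i", "i:0 after w/y ", KWA),
]
xerox_conflicts = [
    (SVD, "o:0 in possessive prefix"),
    (SVD, "i:0 after w/y "),
    (SVD, KWA),
    ("locative alternations o", DIM2),
    ("i:0 after w/y ", KWA),
    ("w/y:0 in front of suffixes", DCD),
    (DCD, DIM1),
]

only_xerox = pairs(xerox_conflicts) - pairs(hfst_conflicts)
only_hfst = pairs(hfst_conflicts) - pairs(xerox_conflicts)
for pair in only_xerox:
    print("reported by Xerox only:", sorted(pair))
print("reported by HFST only:", len(only_hfst))
```

Under these assumptions the only pair that Xerox reports and HFST does not is the one between "w/y:0 in front of suffixes" and "Double consonant deletion", which is the conflict prefixed with 0 above.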

This definitely looks like a bug in the twolc compilation in Hfst, and should be resolved there. In the meantime, the best solution to make Hfst and Xerox behave the same, is to rewrite the rule contexts such that there are no conflicts at all - that is, resolve the conflicts by hand.

For future reference, this output was produced with the following source code revisions and tool versions:

$ $GTCORE/scripts/gt-version.sh
0.2.13-94833

$ svn info
Path: /Users/smo036/langtech/main/langs/crk
Working Copy Root Path: /Users/smo036/langtech/main
URL: https://victorio.uit.no/langtech/trunk/langs/crk
Repository Root: https://victorio.uit.no/langtech
Repository UUID: c7155fb1-f0a7-4240-a2fc-2600b6f42f90
Revision: 94920
Node Kind: directory
Schedule: normal
Last Changed Author: lene
Last Changed Rev: 94918
Last Changed Date: 2014-05-28 05:17:06 +0000 (ons, 28 mai 2014)

$ hfst-twolc --version

hfst-twolc 0 (hfst 3.7.0)
Copyright (C) 2010 University of Helsinki,
License GPLv3: GNU GPL version 3 http://gnu.org/licenses/gpl.html
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

$ hfst-info
No tests selected; printing known data
HFST info version: 0.1
HFST packaging: hfst 3.7.0
HFST version: 3.7.0
HFST long version: 300070000
HFST configuration revision: $Revision: 3859 $
OpenFst supported
SFST supported
Unicode support: no (hfst)

$ twolc -v
twolc-3.4.13 (2.25.11)

albbas commented 10 years ago

Comment 9434

Date: 2014-05-28 09:24:37 +0200 From: Sjur Nørstebø Moshagen <>

The Hfst bug is reported to the Hfst team as https://sourceforge.net/p/hfst/bugs/245/.

albbas commented 10 years ago

Comment 9435

Date: 2014-05-28 09:31:56 +0200 From: Lene Antonsen <>

(In reply to comment #7)

This is definitely caused by differences in conflict handling in twolc parsing between Hfst and Xerox. But I am not so sure that Xerox is to blame anymore. Here are the conflicts as detected by HFST:

A reminder of what happens when I comment out almost all the twol rules (keeping one for the sake of compilation) and run make clean before make.

I have entirely fresh FSTs:

-rw-r--r-- 1 lan000 1907360568 105481 28 mai 01:29 src/generator-gt-norm.hfst
-rw-r--r-- 1 lan000 1907360568 13633 28 mai 01:29 src/generator-gt-norm.xfst

And still:

crk$ dcrk
amisk+N+AN+Pl
amisk+N+AN+Pl amiskak

^C
crk$ hdcrk
amisk+N+AN+Pl
amisk+N+AN+Pl amisk 0,000000
amisk+N+AN+Pl amiskak 0,000000

albbas commented 10 years ago

Comment 9436

Date: 2014-05-28 10:42:04 +0200 From: Sjur Nørstebø Moshagen <>

(In reply to comment #9)

A reminder of what happens when I comment out almost all the twol rules (keeping one for the sake of compilation) and run make clean before make: [...]

crk$ dcrk
amisk+N+AN+Pl
amisk+N+AN+Pl amiskak

^C
crk$ hdcrk
amisk+N+AN+Pl
amisk+N+AN+Pl amisk 0,000000
amisk+N+AN+Pl amiskak 0,000000

This difference comes from LexC, although I cannot explain why:

$ hfst-lookup -q src/morphology/crk.lexc.hfst
amisk+N+AN+Pl
amisk+N+AN+Pl >amisk 0,000000
amisk+N+AN+Pl >amisk>ak 0,000000

amisk+N+AN+Sg
amisk+N+AN+Sg >amisk 0,000000
amisk+N+AN+Sg >amisk>ak 0,000000

$ lookup -q src/morphology/crk.lexc.xfst

amisk
amisk amisk +N+AN+Sg

amisk>ak
amisk>ak amisk +N+AN+Pl

albbas commented 10 years ago

Comment 9437

Date: 2014-05-28 10:43:44 +0200 From: Sjur Nørstebø Moshagen <>

(In reply to comment #10)

This difference comes from LexC, although I cannot explain why:

Are flag diacritics involved in the number inflection of amisk?

albbas commented 10 years ago

Comment 9438

Date: 2014-05-28 11:16:14 +0200 From: Trond Trosterud <>

Are flag diacritics involved in the number inflection of amisk?

stems/nouns.lexc:

LEXICON AN-IN
 @U.noun.abs@ STEMS ;
 < 0:n 0:i "@U.noun.1sg@" 0:"t2" > STEMS ; ! 1
 < 0:k 0:i "@U.noun.2sg@" 0:"t2" > STEMS ; ! 2
...
LEXICON STEMS !! @LEXNAME@ add a affixmark and redirects to STEMLIST
 0:%> STEMLIST ;

LEXICON STEMLIST !! @LEXNAME@ for nouns getting prefixes ni-, ki-, o-
 amisk ANimDECL "beaver" ; !yaml ...

The adventure continues in affixes/nouns.lexc:

LEXICON ANABSDECL !!= * @CODE@ for the animate absolute declension
 < "+N":0 "+AN":0 "+Sg":0 "@U.noun.abs@" > SG_ ; !
 < "+N":0 "+AN":0 "@U.noun.abs@" > OBVIATIVE ; !
 < "+N":0 "+AN":0 "+Pl":0 "@U.noun.abs@" > PLak ; !
 < "+N":0 "+AN":0 "@U.noun.abs@" > LOC ; !
 < "+N":i "+AN":n "@U.noun.abs@" > LOCahk ; !

It works like this:

All nouns can take Px, and most of them can also have absolute (Px-less) inflection. To tell the two apart we use flags. So all nouns (and verbs, for that matter) have flags.
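As a rough illustration of how such unification flags filter paths, here is a small model in Python. It is a deliberately simplified sketch and not how hfst or xfst implement flag diacritics: it handles only U flags of the form @U.FEATURE.VALUE@, and the second example path is a made-up string that combines a 1sg prefix flag with the absolute-declension flag.

```python
import re

# Matches a unification flag such as @U.noun.abs@ and captures feature and value.
U_FLAG = re.compile(r"@U\.([^.@]+)\.([^.@]+)@")


def flags_unify(path):
    """Return True if every U flag on the path agrees on the value of its feature."""
    seen = {}
    for feature, value in U_FLAG.findall(path):
        # The first flag for a feature fixes its value; later ones must match it.
        if seen.setdefault(feature, value) != value:
            return False
    return True


def surface(path):
    """The path with all U flags removed."""
    return U_FLAG.sub("", path)


absolute = "@U.noun.abs@amisk@U.noun.abs@"      # prefixless stem + absolute ending
mixed = "ni@U.noun.1sg@amisk@U.noun.abs@"       # hypothetical: 1sg prefix + absolute ending

for path in (absolute, mixed):
    verdict = "accepted" if flags_unify(path) else "rejected"
    print(surface(path), verdict)
```

In this model the absolute path is accepted because both flags set noun to abs, while the mixed path is rejected because noun would have to be both 1sg and abs.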

albbas commented 10 years ago

Comment 9440

Date: 2014-05-29 19:10:54 +0200 From: Lene Antonsen <>

I have been experimenting, and my theory is that hfst confuses the paths when the same flag diacritic is used in more than one path.

This lexicon remains the same in the experiments:

LEXICON NONLOCahk !!= * @CODE@ for the animate absolute except LOC on ahk
 +Sg: SG ;
 OBVIATIVE ;
 +Pl: PLak ;
 LOC ;

1)
LEXICON ANABSDECL !!= * @CODE@ for the animate absolute declension
 < "+N":0 "+AN":0 "@U.noun.abs@" > NON_LOCahk ; !
 < "+N":i "+AN":n "@U.noun.abs@" > LOCahk ; !

crk$ hdcrk
amisk+N+AN+Sg
amisk+N+AN+Sg amisk 0,000000
amisk+N+AN+Sg amiskin 0,000000 <==== -in comes from the other path in the ANABSDECL lexicon!

crk$ dcrk
amisk+N+AN+Sg
amisk+N+AN+Sg amisk

2)
LEXICON ANABSDECL !!= * @CODE@ for the animate absolute declension
 @U.noun.abs@ DECL ;

LEXICON DECL
 +N+AN: NON_LOCahk ; !
 +N+AN:in LOCahk ; !

crk$ hdcrk
amisk+N+AN+Sg
amisk+N+AN+Sg amisk 0,000000

^C
crk$ dcrk
amisk+N+AN+Sg
amisk+N+AN+Sg amisk

The problem does not occur in the following lexicon, because no flag is used in more than one path:

LEXICON ANSUFFSG !!= * @CODE@
 < "+Px1Sg":%^POS "@U.noun.1sg@" > SG ; !
 < "+Px2Sg":%^POS "@U.noun.2sg@" > SG ; !
 < "+Px3Sg":%^POS "@U.noun.3sg@" > OBVIATIVE ; !
 < "+Px4Sg":%^POS 0:i 0:y 0:i 0:w "@U.noun.3isg@" > OBVIATIVE ; !
 < "+Px1Pl":%^POS 0:i 0:n 0:â 0:n "@U.noun.1pl@" > SG ; ! exclusive Pl -nân not -inân CHECK ??
 <"+Px12Pl":%^POS 0:i 0:n 0:a 0:w "@U.noun.12pl@" > SG ; ! inclusive Pl CHECK OK?
 < "+Px2Pl":%^POS 0:i 0:w 0:â 0:w "@U.noun.2pl@" > SG ; !
 < "+Px3Pl":%^POS 0:i 0:w 0:â 0:w "@U.noun.3pl@" > OBVIATIVE ; !
 < "+Px4Pl":%^POS 0:i 0:y 0:i 0:w "@U.noun.3ipl@" > OBVIATIVE ; ! obviative plural possessor - Okimasis corrected

albbas commented 10 years ago

Comment 9441

Date: 2014-05-31 17:31:53 +0200 From: Lene Antonsen <>

To illustrate my theory, the continuation lexica IICONJ and IICONJw are done differently for verbs in crk. Hfst overgenerates less for IICONJ than for IICONJw.

Try it out:

cat test/data/VII-par.txt | sed 's/^/mihkwâw/' | hdcrk |l
cat test/data/VII-par.txt | sed 's/^/mihkwâw/' | dcrk |l
cat test/data/VII-par.txt | sed 's/^/miywâsin/' | dcrk |l
cat test/data/VII-par.txt | sed 's/^/miywâsin/' | hdcrk |l

To get rid of all overgeneration in hfst, we cannot use the same prefix lexicon for Cnj and Indep, because hfst does not allow the same flag diacritic on different paths. This is a bug in hfst, and should be fixed. If it is not, we have to make a separate prefix lexicon for each path, which means quite a lot of them. Until then, we'll use only xfst for generation and analysis.

albbas commented 10 years ago

Comment 9442

Date: 2014-05-31 17:51:51 +0200 From: Lene Antonsen <>

(In reply to comment #14)

Try it out:

cat test/data/VII-par.txt | sed 's/^/mihkwâw/' | hdcrk |l
cat test/data/VII-par.txt | sed 's/^/mihkwâw/' | dcrk |l
cat test/data/VII-par.txt | sed 's/^/miywâsin/' | dcrk |l
cat test/data/VII-par.txt | sed 's/^/miywâsin/' | hdcrk |l

Or one can look at the yaml tests for these two verbs, both of which pass with xfst but not with hfst. Be aware that the other yaml tests for verbs in crk have not been corrected yet. Both tag strings and word forms have to be corrected.

albbas commented 10 years ago

Comment 9443

Date: 2014-05-31 22:22:17 +0200 From: Lene Antonsen <>

I've tried out different combinations of flags, e.g. "@D.mood.cnj@" (disallow), but it does not work with hfst.

albbas commented 10 years ago

Comment 9444

Date: 2014-06-02 09:05:59 +0200 From: Sjur Nørstebø Moshagen <>

The flag diacritics bug is now reported to the Hfst team as https://sourceforge.net/p/hfst/bugs/247/.

albbas commented 10 years ago

Comment 9445

Date: 2014-06-02 09:18:21 +0200 From: Sjur Nørstebø Moshagen <>

Summary for the Hfst team:

This bug report relating to Plains Cree seems to have detected two Hfst bugs:

  1. a possible bug in twolc compilation (see comment #7 and http://sourceforge.net/p/hfst/bugs/245/)
  2. a bug in the handling of flag diacritics (see most other comments and http://sourceforge.net/p/hfst/bugs/247/)
albbas commented 10 years ago

Comment 9456

Date: 2014-06-03 15:54:26 +0200 From: Sjur Nørstebø Moshagen <>

Last comments from IRC:

[4:24pm] meriponi: i'm running: hfst-lookup --xfst=show-flags ././../../src/generator-gt-norm.hfst
[4:24pm] meriponi: and get:
[4:24pm] meriponi: > amisk+N+AN+Sg
[4:24pm] meriponi: amisk+N+AN+Sg @U.noun.abs@amisk@U.noun.abs@ 0.000000
[4:24pm] meriponi: amisk+N+AN+Sg @U.noun.abs@amiskin@U.noun.abs@ 0.000000

(meriponi = one of the hfst guys)

Then from me:

[4:42pm] sjnomos: $ xfst -s src/generator-gt-norm.xfst
[4:42pm] sjnomos: xfst[1]: set show-flags ON
[4:42pm] sjnomos: variable show-flags = ON
[4:42pm] sjnomos: xfst[1]: up amisk+N+AN+Sg
[4:42pm] sjnomos: @U.noun.abs@amisk@U.noun.abs@

That is, correct behaviour with the flags in both fst's, but wrong flag in one case in hfst. The suspicion now goes to the LexC parser in hfst. The bug hunting continues...
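The reasoning in the last paragraph can be checked against the strings from the IRC log. The sketch below is only an illustration with made-up helper names, and it covers U flags only. It shows that all three flagged strings pass a simple unification check, so the extra hfst form cannot be filtered out at lookup time and must already be wrong in the compiled lexicon.

```python
import re

ANY_FLAG = re.compile(r"@[A-Z]\.[^@]+@")          # any flag diacritic, for stripping
U_FLAG = re.compile(r"@U\.([^.@]+)\.([^.@]+)@")   # unification flags only


def consistent(path):
    """True if no feature receives two different values from U flags."""
    values = {}
    for feature, value in U_FLAG.findall(path):
        values.setdefault(feature, set()).add(value)
    return all(len(found) == 1 for found in values.values())


# Flagged output for amisk+N+AN+Sg, copied from the IRC log above.
hfst_paths = ["@U.noun.abs@amisk@U.noun.abs@", "@U.noun.abs@amiskin@U.noun.abs@"]
xfst_paths = ["@U.noun.abs@amisk@U.noun.abs@"]

all_consistent = all(consistent(path) for path in hfst_paths + xfst_paths)
hfst_surface = {ANY_FLAG.sub("", path) for path in hfst_paths}
xfst_surface = {ANY_FLAG.sub("", path) for path in xfst_paths}

print("all paths pass the flag check:", all_consistent)
print("hfst-only forms:", sorted(hfst_surface - xfst_surface))
```

The flag check succeeds for every path, and amiskin is the only form that hfst produces in addition to xfst.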

albbas commented 10 years ago

Comment 9481

Date: 2014-06-11 12:44:23 +0200 From: Sjur Nørstebø Moshagen <>

This bug is fixed in the hfst code, both the lexc handling of flag diacritics with regex brackets <>, and the twolc inconsistencies compared to Xerox. That is, hfst transducers now behave as they should:

$ echo "amisk+N+AN+Pl" | hfst-lookup -p -q generator-raw-gt-desc.hfst
amisk+N+AN+Pl amiskwak 0,000000

(no "amisk" anymore in the generated output)

Unfortunately, I have broken the hfst build in the new infra, so we still can't build the hfst transducers. I am working on that.

I mark this bug as fixed.