Date: 2014-04-30 05:39:27 +0200
From: Lene Antonsen <
Created attachment 177 lexc-file
hfst gives an extra path that xfst does not give:
crk$ dcrk
atim+N+AN+Obv
atim+N+AN+Obv atimwa
mistatim+N+AN+Obv
mistatim+N+AN+Obv mistatimwa
^C
crk$ hdcrk
atim+N+AN+Obv
atim+N+AN+Obv atima 0,000000 <=== this one should not be there
atim+N+AN+Obv atimwa 0,000000
mistatim+N+AN+Obv
mistatim+N+AN+Obv mistatima 0,000000 <=== this one should not be there
mistatim+N+AN+Obv mistatimwa 0,000000
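The comparison above can be automated. The following is only a sketch and not part of the project tooling: it assumes the output of each lookup tool has been saved as text without shell prompts, and that every result line has the shape `input output` (Xerox lookup) or `input output weight` (hfst-lookup), as in the transcript. The function names `parse_lookup` and `extra_forms` are made up for illustration.

```python
from collections import defaultdict


def parse_lookup(text):
    """Map each input string to the set of surface forms printed for it.

    Assumption: result lines have at least two whitespace-separated
    fields (input, output, optional weight). Lines with fewer fields,
    such as the echoed input or blank lines, are skipped.
    """
    results = defaultdict(set)
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 2:
            results[fields[0]].add(fields[1])
    return results


def extra_forms(hfst_text, xfst_text):
    """Return {input: forms generated by hfst but not by xfst}."""
    hfst = parse_lookup(hfst_text)
    xfst = parse_lookup(xfst_text)
    extra = {}
    for key, forms in hfst.items():
        only_hfst = forms - xfst.get(key, set())
        if only_hfst:
            extra[key] = only_hfst
    return extra


xfst_out = "atim+N+AN+Obv\natim+N+AN+Obv\tatimwa\n"
hfst_out = (
    "atim+N+AN+Obv\n"
    "atim+N+AN+Obv\tatima\t0,000000\n"
    "atim+N+AN+Obv\tatimwa\t0,000000\n"
)
print(extra_forms(hfst_out, xfst_out))  # {'atim+N+AN+Obv': {'atima'}}
```

Run on the full noun paradigm, a helper like this would list exactly the forms marked with `<===` above.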
crk$ alias hdcrk
alias hdcrk='$HLOOKUP $GTHOME/langs/crk/src/generator-gt-desc.hfst'
crk$ alias dcrk
alias dcrk='$LOOKUP $GTHOME/langs/crk/src/generator-gt-desc.xfst'
Both generators were compiled at the same time:
crk$ ll src/*fst
-rw-r--r-- 1 lan000 1907360568 13096 29 apr 16:13 src/generator-gt-desc.xfst
-rw-r--r-- 1 lan000 1907360568 78608 29 apr 16:13 src/analyser-gt-desc.hfst
I have not checked in the changes, but I can do so if necessary. Instead, I have put the path into a file, which is attached. I hope I have included all the necessary parts.
Attached file: bznouns.lexc (application/octet-stream, 8487 bytes) Description: lexc-file
Date: 2014-04-30 13:30:38 +0200
From: Lene Antonsen <
crk$ hfst-info
HFST packaging: hfst 3.6.1
HFST version: 3.6.1
HFST long version: 300060001
HFST configuration revision: $Revision: 3721 $
OpenFst supported
SFST supported
foma supported
Unicode support: no (hfst)
crk$ xfst -v
xfst-2.13.2 (libcfsm-2.18.2) (svn 31774)
Date: 2014-05-01 05:08:46 +0200
From: Lene Antonsen <
I checked all the yaml tests for nouns in crk. xfst and hfst do not give the same result; hfst consistently comes out worse, with more FAILs (= more generated forms?).
Date: 2014-05-01 21:40:55 +0200
From: Trond Trosterud <
The attached file works fine in xfst. To repeat:
xfst -e "read lexc < bznouns.lexc"
down atim+N+AN+Obv
etc., and I get the same results as Lene.
But when doing the same in hfst (note the different syntax), I run into trouble:
hfst-xfst
read lexc bznouns.lexc
down atim+N+AN+Obv
Instead of getting the expected double forms I get ???
And with random-upper and random-lower I get:
hfst[1]: random-upper
atim+N+AN@0@+Obv
mistatim+N+AN@0@+Obv
hfst[1]: random-lower
atim@0@@0@wa
mistatim@0@@0@wa
Now, this may be due to my lack of familiarity with hfst-xfst.
If Sjur or others with more knowledge of hfst can reproduce Lene's results, please report.
If not, I suggest Lene attaches her version of the nouns.lexc file, so that we can put it in the appropriate catalogue and test there.
Date: 2014-05-01 22:18:21 +0200
From: Trond Trosterud <
Now I was able to repeat the test with Lene's source code.
Here is what I did: For xfst, I read the file bznouns.lexc (the file attached to this bug), inverted it, and saved it as ix:
xfst -e "read lexc < bznouns.lexc"
invert net
save ix
Since hfst reads lexc files "upside down", I did not invert here, but did the following:
hfst-xfst
read lexc bznouns.lexc
save h
Then I generated both forms in both transducers:
$ echo mistatim+N+AN+Obv | hfst-lookup -q h
mistatim+N+AN+Obv mistatimwa 0.000000
$ echo mistatim+N+AN+Obv | lookup -q ix
mistatim+N+AN+Obv mistatimwa
$ echo atim+N+AN+Obv | hfst-lookup -q h
atim+N+AN+Obv atimwa 0.000000
$ echo atim+N+AN+Obv | lookup -q ix
atim+N+AN+Obv atimwa
So the mysterious thing here is that I am not able to reproduce Lene's results. On the contrary, I get the two transducers to behave identically.
Date: 2014-05-02 06:48:34 +0200
From: Sjur Nørstebø Moshagen <
(In reply to comment #4)
So the mysterious thing here is that I am not able to reproduce Lene's results. On the contrary, I get the two transducers to behave identically.
This fits well with my suspicion that the source of the difference is the interpretation and handling of rule conflicts in twolc. Your test setup involved only lexc, and thus you get identical behaviour.
We know that some types of twolc conflicts are not flagged or marked at all by Xerox, but are flagged by hfst, and that such conflicts are resolved (or not) in different ways by the two. The best approach to this problem is probably to take a thorough look at the twolc output (cd src/phonology/; make clean; make V=1), and work on the rules till all conflicts are resolved manually.
Date: 2014-05-27 15:37:45 +0200
From: Lene Antonsen <
This issue has become relevant again after the latest yaml fix, because yaml now warns about overgeneration, which is substantial with hfst.
I commented out the whole twolc file except for the very first rule, which applies only to verbs (it has a dummy from the verb file):
"h glottal stop for initial vowel stems in Conjunctive" !! @RULENAME@
%^EGLOT:h <=> _ %>:0 Vow: ;
make clean
make
crk$ hdcrk
amisk+N+AN+Pl
amisk+N+AN+Pl amisk 0,000000 <=====
amisk+N+AN+Pl amiskak 0,000000
^C
crk$ dcrk
amisk+N+AN+Pl
amisk+N+AN+Pl amiskak
Could it be the suffix marker that hfst treats differently from xfst?
My versions are further up in the bug.
Date: 2014-05-28 09:13:28 +0200
From: Sjur Nørstebø Moshagen <
This is definitely caused by differences in conflict handling in twolc parsing between Hfst and Xerox. But I am not so sure that Xerox is to blame anymore. Here are the conflicts as detected by HFST:
There is a =>-rule conflict between "Suffix vowel deletion in vowel final stems SUBCASE: Vx=i" and "i:0 after w/y ".
There is a =>-rule conflict between "Suffix vowel deletion in vowel final stems SUBCASE: Vx=o" and "o:0 in possessive prefix".
There is a =>-rule conflict between "Double consonant deletion SUBCASE: Cx=s" and "Diminutives rule change ending to os with k-final stems 1".
There is a =>-rule conflict between "locative alternations o" and "Diminutives rule change ending to os with k-final stems 2".
There is a =>-rule conflict between "Suffix vowel deletion in vowel final stems SUBCASE: Vx=i" and "i:0 after w/y " and "Diminutives rule change ending delete i with nouns ending in kwa".
And here are the conflicts as detected by Xerox - each conflict is prefixed with the corresponding conflict in HFST:
2 - >>> Resolving a => conflict with respect to 'o:0' between "Suffix vowel deletion in vowel final stems" and "o:0 in possessive prefix"
1/5 - >>> Resolving a => conflict with respect to 'i:0' between "Suffix vowel deletion in vowel final stems" and "i:0 after w/y "
5 - >>> Resolving a => conflict with respect to 'i:0' between "Suffix vowel deletion in vowel final stems" and "Diminutives rule change ending delete i with nouns ending in kwa"
4 - >>> Resolving a => conflict with respect to 'i:o' between "locative alternations o" and "Diminutives rule change ending to os with k-final stems 2"
1/5 - >>> Resolving a => conflict with respect to 'i:0' between "i:0 after w/y " and "Diminutives rule change ending delete i with nouns ending in kwa"
0 - >>> Resolving a => conflict with respect to 'w:0 | y:0' between "w/y:0 in front of suffixes" and "Double consonant deletion"
3 - >>> Resolving a => conflict with respect to 's:0' between "Double consonant deletion" and "Diminutives rule change ending to os with k-final stems 1"
As can be seen above, three conflicts in Xerox are treated as two conflicts in Hfst, and one conflict is not detected at all (prefixed with 0/zero).
This definitely looks like a bug in the twolc compilation in Hfst, and should be resolved there. In the meantime, the best solution to make Hfst and Xerox behave the same, is to rewrite the rule contexts such that there are no conflicts at all - that is, resolve the conflicts by hand.
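The manual matching of the two conflict lists can be cross-checked mechanically. The sketch below is an illustration only, not an existing tool: it assumes one conflict per line, takes the double-quoted strings as rule names, drops the `SUBCASE: ...` suffix that only HFST prints, and (an assumption about how to read the message) expands an HFST line naming three rules into all of its pairs.

```python
import re
from itertools import combinations


def conflict_pairs(log):
    """Return the set of unordered rule-name pairs named in a conflict log."""
    pairs = set()
    for line in log.splitlines():
        names = [
            re.sub(r"\s+SUBCASE:.*", "", name).strip()
            for name in re.findall(r'"([^"]*)"', line)
        ]
        # A line naming three rules contributes all three pairs (assumption).
        pairs.update(frozenset(pair) for pair in combinations(names, 2))
    return pairs


HFST_LOG = '''
There is a =>-rule conflict between "Suffix vowel deletion in vowel final stems SUBCASE: Vx=i" and "i:0 after w/y ".
There is a =>-rule conflict between "Suffix vowel deletion in vowel final stems SUBCASE: Vx=o" and "o:0 in possessive prefix".
There is a =>-rule conflict between "Double consonant deletion SUBCASE: Cx=s" and "Diminutives rule change ending to os with k-final stems 1".
There is a =>-rule conflict between "locative alternations o" and "Diminutives rule change ending to os with k-final stems 2".
There is a =>-rule conflict between "Suffix vowel deletion in vowel final stems SUBCASE: Vx=i" and "i:0 after w/y " and "Diminutives rule change ending delete i with nouns ending in kwa".
'''

XEROX_LOG = '''
>>> Resolving a => conflict with respect to 'o:0' between "Suffix vowel deletion in vowel final stems" and "o:0 in possessive prefix"
>>> Resolving a => conflict with respect to 'i:0' between "Suffix vowel deletion in vowel final stems" and "i:0 after w/y "
>>> Resolving a => conflict with respect to 'i:0' between "Suffix vowel deletion in vowel final stems" and "Diminutives rule change ending delete i with nouns ending in kwa"
>>> Resolving a => conflict with respect to 'i:o' between "locative alternations o" and "Diminutives rule change ending to os with k-final stems 2"
>>> Resolving a => conflict with respect to 'i:0' between "i:0 after w/y " and "Diminutives rule change ending delete i with nouns ending in kwa"
>>> Resolving a => conflict with respect to 'w:0 | y:0' between "w/y:0 in front of suffixes" and "Double consonant deletion"
>>> Resolving a => conflict with respect to 's:0' between "Double consonant deletion" and "Diminutives rule change ending to os with k-final stems 1"
'''

# Conflicts that Xerox reports but HFST does not.
only_xerox = conflict_pairs(XEROX_LOG) - conflict_pairs(HFST_LOG)
for pair in sorted(only_xerox, key=sorted):
    print(" <-> ".join(sorted(pair)))
# Double consonant deletion <-> w/y:0 in front of suffixes
```

Under these assumptions the only pair left over is the one prefixed with 0 in the list above.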
For future reference, this output was produced with the following source code revisions and tool versions:
$ $GTCORE/scripts/gt-version.sh
0.2.13-94833
$ svn info
Path: /Users/smo036/langtech/main/langs/crk
Working Copy Root Path: /Users/smo036/langtech/main
URL: https://victorio.uit.no/langtech/trunk/langs/crk
Repository Root: https://victorio.uit.no/langtech
Repository UUID: c7155fb1-f0a7-4240-a2fc-2600b6f42f90
Revision: 94920
Node Kind: directory
Schedule: normal
Last Changed Author: lene
Last Changed Rev: 94918
Last Changed Date: 2014-05-28 05:17:06 +0000 (ons, 28 mai 2014)
$ hfst-twolc --version
hfst-twolc 0 (hfst 3.7.0)
Copyright (C) 2010 University of Helsinki,
License GPLv3: GNU GPL version 3 http://gnu.org/licenses/gpl.html
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
$ hfst-info
No tests selected; printing known data
HFST info version: 0.1
HFST packaging: hfst 3.7.0
HFST version: 3.7.0
HFST long version: 300070000
HFST configuration revision: $Revision: 3859 $
OpenFst supported
SFST supported
Unicode support: no (hfst)
$ twolc -v
twolc-3.4.13 (2.25.11)
Date: 2014-05-28 09:24:37 +0200
From: Sjur Nørstebø Moshagen <
The Hfst bug is reported to the Hfst team as https://sourceforge.net/p/hfst/bugs/245/.
Date: 2014-05-28 09:31:56 +0200
From: Lene Antonsen <
(In reply to comment #7)
This is definitely caused by differences in conflict handling in twolc parsing between Hfst and Xerox. But I am not so sure that Xerox is to blame anymore. Here are the conflicts as detected by HFST:
I would like to point out that when I comment out almost all the twol rules (keeping one for the compilation) and run make clean before make, I get entirely new fsts:
-rw-r--r-- 1 lan000 1907360568 105481 28 mai 01:29 src/generator-gt-norm.hfst
-rw-r--r-- 1 lan000 1907360568 13633 28 mai 01:29 src/generator-gt-norm.xfst
Still:
crk$ dcrk
amisk+N+AN+Pl
amisk+N+AN+Pl amiskak
^C
crk$ hdcrk
amisk+N+AN+Pl
amisk+N+AN+Pl amisk 0,000000
amisk+N+AN+Pl amiskak 0,000000
Date: 2014-05-28 10:42:04 +0200
From: Sjur Nørstebø Moshagen <
(In reply to comment #9)
I would like to point out that when I comment out almost all the twol rules (keeping one for the compilation) and run make clean before make: [...]
crk$ dcrk
amisk+N+AN+Pl
amisk+N+AN+Pl amiskak
^C
crk$ hdcrk
amisk+N+AN+Pl
amisk+N+AN+Pl amisk 0,000000
amisk+N+AN+Pl amiskak 0,000000
This difference comes from LexC, though I cannot explain why:
$ hfst-lookup -q src/morphology/crk.lexc.hfst
amisk+N+AN+Pl
amisk+N+AN+Pl >amisk 0,000000
amisk+N+AN+Pl >amisk>ak 0,000000
amisk+N+AN+Sg
amisk+N+AN+Sg >amisk 0,000000
amisk+N+AN+Sg >amisk>ak 0,000000
$ lookup -q src/morphology/crk.lexc.xfst
amisk
amisk amisk +N+AN+Sg
amisk>ak
amisk>ak amisk +N+AN+Pl
Date: 2014-05-28 10:43:44 +0200
From: Sjur Nørstebø Moshagen <
(In reply to comment #10)
This difference comes from LexC, though I cannot explain why:
Are flag diacritics involved in the number inflection of amisk?
Date: 2014-05-28 11:16:14 +0200
From: Trond Trosterud <
Are flag diacritics involved in the number inflection of amisk?
stems/nouns.lexc:
LEXICON AN-IN
@U.noun.abs@ STEMS ;
< 0:n 0:i "@U.noun.1sg@" 0:"t2" > STEMS ; ! 1
< 0:k 0:i "@U.noun.2sg@" 0:"t2" > STEMS ; ! 2
...
LEXICON STEMS
!! @LEXNAME@ add a affixmark and redirects to STEMLIST
0:%> STEMLIST ;
LEXICON STEMLIST
!! @LEXNAME@ for nouns getting prefixes ni-, ki-, o-
amisk ANimDECL "beaver" ; !yaml
...
The adventure continues in affixes/nouns.lexc:
LEXICON ANABSDECL !!= * @CODE@ for the animate absolute declension
< "+N":0 "+AN":0 "+Sg":0 "@U.noun.abs@" > SG_ ; !
< "+N":0 "+AN":0 "@U.noun.abs@" > OBVIATIVE ; !
< "+N":0 "+AN":0 "+Pl":0 "@U.noun.abs@" > PLak ; !
< "+N":0 "+AN":0 "@U.noun.abs@" > LOC ; !
< "+N":i "+AN":n "@U.noun.abs@" > LOCahk ; !
It works like this:
All nouns can take Px, and most of them can also have absolute (Px-less) inflection. To tell the two apart we use flags. So all nouns (and verbs, for that matter) have flags.
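As a rough illustration of how the U (unification) flags in these lexicons keep prefixed and absolute paths apart, here is a minimal sketch. It models only the standard U-flag behaviour (the first `@U.feature.value@` sets the feature, and any later U flag for the same feature must carry the same value); the symbol lists are simplified stand-ins loosely based on the lexicons quoted above, not the real paths, and `u_flags_ok` is a made-up name.

```python
import re

U_FLAG = re.compile(r"@U\.([^.@]+)\.([^.@]+)@")


def u_flags_ok(symbols):
    """Return True if the U flags along a path unify, False if the path is blocked.

    Only U flags are modelled; all other symbols are ignored.
    """
    seen = {}
    for symbol in symbols:
        match = U_FLAG.fullmatch(symbol)
        if match:
            feature, value = match.groups()
            # setdefault stores the first value and returns the stored one.
            if seen.setdefault(feature, value) != value:
                return False
    return True


# Simplified paths: flag from the prefix lexicon, stem, flag from the suffix lexicon.
absolute = ["@U.noun.abs@", ">", "amisk", "@U.noun.abs@", ">", "ak"]
mixed = ["n", "i", "@U.noun.1sg@", "t2", ">", "amisk", "@U.noun.abs@", ">", "ak"]

print(u_flags_ok(absolute))  # True
print(u_flags_ok(mixed))     # False
```

A 1Sg prefix combined with an absolute suffix is rejected, while a path that carries `@U.noun.abs@` at both ends passes.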
Date: 2014-05-29 19:10:54 +0200
From: Lene Antonsen <
I have been experimenting, and my theory is that hfst confuses the paths when the same flag diacritic is used in more than one path.
This lexicon remains the same in the experiments:
LEXICON NONLOCahk !!= * @CODE@ for the animate absolute except LOC on ahk
+Sg: SG ;
OBVIATIVE ;
+Pl: PLak ;
LOC ;
1)
LEXICON ANABSDECL !!= * @CODE@ for the animate absolute declension
< "+N":0 "+AN":0 "@U.noun.abs@" > NON_LOCahk ; !
< "+N":i "+AN":n "@U.noun.abs@" > LOCahk ; !
crk$ hdcrk
amisk+N+AN+Sg
amisk+N+AN+Sg amisk 0,000000
amisk+N+AN+Sg amiskin 0,000000 <====-in comes from the other path in ANABSDECL lexicon!
crk$ dcrk
amisk+N+AN+Sg
amisk+N+AN+Sg amisk
2)
LEXICON ANABSDECL !!= * @CODE@ for the animate absolute declension
@U.noun.abs@ DECL ;
LEXICON DECL
+N+AN: NON_LOCahk ; !
+N+AN:in LOCahk ; !
crk$ hdcrk
amisk+N+AN+Sg
amisk+N+AN+Sg amisk 0,000000
^C
crk$ dcrk
amisk+N+AN+Sg
amisk+N+AN+Sg amisk
This does not happen in the following lexicon, because the same flag is not used in two paths:
LEXICON ANSUFFSG !!= * @CODE@
< "+Px1Sg":%^POS "@U.noun.1sg@" > SG ; !
< "+Px2Sg":%^POS "@U.noun.2sg@" > SG ; !
< "+Px3Sg":%^POS "@U.noun.3sg@" > OBVIATIVE ; !
< "+Px4Sg":%^POS 0:i 0:y 0:i 0:w "@U.noun.3isg@" > OBVIATIVE ; !
< "+Px1Pl":%^POS 0:i 0:n 0:â 0:n "@U.noun.1pl@" > SG ; ! exclusive Pl -nân not -inân CHECK ??
<"+Px12Pl":%^POS 0:i 0:n 0:a 0:w "@U.noun.12pl@" > SG ; ! inclusive Pl CHECK OK?
< "+Px2Pl":%^POS 0:i 0:w 0:â 0:w "@U.noun.2pl@" > SG ; !
< "+Px3Pl":%^POS 0:i 0:w 0:â 0:w "@U.noun.3pl@" > OBVIATIVE ; !
< "+Px4Pl":%^POS 0:i 0:y 0:i 0:w "@U.noun.3ipl@" > OBVIATIVE ; ! obviative plural possessor - Okimasis corrected
Date: 2014-05-31 17:31:53 +0200
From: Lene Antonsen <
To illustrate my theory: the continuation lexicons IICONJ and IICONJw are built differently for verbs in crk, and hfst overgenerates less for IICONJ than for IICONJw.
Try it out:
cat test/data/VII-par.txt | sed 's/^/mihkwâw/' | hdcrk |l
cat test/data/VII-par.txt | sed 's/^/mihkwâw/' | dcrk |l
cat test/data/VII-par.txt | sed 's/^/miywâsin/' | dcrk |l
cat test/data/VII-par.txt | sed 's/^/miywâsin/' | hdcrk |l
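For readers less used to sed: the `sed 's/^/LEMMA/'` step simply glues the lemma onto the front of every tag string in the paradigm file before the result is piped to the generator. A small sketch of the same step, with invented tag strings, since the contents of VII-par.txt are not shown in this bug; `paradigm_inputs` is a hypothetical helper name.

```python
def paradigm_inputs(lemma, tag_lines):
    """Prepend the lemma to every tag line, as sed 's/^/LEMMA/' does.

    Unlike sed, blank lines are dropped so that they do not turn into
    bare lemmas in the generator input.
    """
    return [lemma + line.strip() for line in tag_lines if line.strip()]


# Made-up tag strings, for illustration only.
tags = ["+V+II+Ind+Prs+3Sg", "", "+V+II+Cnj+Prs+3Sg"]
for item in paradigm_inputs("mihkwâw", tags):
    print(item)
# mihkwâw+V+II+Ind+Prs+3Sg
# mihkwâw+V+II+Cnj+Prs+3Sg
```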
To get rid of all overgeneration in hfst, we cannot use the same prefix lexicon for Cnj and Indep, because hfst does not allow the same flag diacritic on different paths. This is a bug in hfst and should be fixed. If it is not, we will have to make a separate prefix lexicon for each path, which means quite a few. Until then, we will use only xfst for generation and analysis.
Date: 2014-05-31 17:51:51 +0200
From: Lene Antonsen <
(In reply to comment #14)
Try it out:
cat test/data/VII-par.txt | sed 's/^/mihkwâw/' | hdcrk |l
cat test/data/VII-par.txt | sed 's/^/mihkwâw/' | dcrk |l
cat test/data/VII-par.txt | sed 's/^/miywâsin/' | dcrk |l
cat test/data/VII-par.txt | sed 's/^/miywâsin/' | hdcrk |l
Or one can look at the yaml tests for these two verbs, which both pass with xfst but not with hfst. Be aware that the other yaml tests for verbs in crk have not been corrected yet: both tag strings and word forms have to be corrected.
Date: 2014-05-31 22:22:17 +0200
From: Lene Antonsen <
I've tried out different combinations of flags, e.g. "@D.mood.cnj@" (disallow), but it doesn't work with hfst.
Date: 2014-06-02 09:05:59 +0200
From: Sjur Nørstebø Moshagen <
The flag diacritics bug is now reported to the Hfst team as https://sourceforge.net/p/hfst/bugs/247/.
Date: 2014-06-02 09:18:21 +0200
From: Sjur Nørstebø Moshagen <
Summary for the Hfst team:
This bug report relating to Plains Cree seems to have detected two Hfst bugs:
Date: 2014-06-03 15:54:26 +0200
From: Sjur Nørstebø Moshagen <
Last comments from IRC:
[4:24pm] meriponi: i'm running: hfst-lookup --xfst=show-flags ././../../src/generator-gt-norm.hfst
[4:24pm] meriponi: and get:
[4:24pm] meriponi: > amisk+N+AN+Sg
[4:24pm] meriponi: amisk+N+AN+Sg @U.noun.abs@amisk@U.noun.abs@ 0.000000
[4:24pm] meriponi: amisk+N+AN+Sg @U.noun.abs@amiskin@U.noun.abs@ 0.000000
(meriponi = one of the hfst guys)
Then from me:
[4:42pm] sjnomos: $ xfst -s src/generator-gt-norm.xfst
[4:42pm] sjnomos: xfst[1]: set show-flags ON
[4:42pm] sjnomos: variable show-flags = ON
[4:42pm] sjnomos: xfst[1]: up amisk+N+AN+Sg
[4:42pm] sjnomos: @U.noun.abs@amisk@U.noun.abs@
That is, correct behaviour with the flags in both fst's, but wrong flag in one case in hfst. The suspicion now goes to the LexC parser in hfst. The bug hunting continues...
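The observation can be checked on the two hfst readings from the IRC log. The sketch below is illustrative only: it uses a simple regular expression for flag diacritics of the form `@X.feature.value@` (or `@X.feature@`), and the helper names `flags_of` and `strip_flags` are made up here.

```python
import re

FLAG_RE = re.compile(r"@[A-Z]\.[^@]+@")


def flags_of(reading):
    """List the flag diacritics in a show-flags reading, in order."""
    return FLAG_RE.findall(reading)


def strip_flags(reading):
    """Remove all flag diacritics, leaving the plain surface form."""
    return FLAG_RE.sub("", reading)


# The two hfst readings quoted in the IRC log above.
readings = ["@U.noun.abs@amisk@U.noun.abs@", "@U.noun.abs@amiskin@U.noun.abs@"]
for reading in readings:
    print(strip_flags(reading), flags_of(reading))
# amisk ['@U.noun.abs@', '@U.noun.abs@']
# amiskin ['@U.noun.abs@', '@U.noun.abs@']
```

Both readings carry the same, mutually compatible flags, so flag unification alone cannot filter out the unwanted form; this is consistent with looking for the cause in the LexC parser instead.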
Date: 2014-06-11 12:44:23 +0200
From: Sjur Nørstebø Moshagen <
This bug is fixed in the hfst code, both the lexc handling of flag diacritics with regex brackets <>, and the twolc inconsistencies compared to Xerox. That is, hfst transducers now behave as they should:
$ echo "amisk+N+AN+Pl" | hfst-lookup -p -q generator-raw-gt-desc.hfst
amisk+N+AN+Pl amiskwak 0,000000
(no "amisk" anymore in the generated output)
Unfortunately, I have broken the hfst build in the new infra, so we still can't build the hfst transducers. I am working on that.
I mark this bug as fixed.
This issue was created automatically with bugzilla2github
Bugzilla Bug 1859
Date: 2014-04-30T05:39:27+02:00 From: Lene Antonsen <>
To: Sjur Nørstebø Moshagen <>
CC: lene.antonsen, thomas.omma, tommi.pirinen, trond.trosterud
Last updated: 2014-06-11T12:44:23+02:00