Closed by albbas 12 years ago
Date: 2012-11-02 19:43:25 +0100
From: Sjur Nørstebø Moshagen
There's something fishy about the new infra when it comes to Greenlandic.
In the old infra, both the HFST and the Xerox FSTs work fine. In the new infra, both the HFST and the Xerox FSTs cause a segmentation fault when analysing words. The source code is exactly the same.
The raw transducer (src/analyser-raw-desc.xfst) opens in xfst, and print random-* works fine.
Date: 2012-11-05 12:17:22 +0100
From: Sjur Nørstebø Moshagen
Ok, here are some more details:
Printing random upper-lower pairs works fine:
kal sjur$ hfst-fst2strings -r 20 src/analyser-raw-gt-desc.hfst
&+PUNCT:&
'+PUNCT:'
;+CLB:;
AAU+SUAQ+N+Abs+Sg+LI:AAUersuarli
F+N+Abbr:F
KGÆÖÆ2%+Num+Abs:KGÆÖÆ2%
Lyrskovgade+LIAR+nv+VIP=GUNNAIR+vv+V+Con+3Pl:LyrskovgdePROPlirvikkunnaarc2pt
N9%+Num+RIAR+SSAAR=VIP=SUR+vv+RIAQ+vn+USAAQ+nn+GE=GALUAQ+nv+N+Abs+Sg+1SgPoss+TTAARLU:N9%-erirssaavissorriaqusaqg2igluqg2aCLITttaarlu
TA+una+DemAdv+Via+Sg+LIUNA:tssuunCLITliuna
TA+innga+DemInterj:taka
UQFSÄ+N+ACR+Abs:UQFSÄ
ta+Interj+UKU:nuku
una+DemInterj:uffa
}+PUNCT+RIGHT:}
Ä7812+Num+RIAAT+LIRI+nv+RIAR=SINNAA=NNGIT+vv+GALUAQ+vn+SUNNIP+nv+LLAQQIP=NIRAR+vv+PALUK+vn+VIK=SUAQ+nn+TUR+nv+LLUAR=TIGE+vv+NIRPAAQ+vn+U+nv+ALLAP=GALUAR+vv+TUQ+vn+N+Abs+Sg+LUGOOQ:Ä7812-eritsereriarsinnnngikkluqsunnipllaqqinnerrpalukvissuaqtorluartiginerpallkkaluartorlugooq
ÆQ7+Num+Trm+AASIILLU:ÆQ7-inunCLITasiillu
ø+N+Abbr:ø.
ø+N+Abbr:ø
Running lookup on the raw analyser also works fine, it seems (please note that the raw HFST transducer is a generator, not an analyzer):
$ hfst-lookup src/analyser-raw-gt-desc.hfst
una+DemInterj
una+DemInterj tss 0,000000
una+DemInterj tssa 0,000000
una+DemInterj tass 0,000000
una+DemInterj tassa 0,000000
una+DemInterj uff 0,000000
una+DemInterj uffa 0,000000

TA+innga+DemInterj
TA+innga+DemInterj tk 0,000000
TA+innga+DemInterj tka 0,000000
TA+innga+DemInterj tak 0,000000
TA+innga+DemInterj taka 0,000000
TA+innga+DemInterj tak 0,000000
TA+innga+DemInterj taka 0,000000
TA+innga+DemInterj taak 0,000000
TA+innga+DemInterj taaka 0,000000
No segmentation fault, but a lot of spurious (?) strings that should probably be cleaned up. The output suggests that a number of rewrite rules are too powerful: they generate too many word forms and could lead to infinite recursion, which might explain the segmentation fault.
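To make the overgeneration easier to quantify, the lookup output above can be grouped per input string. The following is only a minimal sketch: it assumes the three-column layout shown in the transcript (input, form, weight, separated by whitespace), and the function names `group_forms` and `duplicated_forms` are made up for this illustration.

```python
from collections import Counter, defaultdict


def group_forms(lookup_output):
    """Group generated forms by input string.

    Assumes each result line has exactly three whitespace-separated
    fields: input, form, weight. Other lines (echoed input, blank
    lines) are skipped, as are failed lookups with weight "inf".
    """
    forms = defaultdict(Counter)
    for line in lookup_output.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue
        source, form, weight = fields
        if weight == "inf":
            continue
        forms[source][form] += 1
    return dict(forms)


def duplicated_forms(forms):
    """Return, per input, the sorted forms that were emitted more than once."""
    report = {}
    for source, counter in forms.items():
        repeated = sorted(form for form, n in counter.items() if n > 1)
        if repeated:
            report[source] = repeated
    return report
```

Applied to the TA+innga+DemInterj block above, this would report six distinct forms, with tak and taka each emitted twice.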
Running lookup on the standard (descriptive and normative) transducers triggers the segmentation fault:
kal sjur$ hfst-lookup src/analyser-gt-desc.hfst
illu
Segmentation fault: 11

kal sjur$ hfst-lookup src/analyser-gt-desc.hfst
taka
Segmentation fault: 11

kal sjur$ hfst-lookup src/analyser-gt-norm.hfst
taka
Segmentation fault: 11
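Crashes like these can be checked for mechanically across a word list. The following is a rough sketch, not part of the build infrastructure: it assumes a POSIX system, that the lookup tool reads its input from stdin, and the helper name `probe` is invented for this example.

```python
import signal
import subprocess


def probe(command, words):
    """Run `command` once per word, feeding the word on stdin.

    Returns a dict mapping each word to None for a normal exit, or to
    the signal name (e.g. "SIGSEGV") if the process was killed.
    """
    results = {}
    for word in words:
        proc = subprocess.run(
            command,
            input=word + "\n",
            capture_output=True,
            text=True,
        )
        # On POSIX, a return code of -N means "killed by signal N".
        if proc.returncode < 0:
            results[word] = signal.Signals(-proc.returncode).name
        else:
            results[word] = None
    return results
```

Under these assumptions, `probe(["hfst-lookup", "src/analyser-gt-desc.hfst"], ["illu", "taka"])` would be expected to flag both words on the broken build, since signal 11 is SIGSEGV.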
The generator seems to behave more reasonably, although not as well as the raw transducer:
kal sjur$ hfst-lookup src/generator-gt-desc.hfst
TA+innga+DemInterj
TA+innga+DemInterj TA+innga+DemInterj+? inf

una+DemInterj
una+DemInterj tss 0,000000
una+DemInterj tssa 0,000000
una+DemInterj tass 0,000000
una+DemInterj tassa 0,000000
una+DemInterj uff 0,000000
una+DemInterj uffa 0,000000
At least it doesn't seg-fault.
The xerox transducers are not behaving so nicely:
kal sjur$ lookup -flags mbTT src/analyser-raw-gt-desc.xfst
0%>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>100%
TA+innga+DemInterj
TA+innga+DemInterj TA+innga+DemInterj +?

una+DemInterj
una+DemInterj una+DemInterj +?

taka
Segmentation fault: 11
The generator is a bit kinder:
kal sjur$ lookup -flags mbTT src/generator-gt-desc.xfst
0%>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>100%
TA+innga+DemInterj
TA+innga+DemInterj taaka
TA+innga+DemInterj taak
TA+innga+DemInterj taka
TA+innga+DemInterj tak
TA+innga+DemInterj taka
TA+innga+DemInterj tak
TA+innga+DemInterj tka
TA+innga+DemInterj tk

una+DemInterj
una+DemInterj uffa
una+DemInterj uff
una+DemInterj tassa
una+DemInterj tass
una+DemInterj tssa
una+DemInterj tss

ta+Interj+UKU
ta+Interj+UKU taCLITaku
ta+Interj+UKU taCLITuku
ta+Interj+UKU taaku
ta+Interj+UKU tauku
ta+Interj+UKU nCLITuku
ta+Interj+UKU nuku
No segmentation fault.
But it definitely displays the same overgeneration issues as the HFST transducers.
Summary: there seem to be issues with the Greenlandic transducers (cf. the overgeneration problems), but also potential issues with the interaction between the default Divvun/GT filters and the Greenlandic transducers. In addition, there are certainly things that both HFST and Xerox could improve: they should never segfault :)
Date: 2012-11-06 07:10:51 +0100
From: Sjur Nørstebø Moshagen
This is fixed in r 65012: Tommi switched from the twolc file to the xfscript file for the (morpho)phonology. The change also removed most of the overgeneration:
kal sjur$ hfst-lookup src/generator-gt-desc.hfst
TA+innga+DemInterj
TA+innga+DemInterj TA+innga+DemInterj+? inf

una+DemInterj
una+DemInterj tassa 0,000000
una+DemInterj uffa 0,000000

ta+Interj+UKU
ta+Interj+UKU taaku 0,000000

innga+DemInterj
innga+DemInterj ika 0,000000
^C
kal sjur$ hfst-lookup src/analyser-gt-desc.hfst
tassa
tassa tassa+Interj 0,000000
tassa tassa+part 0,000000
tassa una+DemInterj 0,000000

uffa
uffa uffa+Interj 0,000000
uffa una+DemInterj 0,000000

taaku
taaku ta+Interj+UKU 0,000000

ika
ika innga+DemInterj 0,000000
$ lookup -flags mbTT src/analyser-gt-desc.xfst
0%>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>100%
illu
illu ih+Interj+LU
illu illu+N+Abs+Sg
There are still a couple of multichar symbols on the surface side:
Ignored symbol @_EPSILONSYMBOL@
Ignored symbol AA
Ignored symbol TA
Only the last two (AA and TA) need to be looked into.
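The triage of these messages can be sketched in a few lines. This is an illustration only: it assumes the messages have exactly the "Ignored symbol X" form shown above, and the function name `symbols_to_review` is hypothetical.

```python
EPSILON = "@_EPSILONSYMBOL@"
PREFIX = "Ignored symbol "


def symbols_to_review(log_text):
    """List ignored symbols other than epsilon, in order of first appearance."""
    symbols = []
    for line in log_text.splitlines():
        line = line.strip()
        if not line.startswith(PREFIX):
            continue
        symbol = line[len(PREFIX):]
        # The epsilon symbol is excluded, matching the note above that
        # only the other two symbols need attention.
        if symbol != EPSILON and symbol not in symbols:
            symbols.append(symbol)
    return symbols
```

For the three messages above, this would leave AA and TA as the symbols to investigate.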
This issue was created automatically with bugzilla2github
Bugzilla Bug 1499
Date: 2012-11-02T19:43:25+01:00
From: Sjur Nørstebø Moshagen
To: Sjur Nørstebø Moshagen
CC: pela, sjur.n.moshagen, tommi.pirinen, trond.trosterud
Last updated: 2012-11-06T07:10:51+01:00