giellalt / bugzilla-dummy


Greenlandic fst's cause segmentation faults in new infra (Bugzilla Bug 1499) #1436

Closed albbas closed 12 years ago

albbas commented 12 years ago

This issue was created automatically with bugzilla2github

Bugzilla Bug 1499

Date: 2012-11-02T19:43:25+01:00
From: Sjur Nørstebø Moshagen <>
To: Sjur Nørstebø Moshagen <>
CC: pela, sjur.n.moshagen, tommi.pirinen, trond.trosterud

Last updated: 2012-11-06T07:10:51+01:00

albbas commented 12 years ago

Comment 7289

Date: 2012-11-02 19:43:25 +0100 From: Sjur Nørstebø Moshagen <>

There's something fishy about the new infra when it comes to Greenlandic.

In the old infra, both hfst and xerox fst's are working fine. In the new infra, both hfst and xerox fst's cause a segmentation fault when analysing words. The source code is exactly the same.

The raw transducer (src/analyser-raw-desc.xfst) opens in xfst, and print random-* works fine.

albbas commented 12 years ago

Comment 7300

Date: 2012-11-05 12:17:22 +0100 From: Sjur Nørstebø Moshagen <>

Ok, here are some more details:

Printing random upper-lower pairs works fine:

kal sjur$ hfst-fst2strings -r 20 src/analyser-raw-gt-desc.hfst
&+PUNCT:&
'+PUNCT:'
;+CLB:;
AAU+SUAQ+N+Abs+Sg+LI:AAUersuarli
F+N+Abbr:F
KGÆÖÆ2%+Num+Abs:KGÆÖÆ2%
Lyrskovgade+LIAR+nv+VIP=GUNNAIR+vv+V+Con+3Pl:LyrskovgdePROPlirvikkunnaarc2pt
N9%+Num+RIAR+SSAAR=VIP=SUR+vv+RIAQ+vn+USAAQ+nn+GE=GALUAQ+nv+N+Abs+Sg+1SgPoss+TTAARLU:N9%-erirssaavissorriaqusaqg2igluqg2aCLITttaarlu
TA+una+DemAdv+Via+Sg+LIUNA:tssuunCLITliuna
TA+innga+DemInterj:taka
UQFSÄ+N+ACR+Abs:UQFSÄ
ta+Interj+UKU:nuku
una+DemInterj:uffa
}+PUNCT+RIGHT:}
Ä7812+Num+RIAAT+LIRI+nv+RIAR=SINNAA=NNGIT+vv+GALUAQ+vn+SUNNIP+nv+LLAQQIP=NIRAR+vv+PALUK+vn+VIK=SUAQ+nn+TUR+nv+LLUAR=TIGE+vv+NIRPAAQ+vn+U+nv+ALLAP=GALUAR+vv+TUQ+vn+N+Abs+Sg+LUGOOQ:Ä7812-eritsereriarsinnnngikkluqsunnipllaqqinnerrpalukvissuaqtorluartiginerpallkkaluartorlugooq
ÆQ7+Num+Trm+AASIILLU:ÆQ7-inunCLITasiillu
ø+N+Abbr:ø.
ø+N+Abbr:ø

Running lookup on the raw analyser also works fine, it seems (please note that the raw HFST transducer is a generator, not an analyzer):

$ hfst-lookup src/analyser-raw-gt-desc.hfst
una+DemInterj
una+DemInterj tss 0,000000
una+DemInterj tssa 0,000000
una+DemInterj tass 0,000000
una+DemInterj tassa 0,000000
una+DemInterj uff 0,000000
una+DemInterj uffa 0,000000

TA+innga+DemInterj
TA+innga+DemInterj tk 0,000000
TA+innga+DemInterj tka 0,000000
TA+innga+DemInterj tak 0,000000
TA+innga+DemInterj taka 0,000000
TA+innga+DemInterj tak 0,000000
TA+innga+DemInterj taka 0,000000
TA+innga+DemInterj taak 0,000000
TA+innga+DemInterj taaka 0,000000

No segmentation fault, but a lot of spurious (?) strings that probably should be cleaned up. The output suggests that a number of rewrite rules are too powerful and generate too many word forms, possibly with infinite recursion (which could explain the segmentation fault).
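To put a number on the spurious strings, the lookup output can be grouped per input. The following is a minimal Python sketch, not part of the GT infrastructure: the function names `group_lookup_output` and `overgenerated` and the `limit` threshold are made up for illustration, and the parser assumes three whitespace-separated fields per result line (input, output, weight), as in the output above.

```python
from collections import defaultdict


def group_lookup_output(text):
    """Map each looked-up string to the set of distinct outputs reported for it."""
    results = defaultdict(set)
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # blank line, echoed input, or anything unexpected
        source, target, _weight = fields  # weight is ignored here
        if target.endswith("+?"):
            continue  # the "+?" marker means no result was found
        results[source].add(target)
    return dict(results)


def overgenerated(results, limit=2):
    """Return the inputs with more than `limit` distinct outputs (hypothetical threshold)."""
    return {src: sorted(outs) for src, outs in results.items() if len(outs) > limit}


# Result lines copied from the raw-transducer lookup above.
SAMPLE = """\
una+DemInterj tss 0,000000
una+DemInterj tssa 0,000000
una+DemInterj tass 0,000000
una+DemInterj tassa 0,000000
una+DemInterj uff 0,000000
una+DemInterj uffa 0,000000
TA+innga+DemInterj tk 0,000000
TA+innga+DemInterj tka 0,000000
TA+innga+DemInterj tak 0,000000
TA+innga+DemInterj taka 0,000000
TA+innga+DemInterj tak 0,000000
TA+innga+DemInterj taka 0,000000
TA+innga+DemInterj taak 0,000000
TA+innga+DemInterj taaka 0,000000
"""

grouped = group_lookup_output(SAMPLE)
for source, forms in overgenerated(grouped).items():
    print(source, len(forms), " ".join(forms))
```

Duplicate result lines (such as the repeated `tak` and `taka`) collapse into one entry, so the count reflects distinct forms only.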

Running lookup on the standard (descriptive and normative) transducers triggers the segmentation fault:

kal sjur$ hfst-lookup src/analyser-gt-desc.hfst
illu
Segmentation fault: 11

kal sjur$ hfst-lookup src/analyser-gt-desc.hfst
taka
Segmentation fault: 11

kal sjur$ hfst-lookup src/analyser-gt-norm.hfst
taka
Segmentation fault: 11
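The `11` in `Segmentation fault: 11` is the signal number of SIGSEGV. A crash like this can be detected from a script, so that several transducers can be checked in one go. The following is a minimal Python sketch under that assumption; `run_lookup` is a hypothetical helper, and the crashing command is a stand-in child process rather than a real transducer, so the example runs without HFST installed.

```python
import signal
import subprocess
import sys


def run_lookup(command, text, timeout=30):
    """Run `command`, feed it `text` on stdin, and report (crashed, stdout)."""
    proc = subprocess.run(
        command,
        input=text,
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    # On POSIX, a return code of -N means the child was killed by signal N.
    crashed = proc.returncode == -signal.SIGSEGV
    return crashed, proc.stdout


# Stand-in for a crashing lookup tool: a child Python that sends itself SIGSEGV.
# With HFST installed, the command would instead be something like
# ["hfst-lookup", "src/analyser-gt-desc.hfst"], as in the sessions above.
FAKE_CRASH = [
    sys.executable,
    "-c",
    "import os, signal; os.kill(os.getpid(), signal.SIGSEGV)",
]

crashed, _ = run_lookup(FAKE_CRASH, "taka\n")
print("segmentation fault" if crashed else "ok")
```

The negative return code convention is specific to POSIX systems; the sketch does not cover Windows.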

The generator seems to behave more reasonably, although not as well as the raw transducer:

kal sjur$ hfst-lookup src/generator-gt-desc.hfst
TA+innga+DemInterj
TA+innga+DemInterj TA+innga+DemInterj+? inf

una+DemInterj
una+DemInterj tss 0,000000
una+DemInterj tssa 0,000000
una+DemInterj tass 0,000000
una+DemInterj tassa 0,000000
una+DemInterj uff 0,000000
una+DemInterj uffa 0,000000

At least it doesn't seg-fault.

The xerox transducers are not behaving so nicely:

kal sjur$ lookup -flags mbTT src/analyser-raw-gt-desc.xfst
0%>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>100%
TA+innga+DemInterj
TA+innga+DemInterj TA+innga+DemInterj +?

una+DemInterj
una+DemInterj una+DemInterj +?

taka
Segmentation fault: 11

The generator is a bit kinder:

kal sjur$ lookup -flags mbTT src/generator-gt-desc.xfst
0%>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>100%
TA+innga+DemInterj
TA+innga+DemInterj taaka
TA+innga+DemInterj taak
TA+innga+DemInterj taka
TA+innga+DemInterj tak
TA+innga+DemInterj taka
TA+innga+DemInterj tak
TA+innga+DemInterj tka
TA+innga+DemInterj tk

una+DemInterj
una+DemInterj uffa
una+DemInterj uff
una+DemInterj tassa
una+DemInterj tass
una+DemInterj tssa
una+DemInterj tss

ta+Interj+UKU
ta+Interj+UKU taCLITaku
ta+Interj+UKU taCLITuku
ta+Interj+UKU taaku
ta+Interj+UKU tauku
ta+Interj+UKU nCLITuku
ta+Interj+UKU nuku

No segmentation fault.

But it definitely displays the same overgeneration issues as the HFST transducers.

Summary: there seem to be issues with the Greenlandic transducers (cf. the overgeneration problems), but also potential issues with the interaction between the default Divvun/GT filters and the Greenlandic transducers. In addition, there are certainly things that both hfst and xerox could improve - they should never segfault :)

albbas commented 12 years ago

Comment 7309

Date: 2012-11-06 07:10:51 +0100 From: Sjur Nørstebø Moshagen <>

Tommi fixed this in r65012 by switching from the twolc file to the xfscript file for the (morpho)phonology. The switch also removed most of the overgeneration:

kal sjur$ hfst-lookup src/generator-gt-desc.hfst
TA+innga+DemInterj
TA+innga+DemInterj TA+innga+DemInterj+? inf

una+DemInterj
una+DemInterj tassa 0,000000
una+DemInterj uffa 0,000000

ta+Interj+UKU
ta+Interj+UKU taaku 0,000000

innga+DemInterj
innga+DemInterj ika 0,000000

^C
kal sjur$ hfst-lookup src/analyser-gt-desc.hfst
tassa
tassa tassa+Interj 0,000000
tassa tassa+part 0,000000
tassa una+DemInterj 0,000000

uffa
uffa uffa+Interj 0,000000
uffa una+DemInterj 0,000000

taaku
taaku ta+Interj+UKU 0,000000

ika
ika innga+DemInterj 0,000000

$ lookup -flags mbTT src/analyser-gt-desc.xfst
0%>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>100%
illu
illu ih+Interj+LU
illu illu+N+Abs+Sg

There are still a couple of multichar symbols on the surface side:

Ignored symbol @_EPSILONSYMBOL@
Ignored symbol AA
Ignored symbol TA

The last two (AA and TA) must be looked into.
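One rough way to follow up is to scan generated surface forms for symbol names that should not appear there. The following is a heuristic Python sketch only: `find_leftover_symbols` is a hypothetical helper, and a plain substring match cannot tell a multichar symbol from an ordinary letter sequence. `AA` and `TA` come from the warnings above; `CLIT` is included as an assumption, because it appears inside surface strings in the earlier generator output.

```python
def find_leftover_symbols(surface_forms, symbols):
    """Return {form: [symbols found]} for every form containing a listed symbol."""
    hits = {}
    for form in surface_forms:
        # Case-sensitive substring match: "taaku" does not match "AA".
        found = [symbol for symbol in symbols if symbol in form]
        if found:
            hits[form] = found
    return hits


# AA and TA are taken from the "Ignored symbol" warnings; CLIT is an assumption.
SUSPECT = ["AA", "TA", "CLIT"]

# Surface forms copied from generator output earlier in this report.
FORMS = ["taaku", "taCLITaku", "nCLITuku", "tassa"]

for form, found in find_leftover_symbols(FORMS, SUSPECT).items():
    print(form, found)
```

Forms that are legitimately written in capitals (acronyms, for instance) would be flagged as well, so any hit needs a manual check.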