Closed albbas closed 16 years ago
Date: 2007-10-15 14:02:39 +0200
From: Thomas Omma <
In compounds, the hyphenator treats the first letter of the second part as if it belonged to the first part (I have inserted a hash at the word boundary to make it easy to see):
viessom#g-uhkes viessom#v-uohkáj árvvo#v-uodo värált#á-rbbe åhpadus#o-rganisásjåvnån häjmma#d-áfo árbbe#d-áhpe barggo#v-uogijt rijka#d-ajva javlla#m-áno ássje#d-åbdde láhka#á-sadimesa giella#l-ágajn sáme#g-iellaj buorre#l-ágásj suoma#g-iella árbbe#d-ábálattjat
this seems to be the case in both hard-coded compounds and generated compounds
Julev-sáme, public beta 2
Date: 2007-10-15 14:44:40 +0200
From: Thomas Omma <
ålgusvaddema åvdåsvásstádus
These two words get hyphenated correctly. They are both hard-coded, with NO generated analyses.
Date: 2007-10-15 14:49:42 +0200
From: Sjur Nørstebø Moshagen <
What is strange is that the expected behaviour of the fall-back pattern should give correct hyphenation in many of the examples reported. That is, the last consonant in a consonant-group in front of a vowel should come after the hyphenation point.
This seems to point to a bug in our hyphenation transducers somewhere. At least we need to figure out what comes out of the PLX transducers.
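To make the expected fall-back behaviour concrete, here is a minimal sketch of the rule as stated above: the last consonant of a consonant group in front of a vowel goes after the hyphenation point. It is not the Polderland implementation. The function name `fallback_hyphenate` and the vowel inventory are assumptions made for illustration, and the sketch ignores vowel sequences and any language-specific exceptions.

```python
# Simplified vowel inventory (an assumption, not the project's real alphabet definition).
VOWELS = set("aáeioôuyæøåäö")


def fallback_hyphenate(word: str) -> str:
    """Insert '-' before the last consonant of each consonant group that precedes a vowel."""
    out = []
    seen_vowel = False  # never split off the word-initial consonant group
    for i, ch in enumerate(word):
        is_vowel = ch.lower() in VOWELS
        next_is_vowel = i + 1 < len(word) and word[i + 1].lower() in VOWELS
        if not is_vowel and next_is_vowel and seen_vowel:
            out.append("-")
        out.append(ch)
        seen_vowel = seen_vowel or is_vowel
    return "".join(out)


print(fallback_hyphenate("viessomguhkes"))  # vies-som-guh-kes
print(fallback_hyphenate("láhkatæksta"))    # láh-ka-tæks-ta
```

In both examples this simple rule already places a hyphen at the compound boundary (m-g and a-t), which is why the reported output is surprising.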
Date: 2007-10-15 14:54:06 +0200
From: Thomas Omma <
láhka-tektsta
this compound gets hyphenated right because it is not spelled right! As soon as I correct the word, it gets hyphenated the same way as the others
láhka#t-æksta
Date: 2007-10-15 15:22:26 +0200
From: Thomas Omma <
here the phenomenon is reversed: sierr-a#láhkáj
Date: 2007-10-16 12:40:34 +0200
From: Thomas Omma <
servodat#b-erošteaddji olgo#b-áikkis sáme#g-illii má#i-lmmi sátne#g-ovat moraš#l-uohti Justis#l-ávdegoddi Sáme#d-ikkiin lotnolas#e-aláhussan
and the odd reversed type: olggo-s#addán
Davvis-ámi, public beta 2, 2007-10-11
Date: 2007-10-17 09:34:30 +0200
From: Sjur Nørstebø Moshagen <
(In reply to comment #3)
láhka-tektsta
this compound gets hyphenated right because it is not spelled right! As soon as I correct the word, it gets hyphenated the same way as the others
láhka#t-æksta
The fact that the hyphenation goes wrong when the spelling is correct points either to wrong hyphenation points in the PLX entries or to a bug in the Polderland code.
The hyphenation lexicon is exactly the same as the speller lexicon, and if the correctly spelled word is recognised by the speller, it should also be recognised by the hyphenator, including dynamic (generated) compounds.
The next step is thus to identify the PLX entry/-ies for this/these word(s), and if they are correct, including correct hyphenation points, we need to forward the issue to Polderland.
So Tomi, could you have a look at this, and find the PLX entries involved?
Date: 2007-10-22 20:44:33 +0200
From: Sjur Nørstebø Moshagen <
We have made several observations of the hyphenation module that point to a bug in the Polderland code. Basically, it looks like the hyphenator prefers dynamic compounds, and that these are consistently hyphenated one char to the right of the word boundary, as seen in the examples in the original bug report.
To illustrate, we have studied the word 'láhkatæksta' ('law text') in detail. In the latest speller, it should be recognised both as a lexicalised compound, and as a dynamic compound. The starting point is the following lexical entry from our Xerox format source file:
láhka#tæksta MUORRA ;
This gives the following PLX entries relevant for this case:
láh^ka#tæks^tam NIR
láh^ka#tæks^tat NIR
láh^ka#tæks^tan NIR
láh^ka#tæks^taj NIR
láh^ka#tæks^ta NIR <===
láh^ka#tæks^ta- NALX
láh^ka#tæks^ta- NIAL
láh^ka#tæks^ta NAL <===
láh^ka#tæks^tas NIR
(^ = hyphenation point, both ^ and # are converted to - before PLX sorting and lexicon compilation, and - is converted to --)
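The conversion described in the parenthesis can be sketched as follows. This is only an illustration, not the actual build script: the function name `plx_to_compiled` is hypothetical, and the order of the two steps (doubling literal hyphens before converting the markers) is an assumption needed to keep the two kinds of hyphen apart.

```python
def plx_to_compiled(entry: str) -> str:
    """Convert a PLX pattern to the form used for sorting and lexicon compilation."""
    # Literal hyphens are doubled first, so they stay distinct from converted markers.
    doubled = entry.replace("-", "--")
    # Both the hyphenation point (^) and the word boundary (#) become a single hyphen.
    return doubled.replace("^", "-").replace("#", "-")


print(plx_to_compiled("láh^ka#tæks^ta"))   # láh-ka-tæks-ta
print(plx_to_compiled("láh^ka#tæks^ta-"))  # láh-ka-tæks-ta--
```

Note that after this step a word boundary and an ordinary hyphenation point look the same.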
Just to check that the dynamic compound follows the same pattern, we also checked the PLX entries of the parts:
tæks^ta NIR <===
tæks^ta- NALX
tæks^ta- NIAL
tæks^ta NAL
tæks^tas NIR
láh^kaj NIR
láh^ka NIR
láh^ka- NALX
láh^kam NIR
láh^kat NIR
láh^kan NIR
láh^ka- NIAL
láh^ka NAL <===
láh^kas NIR
láh^kas^ka NIR
So far so good: everything is consistent and as it should be. Then comes the Word output:
Julev-sáme, public beta 2, 2007-10-16:
láh-kat-æks-ta <= correctly spelled
lah-katæk-sta <= one misspelling
láh-ka-tek-sta <= another misspelling, which gets correctly hyphenated!
Julev-sáme, public beta 2, 2007-10-19:
láh-kat-ækst-a
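The off-by-one pattern in the Word output can be checked mechanically. The sketch below is a hypothetical test helper, not part of the existing test setup: it compares the position of # in the lexical form with the hyphen positions in the hyphenator output. The names `hyphen_offsets` and `boundary_status` are made up for illustration, and the sketch assumes exactly one # per word.

```python
import unicodedata


def hyphen_offsets(hyphenated: str) -> set:
    """Return the hyphen positions, counted in letters of the unhyphenated word."""
    offsets, pos = set(), 0
    for ch in hyphenated:
        if ch == "-":
            offsets.add(pos)
        else:
            pos += 1
    return offsets


def boundary_status(lexical: str, hyphenated: str) -> str:
    """Classify where the hyphenator put the hyphen relative to the '#' boundary."""
    # Normalise so that precomposed and decomposed accents are counted alike.
    lexical = unicodedata.normalize("NFC", lexical)
    hyphenated = unicodedata.normalize("NFC", hyphenated)
    boundary = lexical.index("#")  # number of letters before the boundary
    offsets = hyphen_offsets(hyphenated)
    if boundary in offsets:
        return "at boundary"
    if boundary + 1 in offsets:
        return "one char right"
    if boundary - 1 in offsets:
        return "one char left"
    return "no hyphen near boundary"


print(boundary_status("láhka#tæksta", "láh-kat-æks-ta"))  # one char right
print(boundary_status("láhka#tæksta", "láh-kat-ækst-a"))  # one char right
```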
Also, it seems that this problem is related to dynamic compounds only, cf comment 1, where there are no dynamic compound alternatives. In these cases, the hyphenation is correct.
That is, as long as there are only lexicalised alternatives (no dynamic compounding), the hyphenation seems to be mostly correct, but as soon as it is possible to analyse a word form as a dynamic compound, the hyphenation goes wrong at the compound border, even though there exists a lexicalised compound as an alternative.
All examples in the original report are of the latter type, ie they can be analysed as dynamic compounds, but they also exist as lexicalised compounds.
Misspellings generally are hyphenated correctly. That points to a good fall-back, pattern-based hyphenator.
The oldest of the spellers tested above can be downloaded from here:
http://www.divvun.no/static_files/sami-proofing-tools-20071018.dmg
http://www.divvun.no/static_files/sami-proofing-tools-20071018.zip
The newest speller tested is available here:
http://www.divvun.no/static_files/sami-proofing-tools-20071022.dmg
http://www.divvun.no/static_files/sami-proofing-tools-20071022.zip
Date: 2007-11-29 21:34:14 +0100
From: Sjur Nørstebø Moshagen <
This one is fixed with the latest deliveries from Polderland.
Date: 2008-01-22 11:38:35 +0100
From: Sjur Nørstebø Moshagen <
lotnolasealáhussan olggosaddán
are still broken, thus reopening this bug.
See test report at:
http://www.divvun.no/doc/proof/spelling/testing/hyph-regression-pl-forrest-sme-20080122.html
for details.
(The word boundary can be found in Comment #5)
Date: 2008-01-24 12:53:11 +0100
From: Tomi Pieski <
Normative hyphenation fst gives two hyphenation points in generation:
Tomi-si-maskin:gt tomi$ lookup -flags mbTT -utf8 sme/bin/hisme-norm.fst
0%>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>100%
lotnolas#ealáhus+N+Ess
lotnolas#ealáhus+N+Ess lot^no^las#ea^lá^hus^san
lotnolas#ealáhus+N+Ess lot^no^la^sea^lá^hus^san
lotnolas#ealáhus+N+Sg+Ill
lotnolas#ealáhus+N+Sg+Ill lot^no^las#ea^lá^hus^sii
lotnolas#ealáhus+N+Sg+Ill lot^no^la^sea^lá^hus^sii
Hyphrules fst gives right hyphenation when it receives word boundary:
lotnolasealáhussan
lotnolasealáhussan lot^no^la^sea^lá^hus^san
lotnolas#ealáhussan
lotnolas#ealáhussan lot^no^las#ea^lá^hus^san
So, somewhere the word boundary is optionally removed. I suspect the twolc rules, because hyph-isme.save is compiled from the sources and twol-hyph-sme.bin, and it gives:
Tomi-si-maskin:gt tomi$ lookup -flags mbTT -utf8 sme/bin/hyph-isme.save
0%>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>100%
lotnolas#ealáhus+N+Ess
lotnolas#ealáhus+N+Ess lotnolas#ealáhussan
lotnolas#ealáhus+N+Ess lotnolasealáhussan
It optionally removes the word boundary. I didn't find anything in the twolc rules that could remove the '#' character.
Date: 2008-01-24 13:42:31 +0100
From: Sjur Nørstebø Moshagen <
What are the relevant PLX entries?
Date: 2008-01-24 14:00:15 +0100
From: Tomi Pieski <
Created attachment 81 Plx entries
Attached file: clipboard.txt (text/plain, 6917 bytes) Description: Plx entries
Date: 2008-01-24 14:00:48 +0100
From: Tomi Pieski <
Added an attachment. Hopefully it works.
Date: 2008-01-24 14:28:30 +0100
From: Sjur Nørstebø Moshagen <
Thanks for the attachment, it worked fine:)
It shows two different and conflicting patterns:
lot^no^las#ea^lá^hus NIE
lot^no^la^sea^lá^hus NIE
The first one should give correct hyphenation, whereas the second one will give the wrong pattern identified in the test.
When the ^s and #s are converted to -s, there is no way for the hyphenator to distinguish between the two variants (ie pick the better one, with "better" meaning the one with a word boundary - there is no word boundary there anymore, only a soft hyphen).
In choosing between the two, the hyphenator can pick either of them. The bug is really on our side, we should not have generated the second variant.
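As an illustration of what "not generating the second variant" would mean for the PLX list, here is a minimal sketch of a post-filter that drops a boundary-less pattern whenever a variant of the same word and tag carries a word boundary. This is only a hypothetical workaround and not the fix applied here, since the proper place to fix this is in the transducers. The function name `drop_boundaryless_variants` and the (pattern, tag) tuple layout are assumptions.

```python
from collections import defaultdict


def drop_boundaryless_variants(entries):
    """Keep only the '#' variants when a word/tag pair has patterns with and without '#'."""
    groups = defaultdict(list)
    for pattern, tag in entries:
        # Variants of the same word differ only in their ^ and # markers.
        plain = pattern.replace("^", "").replace("#", "")
        groups[(plain, tag)].append(pattern)
    kept = []
    for (_plain, tag), patterns in groups.items():
        with_boundary = [p for p in patterns if "#" in p]
        # Fall back to all patterns for words without any compound analysis.
        for p in (with_boundary or patterns):
            kept.append((p, tag))
    return kept


entries = [
    ("lot^no^las#ea^lá^hus", "NIE"),
    ("lot^no^la^sea^lá^hus", "NIE"),
    ("tæks^ta", "NIR"),
]
print(drop_boundaryless_variants(entries))
# [('lot^no^las#ea^lá^hus', 'NIE'), ('tæks^ta', 'NIR')]
```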
Tomi, can you give a similar PLX listing for the other failing word in this bug, "olggosaddán"?
Date: 2008-01-24 14:48:20 +0100
From: Tomi Pieski <
This has the same pattern as with 'lotnolasealáhus':
olg^go^sad^dán NpIE
olg^go^sad^dán NaIE
olg^go^sad^dán NIE
olg^gos#ad^dán NpIE
olg^gos#ad^dán NaIE
olg^gos#ad^dán NIE
Date: 2008-01-24 16:10:46 +0100
From: Sjur Nørstebø Moshagen <
The PLX entries show that the analysis given in Comment #10 is correct. We need to find the offending optional #->0 rule.
There does not seem to be a case for Polderland in this bug, at least not until we have fixed ours.
Date: 2008-01-24 19:29:36 +0100
From: Tomi Pieski <
I was able to get only one surface string from hyph-isme.fst by changing all '#:' occurrences to '#'. I will commit the twolc file.
Date: 2008-01-24 19:37:16 +0100
From: Sjur Nørstebø Moshagen <
That is a dangerous move - it will make all #s obligatory on the surface, not exactly what we want...
That is, we want it in a hyphenation context (eg when making the PLX transducers), but not otherwise. If this is the only option, then we have a serious problem. It should not be necessary.
Date: 2008-01-25 15:19:43 +0100
From: Sjur Nørstebø Moshagen <
The link in Comment #9 isn't valid anymore, use this link instead:
http://www.divvun.no/doc/proof/hyph/testing/hy-regression-pl-forrest-smj-20080123.html
Date: 2008-01-29 09:08:22 +0100
From: Sjur Nørstebø Moshagen <
This bug has been fixed in the latest lexicon:
Davvisámi, version 1.0.1, 2008-01-28
This issue was created automatically with bugzilla2github
Bugzilla Bug 545
Date: 2007-10-15T14:02:39+02:00 From: Thomas Omma <>
To: Tomi Pieski <>
CC: pbeinema, sjur.n.moshagen, tomi.k.pieski, trond.trosterud
Last updated: 2008-01-29T09:08:22+01:00