giellalt / bugzilla-dummy

bad hyphenation in compounds (Bugzilla Bug 545) #1656

Closed albbas closed 16 years ago

albbas commented 17 years ago

This issue was created automatically with bugzilla2github

Bugzilla Bug 545

Date: 2007-10-15T14:02:39+02:00
From: Thomas Omma <>
To: Tomi Pieski <>
CC: pbeinema, sjur.n.moshagen, tomi.k.pieski, trond.trosterud

Last updated: 2008-01-29T09:08:22+01:00

albbas commented 17 years ago

Comment 2094

Date: 2007-10-15 14:02:39 +0200 From: Thomas Omma <>

In compounds, the hyphenator treats the first letter of the second part as if it belonged to the first part (I have inserted a hash at the word boundary to make it easy to see):

viessom#g-uhkes
viessom#v-uohkáj
árvvo#v-uodo
värált#á-rbbe
åhpadus#o-rganisásjåvnån
häjmma#d-áfo
árbbe#d-áhpe
barggo#v-uogijt
rijka#d-ajva
javlla#m-áno
ássje#d-åbdde
láhka#á-sadimesa
giella#l-ágajn
sáme#g-iellaj
buorre#l-ágásj
suoma#g-iella
árbbe#d-ábálattjat

this seems to be the case in both hard-coded compounds and generated compounds

Julev-sáme, public beta 2

albbas commented 17 years ago

Comment 2097

Date: 2007-10-15 14:44:40 +0200 From: Thomas Omma <>

ålgusvaddema åvdåsvásstádus

These two words get hyphenated correctly. They are both hard-coded, with NO generated analyses.

albbas commented 17 years ago

Comment 2099

Date: 2007-10-15 14:49:42 +0200 From: Sjur Nørstebø Moshagen <>

What is strange is that the expected behaviour of the fall-back pattern should give correct hyphenation in many of the examples reported. That is, the last consonant in a consonant-group in front of a vowel should come after the hyphenation point.

This seems to point to a bug in our hyphenation transducers somewhere. At least we need to figure out what comes out of the PLX transducers.
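The fall-back rule described above (the last consonant of a consonant group in front of a vowel goes after the hyphenation point) can be sketched in a few lines of Python. This is only an illustration of the rule as stated in this comment, not the Polderland fall-back hyphenator; the function name and the vowel inventory are assumptions made for the example.

```python
# Assumed vowel inventory for the example; not taken from the real rule files.
VOWELS = set("aeiouyáåäæöø")


def fallback_hyphenate(word: str) -> str:
    """Insert '-' before the last consonant of each word-internal
    consonant group that is followed by a vowel."""
    out = []
    seen_vowel = False  # never hyphenate before the first vowel of the word
    for i, ch in enumerate(word):
        is_vowel = ch.lower() in VOWELS
        next_is_vowel = i + 1 < len(word) and word[i + 1].lower() in VOWELS
        if not is_vowel and next_is_vowel and seen_vowel:
            out.append("-")
        out.append(ch)
        seen_vowel = seen_vowel or is_vowel
    return "".join(out)


print(fallback_hyphenate("viessomguhkes"))  # vies-som-guh-kes
```

For the first word in the report, this naive rule already puts a hyphen at the compound boundary (viessom-guhkes), which is why the reported output is surprising.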

albbas commented 17 years ago

Comment 2100

Date: 2007-10-15 14:54:06 +0200 From: Thomas Omma <>

láhka-tektsta

This compound gets hyphenated correctly because it is not spelled correctly! As soon as I correct the spelling, it gets hyphenated the same way as the others:

láhka#t-æksta

albbas commented 17 years ago

Comment 2101

Date: 2007-10-15 15:22:26 +0200 From: Thomas Omma <>

here the phenomenon is reversed: sierr-a#láhkáj
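The notation used in these reports ('#' for the word boundary, '-' for the hyphen the hyphenator produced) can be checked mechanically. The sketch below is a hypothetical helper, not part of any Divvun tool; it assumes each reported form contains exactly one '#' and one '-', and returns how many letters the hyphen lies to the right of the boundary.

```python
def hyphen_offset(marked: str) -> int:
    """Distance in letters from the word boundary '#' to the hyphen '-'.

    Positive: the hyphen is to the right of the boundary.
    Negative: the hyphen is to the left of the boundary.
    """
    boundary = marked.index("#")
    hyphen = marked.index("-")
    # Count only real letters in front of each marker.
    boundary_pos = sum(ch not in "#-" for ch in marked[:boundary])
    hyphen_pos = sum(ch not in "#-" for ch in marked[:hyphen])
    return hyphen_pos - boundary_pos


print(hyphen_offset("viessom#g-uhkes"))  # 1
print(hyphen_offset("sierr-a#láhkáj"))   # -1
```

An offset of 0 would mean the hyphen coincides with the word boundary, which is the expected result.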

albbas commented 17 years ago

Comment 2114

Date: 2007-10-16 12:40:34 +0200 From: Thomas Omma <>

servodat#b-erošteaddji
olgo#b-áikkis
sáme#g-illii
má#i-lmmi
sátne#g-ovat
moraš#l-uohti
Justis#l-ávdegoddi
Sáme#d-ikkiin
lotnolas#e-aláhussan

and the odd reversed type: olggo-s#addán

Davvis-ámi, public beta 2, 2007-10-11

albbas commented 17 years ago

Comment 2122

Date: 2007-10-17 09:34:30 +0200 From: Sjur Nørstebø Moshagen <>

(In reply to comment #3)

láhka-tektsta

this compound gets hyphenated right because it is not spelled right! As soon as I correct the word, it gets hyphenated the same way as the others

láhka#t-æksta

The fact that the hyphenation goes wrong when the spelling is correct points to either wrong hyphenation points in the PLX entries or a bug in the Polderland code.

The hyphenation lexicon is exactly the same as the speller lexicon, and if the correctly spelled word is recognised by the speller, it should also be recognised by the hyphenator, including dynamic (generated) compounds.

The next step is thus to identify the PLX entry/-ies for this/these word(s), and if they are correct, including correct hyphenation points, we need to forward the issue to Polderland.

So Tomi, could you have a look at this, and find the PLX entries involved?

albbas commented 17 years ago

Comment 2146

Date: 2007-10-22 20:44:33 +0200 From: Sjur Nørstebø Moshagen <>

We have made several observations of the hyphenation module that point to a bug in the Polderland code. Basically, it looks like the hyphenator prefers dynamic compounds, and that these are consequently hyphenated one character to the right of the word boundary, as seen in the examples in the original bug report.

To illustrate, we have studied the word 'láhkatæksta' ('law text') in detail. In the latest speller, it should be recognised both as a lexicalised compound, and as a dynamic compound. The starting point is the following lexical entry from our Xerox format source file:

láhka#tæksta MUORRA ;

(# = word boundary)

This gives the following PLX entries relevant for this case:

láh^ka#tæks^tam NIR
láh^ka#tæks^tat NIR
láh^ka#tæks^tan NIR
láh^ka#tæks^taj NIR
láh^ka#tæks^ta NIR <===
láh^ka#tæks^ta- NALX
láh^ka#tæks^ta- NIAL
láh^ka#tæks^ta NAL <===
láh^ka#tæks^tas NIR

(^ = hyphenation point, both ^ and # are converted to - before PLX sorting and lexicon compilation, and - is converted to --)
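The marker conversion described in the parenthesis can be sketched as follows. This is a minimal illustration, not the actual conversion script; the function name is hypothetical, and doubling literal hyphens before replacing the markers is an assumption made so that the two kinds of hyphen stay distinguishable.

```python
def to_plx_hyphens(entry: str) -> str:
    """Convert source markers to the PLX form described above."""
    # Assumed order: double literal hyphens first ...
    escaped = entry.replace("-", "--")
    # ... then turn both hyphenation points and word boundaries into '-'.
    return escaped.replace("^", "-").replace("#", "-")


print(to_plx_hyphens("láh^ka#tæks^ta"))   # láh-ka-tæks-ta
print(to_plx_hyphens("láh^ka#tæks^ta-"))  # láh-ka-tæks-ta--
```

After this step the word boundary is no longer distinguishable from an ordinary hyphenation point.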

Just to check that the dynamic compound follows the same pattern, we also checked the PLX entries of the parts:

tæks^ta NIR <===
tæks^ta- NALX
tæks^ta- NIAL
tæks^ta NAL
tæks^tas NIR

láh^kaj NIR
láh^ka NIR
láh^ka- NALX
láh^kam NIR
láh^kat NIR
láh^kan NIR
láh^ka- NIAL
láh^ka NAL <===
láh^kas NIR
láh^kas^ka NIR

So far so good, and everything is consistent and as it should be. Then comes the Word output:

Julev-sáme, public beta 2, 2007-10-16:

láh-kat-æks-ta <= correctly spelled
lah-katæk-sta <= one misspelling
láh-ka-tek-sta <= another misspelling, which gets correctly hyphenated!

Julev-sáme, public beta 2, 2007-10-19:

láh-kat-ækst-a

Also, it seems that this problem is related to dynamic compounds only, cf comment 1, where there are no dynamic compound alternatives. In these cases, the hyphenation is correct.

That is, as long as there are only lexicalised alternatives (no dynamic compounding), the hyphenation seems to be mostly correct, but as soon as it is possible to analyse a word form as a dynamic compound, the hyphenation goes wrong at the compound border, even though a lexicalised compound exists as an alternative.

All examples in the original report are of the latter type, i.e. they can be analysed as dynamic compounds, but they also exist as lexicalised compounds.

Misspellings generally are hyphenated correctly. That points to a good fall-back, pattern-based hyphenator.

The oldest of the spellers tested above can be downloaded from here:

http://www.divvun.no/static_files/sami-proofing-tools-20071018.dmg
http://www.divvun.no/static_files/sami-proofing-tools-20071018.zip

The newest speller tested is available here:

http://www.divvun.no/static_files/sami-proofing-tools-20071022.dmg
http://www.divvun.no/static_files/sami-proofing-tools-20071022.zip

albbas commented 16 years ago

Comment 2277

Date: 2007-11-29 21:34:14 +0100 From: Sjur Nørstebø Moshagen <>

This one is fixed with the latest deliveries from Polderland.

albbas commented 16 years ago

Comment 2467

Date: 2008-01-22 11:38:35 +0100 From: Sjur Nørstebø Moshagen <>

lotnolasealáhussan olggosaddán

are still broken, thus reopening this bug.

See test report at:

http://www.divvun.no/doc/proof/spelling/testing/hyph-regression-pl-forrest-sme-20080122.html

for details.

(The word boundary can be found in Comment #5)

albbas commented 16 years ago

Comment 2487

Date: 2008-01-24 12:53:11 +0100 From: Tomi Pieski <>

Normative hyphenation fst gives two hyphenation points in generation:

Tomi-si-maskin:gt tomi$ lookup -flags mbTT -utf8 sme/bin/hisme-norm.fst
0%>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>100%
lotnolas#ealáhus+N+Ess
lotnolas#ealáhus+N+Ess    lot^no^las#ea^lá^hus^san
lotnolas#ealáhus+N+Ess    lot^no^la^sea^lá^hus^san

lotnolas#ealáhus+N+Sg+Ill
lotnolas#ealáhus+N+Sg+Ill    lot^no^las#ea^lá^hus^sii
lotnolas#ealáhus+N+Sg+Ill    lot^no^la^sea^lá^hus^sii

Hyphrules fst gives right hyphenation when it receives word boundary:

lotnolasealáhussan
lotnolasealáhussan    lot^no^la^sea^lá^hus^san

lotnolas#ealáhussan
lotnolas#ealáhussan    lot^no^las#ea^lá^hus^san

So, somewhere the word boundary is optionally removed. I suspect the twolc rules, because hyph-isme.save is compiled from the sources and twol-hyph-sme.bin, and it gives:

Tomi-si-maskin:gt tomi$ lookup -flags mbTT -utf8 sme/bin/hyph-isme.save
0%>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>100%
lotnolas#ealáhus+N+Ess
lotnolas#ealáhus+N+Ess    lotnolas#ealáhussan
lotnolas#ealáhus+N+Ess    lotnolasealáhussan

It optionally removes the word boundary. I didn't find anything in the twolc rules that could remove the '#' character.
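The effect of an optional boundary deletion can be mimicked with a toy model in Python. This is not the twolc rule itself, only a sketch of its observable behaviour; the function name and the boundary_optional switch are assumptions made for the illustration.

```python
from itertools import product


def surface_forms(lexical: str, boundary_optional: bool = True) -> set:
    """Return the surface strings for a lexical form containing '#'.

    With boundary_optional=True every '#' may independently be kept or
    deleted; with False every '#' is kept.
    """
    if not boundary_optional:
        return {lexical}
    parts = lexical.split("#")
    forms = set()
    # Try each combination of keeping ('#') or deleting ('') a boundary.
    for seps in product(["#", ""], repeat=len(parts) - 1):
        form = parts[0]
        for sep, part in zip(seps, parts[1:]):
            form += sep + part
        forms.add(form)
    return forms


print(sorted(surface_forms("lotnolas#ealáhussan")))
```

With the default setting the model yields both strings seen in the lookup output above; with boundary_optional=False only the form containing '#' remains.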

albbas commented 16 years ago

Comment 2494

Date: 2008-01-24 13:42:31 +0100 From: Sjur Nørstebø Moshagen <>

What are the relevant PLX entries?

albbas commented 16 years ago

Comment 2495

Date: 2008-01-24 14:00:15 +0100 From: Tomi Pieski <>

Created attachment 81 Plx entries

Attached file: clipboard.txt (text/plain, 6917 bytes) Description: Plx entries

albbas commented 16 years ago

Comment 2496

Date: 2008-01-24 14:00:48 +0100 From: Tomi Pieski <>

Added an attachment. Hopefully it works.

albbas commented 16 years ago

Comment 2497

Date: 2008-01-24 14:28:30 +0100 From: Sjur Nørstebø Moshagen <>

Thanks for the attachment, it worked fine:)

It shows two different and conflicting patterns:

lot^no^las#ea^lá^hus NIE
lot^no^la^sea^lá^hus NIE

The first one should give correct hyphenation, whereas the second one will give the wrong pattern identified in the test.

When the ^s and #s are converted to -s, there is no way for the hyphenator to distinguish between the two variants (ie pick the better one, with "better" meaning the one with a word boundary - there is no word boundary there anymore, only a soft hyphen).

In choosing between the two, the hyphenator can pick either of them. The bug is really on our side, we should not have generated the second variant.
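Conflicts of this kind can be searched for before the lexicon is compiled. The sketch below is a hypothetical check, not an existing Divvun script; it assumes the input is a list of PLX patterns with the word-class code already stripped, and it reports every word form that ends up with more than one hyphenation pattern once '#' and '^' have become the same marker.

```python
from collections import defaultdict


def conflicting_patterns(entries):
    """Map each bare word form to its patterns if it has more than one."""
    by_word = defaultdict(set)
    for entry in entries:
        bare = entry.replace("^", "").replace("#", "")
        # After conversion the two markers are indistinguishable,
        # so treat '#' as an ordinary hyphenation point here.
        by_word[bare].add(entry.replace("#", "^"))
    return {word: sorted(pats) for word, pats in by_word.items() if len(pats) > 1}


entries = [
    "lot^no^las#ea^lá^hus",
    "lot^no^la^sea^lá^hus",
    "láh^ka#tæks^ta",
]
print(conflicting_patterns(entries))
```

For the sample list only lotnolasealáhus is reported, with the two patterns lot^no^la^sea^lá^hus and lot^no^las^ea^lá^hus.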

Tomi, can you give a similar PLX listing for the other failing word in this bug, "olggosaddán"?

albbas commented 16 years ago

Comment 2498

Date: 2008-01-24 14:48:20 +0100 From: Tomi Pieski <>

This has the same pattern as with 'lotnolasealáhus':

olg^go^sad^dán NpIE
olg^go^sad^dán NaIE
olg^go^sad^dán NIE
olg^gos#ad^dán NpIE
olg^gos#ad^dán NaIE
olg^gos#ad^dán NIE

albbas commented 16 years ago

Comment 2499

Date: 2008-01-24 16:10:46 +0100 From: Sjur Nørstebø Moshagen <>

The PLX entries show that the analysis given in Comment #10 is correct. We need to find the offending optional #->0 rule.

There does not seem to be a case for Polderland in this bug, at least not until we have fixed ours.

albbas commented 16 years ago

Comment 2501

Date: 2008-01-24 19:29:36 +0100 From: Tomi Pieski <>

I was able to get only one surface string from hyph-isme.fst by changing all '#:' occurrences to '#'. I will commit the twolc file.

albbas commented 16 years ago

Comment 2502

Date: 2008-01-24 19:37:16 +0100 From: Sjur Nørstebø Moshagen <>

That is a dangerous move - it will make all #s obligatory on the surface, not exactly what we want...

That is, we want it in a hyphenation context (eg when making the PLX transducers), but not otherwise. If this is the only option, then we have a serious problem. It should not be necessary.

albbas commented 16 years ago

Comment 2506

Date: 2008-01-25 15:19:43 +0100 From: Sjur Nørstebø Moshagen <>

The link in Comment #9 isn't valid anymore, use this link instead:

http://www.divvun.no/doc/proof/hyph/testing/hy-regression-pl-forrest-smj-20080123.html

albbas commented 16 years ago

Comment 2512

Date: 2008-01-29 09:08:22 +0100 From: Sjur Nørstebø Moshagen <>

This bug has been fixed in the latest lexicon:

Davvisámi, version 1.0.1, 2008-01-28