nlbdev / pipeline

NLB branch of the super-project that aggregates all Pipeline related code. See https://github.com/daisy/pipeline for the main branch.
http://repo.nlb.no/pipeline

Improve hyphenation #104

Closed josteinaj closed 1 year ago

josteinaj commented 7 years ago

From Kari's Google sheet:

Hyphenation works, but the hyphenation is basic. Want an improvement here.

KariRudjord commented 6 years ago

Hyphenation is also discussed in #150, #10 and #3.

KariRudjord commented 6 years ago

Examples of wrong hyphenation from job 1879:

Wrong hyphens:
p. 51 få-tt
p. 61 Punjabs-lettene (Punjab-slettene)
p. 75 tri-sthet (trist-het)
p. 81 K-hoda (Khoda)
p. 108 s-kli (skli)
p. 113 h-vilke (hvilke)
p. 155 c-harpai
p. 158 pl-utselig (plutselig)
p. 163 komp-lott (kom-plott)
p. 254 s-enere (se-nere)
p. 268 e-normt (enormt)

Words that were not divided but should have been:
p. 56 kontroversielle (kon-tro-versielle, 3 possibilities)
p. 112 jentene (jen-tene)
p. 126 utfordringen (ut-for-drin-gen)
p. 130 frokost (fro-kost)
p. 131 mursteinsgulvet (mur-steinsgulvet)
p. 137 grønnsaksforhandlere (grønn-saks-forhandlere)
p. 130 kontrollen (kon-trollen)
p. 141 bakgrunnen (bak-grunnen)
p. 254 samfunnsbygging (sam-funnsbygging)

p. 280 kunde-r (kun-der)

matskober commented 6 years ago

One should never divide words with fewer than five characters!

Some errors I found:
bor-salinohatten (borsalino-hatten)
vin-glande (ving-lande)
møtte-st (møt-test)
kvis-trommet (kvist-rommet)
k-vifor (kvi-for)
no-ko (noko)
T-Forden (T-Forden, but keep on same line!)

KariRudjord commented 6 years ago

Can you implement Mats' rule, "never divide words with fewer than five characters"? Then we avoid this:

Much better now! But there are still some errors: få-tt (p. 41, job 1933). I also found these in the same job:
p. 73 tri-sthet (tristhet or trist-het)
p. 156 pl-utselig (plut-selig, plutse-lig or plutselig)
p. 189 komp-lott (kom-plott)

I notice that Norbraille divides words more often and that the pages are more filled with text (which improves readability). Is there a way to allow words to be divided at more places? Maybe this is a new card and a low-priority issue for now. I can make a new card for it in Jostein's new sheet for next year. E.g. possible dividing points:
Jord-skjel-vet
te-le-fon
Uten-riks-de-par-te-men-tet

bertfrees commented 6 years ago

The examples you give are words that are not in the dictionary. So it seems the generated patterns don't extrapolate very well to unknown words, or the words that are in the dictionary are not representative enough.

Also the dictionary seems to be very conservative in making hyphenations. Now that I think about it, it doesn't look very much like a hyphenation dictionary. Every word has at most one break point. @josteinaj Are you sure it's not some kind of dictionary of compound words or something?

josteinaj commented 6 years ago

There are two parameters that I know how to adjust: LEFT_HYPHEN_MIN and RIGHT_HYPHEN_MIN.

Since they're both currently 2, any word with fewer than 4 characters will not be hyphenated. If I change one of them to 3, any word with fewer than 5 characters will not be hyphenated.

In the list of standard hyphenation rules, there are 897 words ending in two characters after a hyphenation (administrasjons-by, alfabetiserings-år, armbands-ur, ...), and 9567 words starting with two characters before a hyphenation (an-fører, av-art, ma-ori, mo-dalen, ...). So the least invasive change would be to increase RIGHT_HYPHEN_MIN to 3, meaning that those 897 words will no longer be hyphenated at the end of the word. I'm going to try that...

I'll also update the words you mention.
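
For illustration, the effect of the two minimums on a word's break points can be sketched as below. This is only a sketch under assumptions: `apply_hyphen_min` is a hypothetical helper, not a function from the pipeline code, and it assumes break points are written as `-` inside the word.

```python
def apply_hyphen_min(hyphenated, left_min=2, right_min=2):
    """Drop break points that leave fewer than left_min letters before
    the break or fewer than right_min letters after it."""
    total = len(hyphenated.replace("-", ""))
    out = []
    seen = 0  # number of letters emitted so far
    for ch in hyphenated:
        if ch == "-":
            # Keep the break only if both sides are long enough.
            if left_min <= seen <= total - right_min:
                out.append("-")
        else:
            out.append(ch)
            seen += 1
    return "".join(out)


print(apply_hyphen_min("arbeids-ro"))               # arbeids-ro
print(apply_hyphen_min("arbeids-ro", right_min=3))  # arbeidsro
print(apply_hyphen_min("an-fører", right_min=3))    # an-fører
print(apply_hyphen_min("få-tt", right_min=3))       # fått
```

With both minimums at 2 the shortest word that can keep a break has 4 letters; raising either one to 3 raises that to 5, which is the "fewer than five characters" rule requested above.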

josteinaj commented 6 years ago

@bertfrees I don't remember where we got it from. I think it is the source file used for compiling the LibreOffice hyphenation table.

bertfrees commented 6 years ago

Yes, I think so too. You added it here: https://github.com/nlbdev/pipeline-mod-nlb/commit/ef9bda6. Can you check exactly how it was used originally?

josteinaj commented 6 years ago

I'll try googling around a bit.

KariRudjord commented 6 years ago

@josteinaj Didn't we get a dictionary from Mari (from Språkbanken) which indicated all the places in words where it is possible to divide?

josteinaj commented 6 years ago

@KariRudjord yes, but I'm not sure if we actually used it. We should probably have a second look at it though if we really want to improve the dictionary.

I think the one we currently use is from http://no.speling.org/ - there's a norsk.words file in spell-norwegian-2.2.tar.gz there. But it seems we have more words than there are in that list.

josteinaj commented 6 years ago

@bertfrees I notice that when setting RIGHT_HYPHEN_MIN to 3, the missed hyphenations at the end of the word are not marked with a *. For instance:

On line 15221: missed hyphens: arbeids-ro

Probably/hopefully just something with how the test results are rendered though and not something with how the actual hyphenation works.

josteinaj commented 6 years ago

@KariRudjord Is this really right?

Jord-skjel-vet
Uten-riks-de-par-te-men-tet

I would think "-et" should rather stand by itself so the previous part of the word is not divided?

Jord-skjelv-et
Uten-riks-de-par-te-ment-et

bertfrees commented 6 years ago

@josteinaj Yes, that's what I meant when I said the test doesn’t take into account the LEFTHYPHENMIN and RIGHTHYPHENMIN parameters. But it's easy to fix.
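
A test that respects the minimums could look roughly like the sketch below. The names `break_positions` and `compare` are hypothetical and the real test code may be structured differently; the only point is that expected break points which the minimums forbid are not counted as missed.

```python
def break_positions(hyphenated):
    """Return the set of letter offsets at which a '-' occurs."""
    positions = set()
    letters = 0
    for ch in hyphenated:
        if ch == "-":
            positions.add(letters)
        else:
            letters += 1
    return positions


def compare(expected, actual, left_min=2, right_min=2):
    """Classify break points as good, bad or missed, ignoring expected
    breaks that left_min/right_min would never allow."""
    length = len(expected.replace("-", ""))
    allowed = set(range(left_min, length - right_min + 1))
    expected_all = break_positions(expected)
    wanted = expected_all & allowed
    got = break_positions(actual)
    return {
        "good": wanted & got,
        "bad": got - expected_all,
        "missed": wanted - got,
    }


# With a right minimum of 3, the break in arbeids-ro is not reported as missed.
print(compare("arbeids-ro", "arbeidsro", right_min=3)["missed"])  # set()
print(compare("arbeids-ro", "arbeidsro", right_min=2)["missed"])  # {7}
```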

KariRudjord commented 6 years ago

@josteinaj Actually it is optional if you want departemen-tet/skjel-vet or departement-et/skjelv-et as long as you are consistent :-)

josteinaj commented 6 years ago

@KariRudjord ok :).

KariRudjord commented 6 years ago

I checked the hyphenation for a short text and found 30 bad hyphens in 30 pages. E.g. the Norwegian word familiehjemmet (family home) should be divided familie-hjemmet, but the result is familieh-jemmet. September should be divided sep-tember, but the result is se-ptember.

Could you take it back to an earlier step in the hyphenation implementation? And is that something you must do, @bertfrees, or can @josteinaj do it? This should be prioritized.

bertfrees commented 6 years ago

I have to do it because I am using a modified version of patgen. I will commit the source code.

bertfrees commented 6 years ago

I have reverted to språkbanken-1.

In språkbanken-2, the example you give, "familie-hjemmet", became "fa-m-i-l-ie-hjem-met", and in språkbanken-3, some more hyphens were added: "fa-m-i-l-ie-hj-e-m-met". So while the inferring of hyphens works well for some words, this example shows that for other words it adds unwanted hyphens. So if you want to be sure you don't get unwanted hyphens, the inferring algorithm needs to be improved, or we can't use it.

Another thing you can try is to make a version of språkbanken-2 where the inferred hyphens are replaced with a + (e.g. "fa+m+i+l+ie-hjem+met"). We already tried this but only for the hyphens added in the second step ("fa-m-i-l-ie-h+j+e+m-met"). This didn't give satisfying results, but it might work better if you do it with the hyphens added in the first step. I'll leave that up to you @josteinaj and @usama49 if you want to try it. I have committed the source code of patgen so you should be able to do that yourself now.
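
The `+` marking described above can be sketched as follows, assuming the same word is available both from språkbanken-1 and from språkbanken-2. `mark_inferred` is a hypothetical helper written for this illustration; it is not part of patgen or of the committed source code.

```python
def mark_inferred(original, extended):
    """Rewrite `extended` so that hyphens also present in `original`
    stay '-' and every other (inferred) hyphen becomes '+'."""
    if original.replace("-", "") != extended.replace("-", ""):
        raise ValueError("both entries must spell the same word")

    # Letter offsets of the hyphens that the original entry already had.
    trusted = set()
    letters = 0
    for ch in original:
        if ch == "-":
            trusted.add(letters)
        else:
            letters += 1

    out = []
    letters = 0
    for ch in extended:
        if ch == "-":
            out.append("-" if letters in trusted else "+")
        else:
            out.append(ch)
            letters += 1
    return "".join(out)


print(mark_inferred("familie-hjemmet", "fa-m-i-l-ie-hjem-met"))
# fa+m+i+l+ie-hjem+met
```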

KariRudjord commented 6 years ago

After going back to the previous version of the hyphenation list, the result is much better than it was.

There are still quite a lot of bad hyphens, though. (Hyphenation is very difficult; I don't think it is possible to get everything correct.) Could we keep this list, so that I can update it regularly when I find bad hyphenations? Does it work in a way that makes this possible? Does it contain words with dividing points that I can edit, or is it an algorithm that decides where to divide?

E.g.
de-tte should be dette
Ri-vertonprisen should have more possibilities for hyphen: Ri-ver-ton-prisen
de-nne should be denne
Lo-uis should be Louis
sa-mmen should be sam-men
tem-aer should be tema-er
ta-nke should be tan-ke
ov-erfor should be over-for

bertfrees commented 6 years ago

Hyphenation is done by an algorithm based on patterns. The list of patterns is generated from a list of hyphenated words. You can update the list of hyphenated words and regenerate the patterns.

The current patterns result in only 688 bad hyphens when applied to all the words in the språkbanken-1.txt file. This means almost none. So if you find bad hyphens it's probably because of mistakes in språkbanken-1.txt.

I found these words in språkbanken-1.txt:

de-t-te
de-n-ne
lo-uis
sa-m-m-en
te-m-a-er
ta-n-ke
ov-er-for

bertfrees commented 6 years ago

I wouldn't edit any of the språkbanken files though, because these were generated from other input data by Ammar. Like I suggested before, it would be good if Ammar could integrate his C# code with the rest of the code so that these conversions can be automated and you can edit the real input data.