latex3 / babel

The multilingual framework to localize LaTeX, LuaLaTeX and XeLaTeX
https://latex3.github.io/babel/
LaTeX Project Public License v1.3c

Unclear documentation for doubleletter.hyphenation transformation #128

Closed huftis closed 3 years ago

huftis commented 3 years ago

I have a question about the new doubleletter.hyphen transformation defined for the norsk language in the new 3.58 version of the babel package.

From the documentation, it’s not clear how it is supposed to be used, or why one would want to use it. It looks like one can write

\babelprovide[transforms = doubleletter.hyphen]{norsk}

But I wonder if this perhaps is based on some misunderstanding of the Norwegian hyphenation rules. Note that there is no general rule in Norwegian (Bokmål or Nynorsk) that says that double consonants should get an extra letter when the word is hyphenated. For example, the words ‘spinne’, ‘plassering’, and ‘trykkeri’ should be hyphenated like this (in a very narrow column):

spin-
ne

plas-
se-
ring

tryk-
ke-
ri

But with the doubleletter.hyphen transform enabled, the first two words are incorrectly hyphenated as

spinn-
ne

plass-
se-
ring

(The word trykkeri has the same hyphenation as before, since kk is not included as one of the consonant pairs defined for doubleletter.hyphen.)
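The behaviour described above could be checked with a minimal test file along the following lines. This is only a sketch: it assumes babel 3.58 (where, per this report, doubleletter.hyphen is defined for norsk), LuaLaTeX as the engine, and Norwegian hyphenation patterns being available in the installation. The zero-width box is just a common trick for making TeX break at every hyphenation point it allows.

```latex
% Compile with LuaLaTeX; babel transforms are a LuaTeX feature.
\documentclass{article}
\usepackage[norsk]{babel}
% The line quoted in this report; it activates the transform for norsk.
% Comment it out to compare against the default hyphenation.
\babelprovide[transforms = doubleletter.hyphen]{norsk}
\begin{document}
% A zero-width box forces a line break at every permitted break point.
% \hspace{0pt} makes sure the first word can be hyphenated as well.
\parbox{0pt}{\hspace{0pt}spinne plassering trykkeri}
\end{document}
```

Compiling once with and once without the \babelprovide line should make any difference between the two hyphenations visible.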

It is true that we insert an extra letter in a few very special cases. But this only applies to compound words where there would in theory be a ‘triple consonant’ (which is forbidden, and simplified to a double consonant if the word is not hyphenated). For example, the word for ‘bus driver’ in Norwegian is a compound word generated from ‘buss’ + ‘sjåfør’, and is written ‘bussjåfør’ (instead of ‘busssjåfør’, if triple consonants were allowed). But when hyphenated, it turns into ‘buss-sjåfør’. The same is true for the Norwegian word for ‘eye contact’, which is the combination of ‘blikk’ and ‘kontakt’, and is written as either ‘blikkontakt’ or ‘blikk-kontakt’ (when hyphenated). (So note this rule also applies to ‘kk’ consonant pairs/triples.)

But this is a rule that only applies to a very small number of words, i.e., compound words where the first part is a word that would normally end with a double consonant and the second part is a word that begins with the same consonant. It’s not a general rule for words containing double consonants.
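For individual words, the ‘triple consonant’ rule can be marked by hand with TeX’s \discretionary primitive, which takes the pre-break, post-break, and no-break material as its three arguments. The following is a sketch only; \tripleletter is a hypothetical helper name introduced for illustration and is not part of babel.

```latex
\documentclass{article}
\usepackage[norsk]{babel}
% \tripleletter is a hypothetical helper, not a babel command.
% \discretionary{pre-break}{post-break}{no-break}:
%   at a line break: "-" ends the line and #1 starts the next line;
%   without a break: nothing is typeset.
\newcommand*{\tripleletter}[1]{\discretionary{-}{#1}{}}
\begin{document}
% Zero-width box: the break is taken, giving "buss-" / "sjåfør".
\parbox{0pt}{\hspace{0pt}buss\tripleletter{s}jåfør}

% Wide box: no break is needed, so the word stays "blikkontakt".
\parbox{5cm}{blikk\tripleletter{k}ontakt}
\end{document}
```

One caveat: in classic TeX engines, an explicit discretionary suppresses automatic hyphenation in the rest of the word, so other break points (for example in ‘sjåfør’) would have to be marked by hand as well.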

jbezos commented 3 years ago

@huftis I wanted to provide a number of transforms, so that the possible applications can be seen, but it's clear I went too far. I'll be more careful in the future. Is there any list with the most frequent cases?

huftis commented 3 years ago

In Norwegian, arbitrary words (mostly nouns) can be combined into a compound word. But I can take a look at the dictionary and find the compound words with this ‘triple consonant → double consonant’ property that are defined there. They are probably the most common ones. I’ll get back to you.

huftis commented 3 years ago

OK, I had a look at the official dictionary files for Norwegian Bokmål and Norwegian Nynorsk, which are available under the CC-BY 4.0 license. I found 380 Norwegian Bokmål and 200 Norwegian Nynorsk compound words with this ‘double letter’ property.

These are the lemmas (base words). Most words also have inflections (e.g. singular and plural forms). For example, the word ‘villaks’ (wild salmon) has the following inflected forms: villaks, villaksen, villakser, villaksene.

To support this, I guess we have to write a list of all (inflected) word forms and mark the ‘double letter’ hyphenation point in each of them.

Note that some of the words are compound words made by joining more than two words. Example: volleyballandslag. This means ‘national volleyball team’, and is a compound word made from ‘volleyball’ and ‘landslag’ (‘national team’). And ‘volleyball’ is a compound word made from ‘volley’ and ‘ball’, and ‘landslag’ is a compound word made from ‘land’ (‘country’) and ‘lag’ (‘team’). So the best way of hyphenating the word would actually be some sort of multilevel hyphenation, with the preferred hyphenation being:

volleyball-landslag

the secondary hyphenation being

volleyballands-lag or volley-ballandslag (the latter is somewhat misleading, in that it actually reads as ‘volley’ + ‘national ball team’)

and the third-level hyphenation would be:

vol-leyballandslag

jbezos commented 3 years ago

380 seems a bit too much 🙂, but a more restricted list with the most frequent cases would be enough. Anyway, having a list would allow me to experiment and see the limits of the mechanism, and how to improve it.

The same applies to the ‘ranked’ rules. Depending on the context, babel is able to assign different penalties to selected hyphenation points (discretionaries). In Spanish, for example, most cases (but not all) are covered by a single rule, as explained here.

jbezos commented 3 years ago

I'm closing this issue because I haven’t found a list of words, and therefore I can’t continue. Feel free to re-open it with a pointer or a list. Also, I've created a page in the babel site with an explanation about how to deal with them in luatex: https://latex3.github.io/babel/guides/locale-norwegian.html .