gusbrs / zref-clever

Clever LaTeX cross-references based on zref
LaTeX Project Public License v1.3c

Add Dutch dictionary #5

Closed niluxv closed 2 years ago

niluxv commented 2 years ago

Adds Dutch dictionary to zref-clever.

gusbrs commented 2 years ago

Hi @niluxv , this is great, thank you very much!

It will definitely be merged, but I'd like to understand one thing better first: the multiple articles. You mentioned that several words have multiple articles, how does this work for Dutch? What does the "correct" article depend on? Preference or style? Context?

My native language has a few multiple gendered words (very few), but this only happens for different meanings of the word, so that there's only one correct gender for a given meaning. Well, I just did not anticipate this need / possibility.

Another question regarding how Dutch works. The genders stored in the dictionaries have a single purpose, which is to support the nudge feature and the g option, but does Dutch actually inflect the surrounding articles according to gender?

On practical matters, the regression test which fails (zc-dictionaries01) is expected to fail for any change in the dictionaries, so as long as this is the only one, there's nothing to worry about, and I'll update the test files after merging.

niluxv commented 2 years ago

Feminine and masculine use the same article "de", while neuter uses the article "het". I know whether a word uses "de" or "het", but when I want to know the gender of a "de" word I have to look it up in a dictionary, even though I am a native Dutch speaker (there are of course clear exceptions: everything ending in -ing seems to be feminine, etc.).

Only a few words are both neuter and feminine and/or masculine. In such cases both "de" and "het" are correct (probably decided after long debate about which form is correct :smile:).

I have no idea how to test locally. I have never used .dtx/docstrip before.

gusbrs commented 2 years ago

> Feminine and masculine use the same article "de", while neuter uses the article "het". I know whether a word uses "de" or "het", but when I want to know the gender of a "de" word I have to look it up in a dictionary, even though I am a native Dutch speaker (there are of course clear exceptions: everything ending in -ing seems to be feminine, etc.).

> Only a few words are both neuter and feminine and/or masculine. In such cases both "de" and "het" are correct (probably decided after long debate about which form is correct :smile:).

I see. :smile:

Well, being practical, in cases where both "de" and "het" are correct, is this just a matter of personal preference for one to use one or another? The same question actually applies, as things are currently set, to masculine and feminine, but considering what you told me so far, perhaps we could think of a fm gender (meaning masculine or feminine) since they refer to the same article form.

Actually, what I'm thinking here is that it might be meaningful to just accept multiple genders for a type. But that would only work if the article is a matter of personal preference, and I don't know whether that is the case.

WDYT?

> I have no idea how to test locally. I have never used .dtx/docstrip before.

No need to worry, I have tested locally, and only a couple of expected tests failed. I have deliberately made the test suite very sensitive, since I prefer the false positive to the false negative. But since the dictionary-related tests can surprise contributors who don't know about this, I may eventually drop them if they prove to be too noisy.

Well, in case you are curious, just run l3build check in the root of your local fork. But, as I said, don't worry: the couple of tests which failed were expected to do so, and I'll adjust them as needed after merging.

niluxv commented 2 years ago

> Well, being practical, in cases where both "de" and "het" are correct, is this just a matter of personal preference for one to use one or another?

Yes, if a word is both f/m (de) and neuter (het), then the article choice is a personal preference.

> The same question actually applies, as things are currently set, to masculine and feminine, but considering what you told me so far, perhaps we could think of a fm gender (meaning masculine or feminine) since they refer to the same article form.

For this package the difference between feminine and masculine shouldn't matter, but it also felt a little strange to just merge them. Of course the gender can matter: "the woman her possessions" (nl:"de vrouw haar eigendommen"), "the man his possessions" (nl:"de man zijn eigendommen"). (But then, how often do you say "the equation her left side"? You would just say "the left side of the equation"; same in Dutch.) Many words that are both feminine and masculine were originally only feminine, which is why I chose feminine for all f+m words. Like in any language, there are of course also words for which the gender depends on the context, like "person" (nl:"persoon") copying the gender of the person the word refers to.

> I have deliberately made the test suite very sensitive, since I prefer the false positive to the false negative.

:+1:

> Well, in case you are curious, just run l3build check in the root of your local fork.

Thank you, good to know!

gusbrs commented 2 years ago

Give me some time; let me see if I can do anything to better support the case of multiple genders for a given type.

gusbrs commented 2 years ago

Hi @niluxv , I've just added support for types with multiple genders. The gender key can now also receive a comma separated list as value, so that we can have something like:

type = table ,
  gender = {f,m} ,
  Name-sg = Tabel ,
  name-sg = tabel ,
  Name-pl = Tabellen ,
  name-pl = tabellen ,

This is now valid syntax in both \zcLanguageSetup and in the built-in dictionaries. And the g nudge now checks if the value it receives is contained in that list (and warns if it is not).
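
As a rough sketch of how this could look from the user side (not taken from the package documentation): the preamble below overrides the table type for Dutch with two genders and then passes the g option to \zcref. The label name tab:data is made up for illustration, and the sketch assumes that the nudge option is what switches the gender check on and that Dutch is declared with the genders f, m and n.

```latex
\documentclass{article}
\usepackage[dutch]{babel}
\usepackage{zref-clever}
\zcsetup{nudge}  % assumption: nudges (incl. the gender check) are off unless enabled

\zcLanguageSetup{dutch}{
  type = table ,
    gender = {f,m} ,
    Name-sg = Tabel ,
    name-sg = tabel ,
    Name-pl = Tabellen ,
    name-pl = tabellen ,
}

\begin{document}
\begin{table}
  \caption{Voorbeeld}
  \zlabel{tab:data}  % hypothetical label
\end{table}

De \zcref[g=f]{tab:data} ...   % f is in the list, so no warning is expected
Het \zcref[g=n]{tab:data} ...  % n is not in the list, so a warning is expected
\end{document}
```

The point of the list form is that g=f and g=m are both accepted for the same type, while anything outside the list still triggers the nudge.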

Would you like to review the dictionary in light of this new possibility?

niluxv commented 2 years ago

Done. Also removed the fig. abbreviation of figuur (figure) since I think its use is not common enough for inclusion. (I initially added it as the abbreviation was also present in the English dictionary and on the Wikipedia list of Dutch abbreviations.)

gusbrs commented 2 years ago

Great! About the figure abbreviation I was meaning to suggest that too. There is no reason to include it for Dutch just because it is there for English. You should just include abbreviations for Dutch if they are "traditional"/"well established" for the language, for the reasons I mentioned previously.

I'll look more carefully soon, but so far I spotted what seems to be a typo in the gender of figure: gender = { n , v , m } ,. I suppose you meant gender = { n , f , m } ,.

gusbrs commented 2 years ago

Another question here. I'm taking a look at babel language files for Dutch, and I see both dutch.ldf and afrikaans.ldf inside /usr/local/texlive/2021/texmf-dist/tex/generic/babel-dutch. Do you happen to know if afrikaans could be aliased to dutch for the purposes of this dictionary?

niluxv commented 2 years ago

I only have a single book in Dutch with numbered figures (most books are in English), so it is really hard for me to guess how common the abbreviation fig. is... (but at least it is an official abbreviation)

That v was a mistake indeed. The v stands for vrouwelijk (feminine), the Dutch adjective for femininum (Latin).

Probably Afrikaans can use almost the same dictionary, but I would be surprised if exactly the same would do. Anyway, Afrikaans is a daughter language of Dutch but it is not the same language, so I wouldn't alias it to Dutch.

gusbrs commented 2 years ago

@niluxv I've just merged it. Thank you very much! I intend to prepare a release soonish, but not immediately, since there are a couple of things I'd like to include before doing so. But it should not be too long until you can enjoy it from your regular installation.

> I only have a single book in Dutch with numbered figures (most books are in English), so it is really hard for me to guess how common the abbreviation fig. is... (but at least it is an official abbreviation)

I know how it is ;-). But I think it is a good idea to include abbreviations sparingly, so I concur with the move. And, if new information arises, we can always review this later on.

> Probably Afrikaans can use almost the same dictionary, but I would be surprised if exactly the same would do. Anyway, Afrikaans is a daughter language of Dutch but it is not the same language, so I wouldn't alias it to Dutch.

OK, no alias. I just had to ask.

gusbrs commented 1 year ago

Hi @niluxv , may I ask you something related to the Dutch language file?

I've just prepared some localization guidelines for contributors and, among other things, I included a recommendation of being consistent with babel captions, in these lines:

https://github.com/gusbrs/zref-clever/blob/5585b57b6939c6c57645610f122c840326a39b33/zref-clever.dtx#L8547-L8557

I think this is reasonable. And I was checking things in this light, and noticed that you've chosen for the appendix type the term "Appendix", while babel's dutch.ldf uses the term "Bijlage". What do you think of following babel in this regard?

niluxv commented 1 year ago

Hi @gusbrs, of course.

Both are correct, but it depends a bit on the context which one feels more natural. "bijlage" is the translation of attachment, but can also be used for appendix. I consulted some other people, and it appears "appendix" is considered more formal than "bijlage". In, say, a thesis I think "appendix" would be the canonical choice. (By the way, "appendix" is not an Anglicism; it comes (directly) from Latin IIUC.)

But I guess consistency with babel might be more important than using the most natural translation, especially when the most natural translation depends on context like in this case...

gusbrs commented 1 year ago

> Both are correct, but it depends a bit on the context which one feels more natural. "bijlage" is the translation of attachment, but can also be used for appendix. I consulted some other people, and it appears "appendix" is considered more formal than "bijlage". In, say, a thesis I think "appendix" would be the canonical choice. (By the way, "appendix" is not an Anglicism; it comes (directly) from Latin IIUC.)

> But I guess consistency with babel might be more important than using the most natural translation, especially when the most natural translation depends on context like in this case...

I assumed that both would be good choices in principle. But I think the requirement makes sense. If babel prints "Bijlage A" by default, wouldn't it be somewhat jarring that zref-clever prints "appendix A" for the same thing? Of course, both are configurable. My thought is that these defaults should match (unless there's a strong reason not to).
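
One way to see the two defaults side by side is a small test document like the one below. This is only a sketch: the label name is hypothetical, and it assumes that zref-clever's default handling of \appendix gives the chapter the appendix type and that the reference language follows babel's current language.

```latex
\documentclass{book}
\usepackage[dutch]{babel}
\usepackage{zref-clever}

\begin{document}
\appendix
\chapter{Voorbeeld}
\zlabel{app:voorbeeld}  % hypothetical label

% Name that babel supplies for the appendix heading
babel: \appendixname{} A

% Name that zref-clever supplies for a reference to the same appendix
zref-clever: \zcref{app:voorbeeld}
\end{document}
```

If the two names differ in the output, the heading and the cross-reference to it will disagree in any document that keeps both defaults.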

Would you be so kind as to prepare a new snippet for the appendix type (with gender and plurals)?

Btw, in dutch.ldf \appendixname is actually defined as B\ij lage. I've investigated: \ij is defined by the kernel and puts a kern between i and j, but a comment there says Dutch has an "ij" letter. Since zref-clever presumes UTF-8 anyway, perhaps it could be used. Do you happen to know what would be wiser?

niluxv commented 1 year ago

> If babel prints "Bijlage A" by default, wouldn't it be somewhat jarring that zref-clever prints "appendix A" for the same thing?

Yes, that would be horrifying! Consistency within the document is much more important than using the best/most natural word available.

I'm pretty sure both "bijlages" and "bijlagen" are correct plurals for "bijlage"; does babel include a plural here? Will have to check the dictionary for the gender.

> Btw, in dutch.ldf \appendixname is actually defined as B\ij lage.

Hm, there is U+0133. Unicode describes it as a ligature of i and j (but it does have a separate code-point), so if the font includes it, shouldn't unicode aware LaTeX automatically use it? I'm not sure here. Btw, the English Wikipedia includes quite some information about the Dutch usage of "ij".

niluxv commented 1 year ago

Babel has a line

encl = B\ij lage(n)

in babel-nl.ini, so Bijlagen should be the plural for consistency with babel.

Regarding the ij issue, polyglossia has

    \def\appendixname{Bijlage}%

(just an i and a j; it is not a special unicode symbol), so I think we can just use separate i and j, and probably LuaLaTeX will do the right thing.

gusbrs commented 1 year ago

> Yes, that would be horrifying! Consistency within the document is much more important than using the best/most natural word available.

Good we agree. :smile:

> I'm pretty sure both "bijlages" and "bijlagen" are correct plurals for "bijlage"; does babel include a plural here? Will have to check the dictionary for the gender.

> Babel has a line
>
> encl = B\ij lage(n)
>
> in babel-nl.ini, so Bijlagen should be the plural for consistency with babel.

Do you think both plurals sound equally good? If that's the case, I'd say "bijlagen" too.

> Hm, there is U+0133. Unicode describes it as a ligature of i and j (but it does have a separate code-point), so if the font includes it, shouldn't unicode aware LaTeX automatically use it? I'm not sure here. Btw, the English Wikipedia includes quite some information about the Dutch usage of "ij".

> Regarding the ij issue, polyglossia has
>
>     \def\appendixname{Bijlage}%
>
> (just an i and a j; it is not a special unicode symbol), so I think we can just use separate i and j, and probably LuaLaTeX will do the right thing.

Mhm, consider the following document:

\documentclass{book}

\usepackage[dutch]{babel}

\begin{document}

\showoutput

\appendixname{}

B\ij lage

Bĳlage % U+0133

Bijlage % "i and j"

\end{document}

Compiling it with pdflatex we get:

\vbox(627.36243+0.0)x380.0
.\glue 22.0
.\vbox(605.36243+0.0)x345.0, shifted 35.0
..\vbox(12.0+0.0)x345.0, glue set 5.55556fil
...\glue 0.0 plus 1.0fil
...\hbox(6.44444+0.0)x345.0
....\hbox(6.44444+0.0)x345.0, glue set 339.99998fil
.....\glue 0.0 plus 1.0fil
.....\OT1/cmr/m/n/10 1
..\glue 18.06749
..\glue(\lineskip) 0.0
..\vbox(550.0+0.0)x345.0, glue set 504.0fil
...\write-{}
...\write1{\babel@aux{dutch}{}}
...\glue(\topskip) 3.05556
...\hbox(6.94444+1.94444)x345.0, glue set 300.06107fil
....\hbox(0.0+0.0)x15.0
....\OT1/cmr/m/n/10 B
....\OT1/cmr/m/n/10 i
....\kern -0.20004
....\penalty 10000
....\glue 0.0
....\OT1/cmr/m/n/10 j
....\OT1/cmr/m/n/10 l
....\OT1/cmr/m/n/10 a
....\OT1/cmr/m/n/10 g
....\OT1/cmr/m/n/10 e
....\penalty 10000
....\glue(\parfillskip) 0.0 plus 1.0fil
....\glue(\rightskip) 0.0
...\glue(\parskip) 0.0 plus 1.0
...\glue(\parskip) 0.0
...\glue(\baselineskip) 3.11111
...\hbox(6.94444+1.94444)x345.0, glue set 300.06107fil
....\hbox(0.0+0.0)x15.0
....\OT1/cmr/m/n/10 B
....\OT1/cmr/m/n/10 i
....\kern -0.20004
....\penalty 10000
....\glue 0.0
....\OT1/cmr/m/n/10 j
....\OT1/cmr/m/n/10 l
....\OT1/cmr/m/n/10 a
....\OT1/cmr/m/n/10 g
....\OT1/cmr/m/n/10 e
....\penalty 10000
....\glue(\parfillskip) 0.0 plus 1.0fil
....\glue(\rightskip) 0.0
...\glue(\parskip) 0.0 plus 1.0
...\glue(\parskip) 0.0
...\glue(\baselineskip) 3.11111
...\hbox(6.94444+1.94444)x345.0, glue set 300.06107fil
....\hbox(0.0+0.0)x15.0
....\OT1/cmr/m/n/10 B
....\OT1/cmr/m/n/10 i
....\kern -0.20004
....\penalty 10000
....\glue 0.0
....\OT1/cmr/m/n/10 j
....\OT1/cmr/m/n/10 l
....\OT1/cmr/m/n/10 a
....\OT1/cmr/m/n/10 g
....\OT1/cmr/m/n/10 e
....\penalty 10000
....\glue(\parfillskip) 0.0 plus 1.0fil
....\glue(\rightskip) 0.0
...\glue(\parskip) 0.0 plus 1.0
...\glue(\parskip) 0.0
...\glue(\baselineskip) 3.11111
...\hbox(6.94444+1.94444)x345.0, glue set 299.86102fil
....\hbox(0.0+0.0)x15.0
....\OT1/cmr/m/n/10 B
....\OT1/cmr/m/n/10 i
....\OT1/cmr/m/n/10 j
....\OT1/cmr/m/n/10 l
....\OT1/cmr/m/n/10 a
....\OT1/cmr/m/n/10 g
....\OT1/cmr/m/n/10 e
....\penalty 10000
....\glue(\parfillskip) 0.0 plus 1.0fil
....\glue(\rightskip) 0.0
...\glue -1.94444
...\glue 0.0 plus 1.0fil
...\glue 0.0
..\glue(\baselineskip) 25.29494
..\hbox(0.0+0.0)x345.0
...\hbox(0.0+0.0)x345.0

The first three are identical, and the last, using "i and j" is different, and indeed looks different. And, though polyglossia is supported, babel is definitely our reference.

In the kernel, \ij is defined as one of:

\DeclareTextCommand{\ij}{OT1}{%
    \nobreak\hskip\z@skip i\kern-0.02em\nobreak\hskip\z@skip j}
\DeclareTextSymbol{\ij}{T1}{188}
\DeclareUnicodeSymbol{\ij}{"0133}

So, at the font encoding level, for unicode engines \ij is literally U+0133. But our issue here is on the input encoding side. Since the default input encoding was already UTF-8 when zref-clever was released, even for pdflatex, I've just been assuming it for all the language files. So I think "ĳ" (U+0133) is the most natural choice, even if \ij would be equivalent and safe.

All in all, I think we are just missing the gender(s).

gusbrs commented 1 year ago

Just complementing the previous comment: definitely UTF-8 (see https://chat.stackexchange.com/transcript/message/62644791#62644791). Indeed, I just went ahead and formalized it as a requirement.

niluxv commented 1 year ago

I created a PR for the appendix -> bijlage change. I think the ij stuff is orthogonal to this change since many words in the translation would need to be changed.

> Compiling it with pdflatex we get:

Ah, sorry, I thought you were talking about unicode aware TeX engines. I'll do some tests on LuaTeX to see whether ij (2 letters) and ĳ (1 letter) makes a difference there. Would be unfortunate if we regressed LuaTeX support in favour of pdfTeX support...

gusbrs commented 1 year ago

> Ah, sorry, I thought you were talking about unicode aware TeX engines. I'll do some tests on LuaTeX to see whether ij (2 letters) and ĳ (1 letter) makes a difference there. Would be unfortunate if we regressed LuaTeX support in favour of pdfTeX support...

Well, unicode aware engines too. I've just tested with pdflatex because that was the case where using the unicode symbol might be troublesome. The zref-clever test suite runs all tests for pdflatex, lualatex, xelatex, and their respective dev formats.

Indeed, I just tried that same example with lualatex and xelatex, and the result is equivalent to that of pdflatex. Namely, the first three forms are identical, and what we want. And the last one, "i and j" is not, it looks different.

So, given zref-clever uses UTF-8 language files, we should use U+0133 instead of \ij (or ij).

> I think the ij stuff is orthogonal to this change since many words in the translation would need to be changed.

What "many words" do you mean here? Why do you say that?

gusbrs commented 1 year ago

> What "many words" do you mean here? Why do you say that?

Answering my own question: Oh, I see, there are other "ij"s too around, presumably the same thing. Perhaps even "oe", I don't know.

So, understood, orthogonal, and I'll merge the PR.

Regarding whether to use U+0133 or not, I don't think the "consistency with babel" needs to be pushed to that degree. This is not to say "don't do it", just really that it is your call. What you think works best, I'm with it. I really have no idea of typographical traditions in Dutch.

niluxv commented 1 year ago

> Indeed, I just tried that same example with lualatex and xelatex, and the result is equivalent to that of pdflatex. Namely, the first three forms are identical, and what we want. And the last one, "i and j" is not, it looks different.

On my LuaLaTeX setup, all of ij (two letters), ĳ (one letter) and \ij give the same result (slight kerning between the letters; I haven't found a font yet which includes a special ligature). Only i\/j disables the kern and prints differently.

So I guess we can go for the unicode character. On unicode aware TeX it doesn't seem to matter and on pdfTeX it applies the kern (which I think is a good thing).

niluxv commented 1 year ago

Made a PR. Okay, "many" was exaggerated, it was only three.

> Perhaps even "oe", I don't know.

No, it's not French :)

gusbrs commented 1 year ago

> On my LuaLaTeX setup, all of ij (two letters), ĳ (one letter) and \ij give the same result (slight kerning between the letters; I haven't found a font yet which includes a special ligature). Only i\/j disables the kern and prints differently.

Just testing with lualatex here the following:

\documentclass{book}

\usepackage[dutch]{babel}

\begin{document}

\showoutput

Bĳlage % U+0133

Bijlage % "i and j"

\end{document}

I get:

...\hbox(6.94+2.06)x345.0, glue set 300.14001fil, direction TLT
....\localpar
.....\localinterlinepenalty=0
.....\localbrokenpenalty=0
.....\localleftbox=null
.....\localrightbox=null
....\hbox(0.0+0.0)x15.0, direction TLT
....\TU/lmr/m/n/10 B
....\TU/lmr/m/n/10 ĳ
....\TU/lmr/m/n/10 l
....\TU/lmr/m/n/10 a
....\TU/lmr/m/n/10 g
....\TU/lmr/m/n/10 e
....\penalty 10000
....\glue(\parfillskip) 0.0 plus 1.0fil
....\glue(\rightskip) 0.0
...\glue(\parskip) 0.0 plus 1.0
...\glue(\parskip) 0.0
...\glue(\baselineskip) 3.0
...\hbox(6.94+2.06)x345.0, glue set 299.85999fil, direction TLT
....\localpar
.....\localinterlinepenalty=0
.....\localbrokenpenalty=0
.....\localleftbox=null
.....\localrightbox=null
....\hbox(0.0+0.0)x15.0, direction TLT
....\TU/lmr/m/n/10 B
....\TU/lmr/m/n/10 i
....\TU/lmr/m/n/10 j
....\discretionary (penalty 50)
.....< \TU/lmr/m/n/10 -
....\TU/lmr/m/n/10 l
....\TU/lmr/m/n/10 a
....\TU/lmr/m/n/10 g
....\TU/lmr/m/n/10 e
....\penalty 10000
....\glue(\parfillskip) 0.0 plus 1.0fil
....\glue(\rightskip) 0.0

[Image: Screenshot from 2022-12-27 08-38-25]

Isn't that what you get? Indeed there's a slight difference in kerning. (And, if I understand the output correctly, there's a difference in hyphenation, but whatever the right one is, it's not really our "jurisdiction").

> So I guess we can go for the unicode character. On unicode aware TeX it doesn't seem to matter and on pdfTeX it applies the kern (which I think is a good thing).

I guess results will depend on the font after all, but I also think the unicode character is the safest. When you have time for this, it will be much appreciated. Thank you.

gusbrs commented 1 year ago

> Made a PR. Okay, "many" was exaggerated, it was only three.

Oh, already done! Thank you once again. I have merged it, and this will be in the next release, which hopefully won't take long (I just want to finish the Italian localization, which is in process).

> > Perhaps even "oe", I don't know.
>
> No, it's not French :)

One never knows. ;-)

niluxv commented 1 year ago

> Isn't that what you get?

Well, I'm always using fontspec + polyglossia in my LuaLaTeX setup; then a separate i + j gives proper kerning (identical to U+0133) automatically.

> And, if I understand the output correctly, there's a difference in hyphenation

Huh, really? Good to use U+0133 then. Hyphenating between an i and a j is a terrible idea in Dutch.

gusbrs commented 1 year ago

> Well, I'm always using fontspec + polyglossia in my LuaLaTeX setup; then a separate i + j gives proper kerning (identical to U+0133) automatically.

I see, but my reference here, as I mentioned, must be babel. Out of curiosity, is there a reason you prefer polyglossia?

> Huh, really? Good to use U+0133 then. Hyphenating between an i and a j is a terrible idea in Dutch.

No, the \discretionary is between the J and the L. But it goes missing when the unicode symbol is used (same with babel's \ij). Anyway, as I said, this is out of our scope here. If it's wrong, it must be reported to whoever takes care of hyphenation for Dutch...

gusbrs commented 1 year ago

> Well, I'm always using fontspec + polyglossia in my LuaLaTeX setup; then a separate i + j gives proper kerning (identical to U+0133) automatically.

You actually got me curious with this. :wink:

So I compiled the following document with lualatex:

\documentclass{book}

\usepackage{fontspec}

\usepackage{polyglossia}
\setdefaultlanguage{dutch}

\linespread{0.5}

\newcommand\foo{\rule[-.4ex]{0.01cm}{2ex}}

\begin{document}

\Huge

\showoutput

B\ij lage\foo

Bĳlage\foo % U+0133

Bijlage\foo % "i and j"

\end{document}

[Image: Screenshot from 2022-12-27 12-55-08]

The difference is subtle, but it's there. And not much different from what pdflatex + babel deliver, I dare say. Perhaps it's your font which has better kerning for the pair?

niluxv commented 1 year ago

> The difference is subtle, but it's there. And not much different from what pdflatex + babel deliver, I dare say. Perhaps it's your font which has better kerning for the pair?

I'm so sorry. Apparently I was still using a custom font (New Computer Modern); I thought I already disabled that. Sorry for the confusion!

> No, the \discretionary is between the J and the L. But it goes missing when the unicode symbol is used (same with babel's \ij). Anyway, as I said, this is out of our scope here. If it's wrong, it must be reported to whoever takes care of hyphenation for Dutch...

I'll file an issue with ~babel~ the dutch hyphenation file then; bij-la-ge is correct, and the unicode ij shouldn't affect it.

> Out of curiosity, is there a reason you prefer polyglossia?

Historically grown I guess. I guess today the more natural choice for LuaTeX would be babel; I just haven't switched back yet.

gusbrs commented 1 year ago

> I'm so sorry. Apparently I was still using a custom font (New Computer Modern); I thought I already disabled that. Sorry for the confusion!

No problem at all. I was really just curious. And learning about it.

> I'll file an issue with babel then; bij-la-ge is correct, and the unicode ij shouldn't affect it.

I'm not sure there's actually any problem. I don't really know whether \discretionary nodes are supposed to show up in the logs generated by \showoutput, which is somewhat mysterious to someone who was not there when the scriptures were written. Perhaps some further testing, to see if the word really doesn't hyphenate, would be advisable before going through the trouble.

> Historically grown I guess. I guess today the more natural choice for LuaTeX would be babel; I just haven't switched back yet.

"Old habits", I see. Indeed that's the advice I see. But if it's working for you, then it's all good. ;-)

gusbrs commented 1 year ago

> I'm not sure there's actually any problem. I don't really know whether \discretionary nodes are supposed to show up in the logs generated by \showoutput, which is somewhat mysterious to someone who was not there when the scriptures were written. Perhaps some further testing, to see if the word really doesn't hyphenate, would be advisable before going through the trouble.

It appears there's indeed something to it.

\documentclass{book}

\usepackage[dutch]{babel}

\begin{document}

\showhyphens{B\ij lage}

\showhyphens{Bĳlage} % U+0133

\showhyphens{Bijlage} % "i and j"

\end{document}

Produces:

Underfull \hbox (badness 10000) in paragraph at lines 7--7
[] \OT1/cmr/m/n/10 Bijlage

Underfull \hbox (badness 10000) in paragraph at lines 9--9
[] \OT1/cmr/m/n/10 Bijlage

Underfull \hbox (badness 10000) in paragraph at lines 11--11
[] \OT1/cmr/m/n/10 Bij-lage
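
A complementary check, sketched here under the assumption of a plain pdflatex + babel setup like the one above, is to typeset each spelling in a zero-width box, which forces TeX to break at every hyphenation point it allows. The overfull box warnings this produces are expected and can be ignored.

```latex
\documentclass{book}

\usepackage[dutch]{babel}

\begin{document}

% A zero-width \parbox forces a line break at every permitted
% hyphenation point; \hspace{0pt} lets the first word be hyphenated.
\parbox[t]{0pt}{\hspace{0pt}B\ij lage}\par\bigskip

\parbox[t]{0pt}{\hspace{0pt}Bĳlage}\par\bigskip % U+0133

\parbox[t]{0pt}{\hspace{0pt}Bijlage} % "i and j"

\end{document}
```

Unlike \showhyphens, this shows the breaks in the typeset output itself, so it can be compared across engines without reading the log.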

niluxv commented 1 year ago

Yeah, I had already checked the \showhyphens output.

I just don't know where to file an issue. I guess the hyphenation file is NL hyphenation, but it doesn't seem to have an issue tracker...

Edit: Huh, the Dutch babel hyphenation differs from the polyglossia one. I thought both used the same hyphenation file.

gusbrs commented 1 year ago

> I just don't know where to file an issue. I guess the hyphenation file is NL hyphenation, but it doesn't seem to have an issue tracker...

I've never needed to do this, so I don't know much either. But digging through the TL tree, I found texmf-dist/tex/generic/hyph-utf8/loadhyph/loadhyph-nl.tex and texmf-dist/tex/generic/hyph-utf8/patterns/tex/hyph-nl.tex. On the first, there's a link to https://tug.org/tex-hyphen/, which has a "tex-hyphen" mailing list. The second contains the email of Piet Tutelaers, the maintainer of hyphen-dutch. Not sure if up to date, given the last change seems to be from 2000.

> Edit: Huh, the Dutch babel hyphenation differs from the polyglossia one. I thought both used the same hyphenation file.

That's a surprise for me too, I also presumed they used the same structure there. And, is it better? If so, you just found a rationale for your "old habit". ;-)