sile-typesetter / sile

The SILE Typesetter — Simon’s Improved Layout Engine
https://sile-typesetter.org
MIT License

Double hyphens in compound words in Czech, Portuguese, etc. #1963

Closed: Omikhleia closed this issue 8 months ago

Omikhleia commented 8 months ago

As noted in #1960, it seems Czech also repeats hyphens when breaking a compound word. Some other languages might do the same, see below.

The same solution would likely apply. But I only found isolated references to this feature in TeX StackExchange discussions, so we may need more normative documents and references before generalizing such a feature...

Omikhleia commented 8 months ago

@jodros, according to your profile you are from Brazil; perhaps you can comment on these rules for Portuguese?

If this is not widespread, we may need settings... If it is locale/country/idiom dependent, we may want to proceed with #1641 so as to be able to use BCP47-qualified languages to select the proper rules by default...

Typography is hard ;)

jodros commented 8 months ago

> perhaps you can comment on these rules for Portuguese?

Right, but I still don't know how the hyphenation algorithm works, where could I start?

Omikhleia commented 8 months ago

@jodros

> Right, but I still don't know how the hyphenation algorithm works, where could I start?

As far as you know, if a line is broken at the hyphen in Portuguese "anti-inflamatório", should it yield:

case 1 (nothing fancy)

...... anti-
inflamatório

Or case 2 (repeated hyphens):

....... anti-
-inflamatório

In the second case, we'd need to generalize the solution adopted for Polish. It seems we have to do it for other languages too, but I'm trying to understand which languages are concerned and whether it's a widespread rule in these languages, so as to propose the correct generalization.
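To make the two cases concrete, here is a minimal standalone sketch in Python (not SILE code). It models a TeX-style discretionary as three strings: what is printed when the line is not broken, what ends the first line when it is, and what starts the next line. The names `Discretionary`, `PLAIN_HYPHEN`, `REPEATED_HYPHEN` and `break_compound` are invented for illustration and are assumptions, not part of any SILE API.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Discretionary:
    """Hypothetical model of a discretionary break (illustration only)."""
    replacement: str  # printed when the line is NOT broken here
    prebreak: str     # printed at the end of the first line when broken
    postbreak: str    # printed at the start of the next line when broken


# Case 1: the explicit hyphen stays at the end of the line, nothing follows.
PLAIN_HYPHEN = Discretionary(replacement="-", prebreak="-", postbreak="")
# Case 2: the hyphen is repeated at the start of the next line.
REPEATED_HYPHEN = Discretionary(replacement="-", prebreak="-", postbreak="-")


def break_compound(word: str, disc: Discretionary) -> Tuple[str, str]:
    """Break `word` at its first explicit hyphen, returning both line parts."""
    left, sep, right = word.partition("-")
    if not sep:
        raise ValueError("no explicit hyphen in %r" % word)
    return left + disc.prebreak, disc.postbreak + right


if __name__ == "__main__":
    print(break_compound("anti-inflamatório", PLAIN_HYPHEN))
    print(break_compound("anti-inflamatório", REPEATED_HYPHEN))
```

Under this model the only difference between the two cases is the `postbreak` string, which is why a per-language switch seems sufficient in principle.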

alerque commented 8 months ago

> > perhaps you can comment on these rules for Portuguese?
>
> Right, but I still don't know how the hyphenation algorithm works, where could I start?

It's not black magic, but it does have deep grey tones. :wink: It actually isn't that different from the TeX algorithm which is extensively documented in various forms.

But honestly I don't think it is that relevant. What we're looking for here (I think) is commentary on Portuguese orthography, not the algorithm that implements it.

Omikhleia commented 8 months ago

As for the general logic (simplified, but it took me a while to get a grasp of it -- a bit off-topic here, but worth trying to explain anyway):

  1. Input text is initially set in "unshaped" text nodes.
  2. Unshaped nodes are later shaped into "nnodes" (with dimensions).
  3. As part of the process, the nodes are also segmented into elementary "words":
    • In most cases via SILE.nodeMakers.unicode, which uses ICU to identify word boundaries.
    • Sometimes via a language-specific subclass of the latter, to implement additional rules (French does it for its special handling of punctuation spacing; Polish now does it too for repeated hyphens).
  4. Each word is hyphenated (so a parent word node has child segments, one for each potential hyphenation point):
    • Most of the time, using the Liang algorithm with language-specific patterns, which introduces "-" discretionary nodes between segments where hyphenation may occur. That's the most TeX-like part here.
    • Some languages (Turkish, and now Catalan too) require specific post-hyphenation logic (for context-dependent discretionary nodes).
  5. Later, at paragraphing time, this is fed to the line-breaking engine, again a TeX-like part, but that's another story.
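As a rough illustration of the segmentation and hyphenation steps in the list above, here is a toy sketch in Python. It is not SILE code: a regular expression stands in for ICU word segmentation, and a small lookup table (`TOY_BREAKS`, a made-up name) stands in for Liang patterns. All function names are assumptions for illustration only.

```python
import re
from typing import Dict, List, Tuple

# Stand-in for language-specific hyphenation patterns: a tiny lookup table
# mapping a word to its segments. Real engines derive these from patterns.
TOY_BREAKS: Dict[str, List[str]] = {
    "hyphenation": ["hy", "phen", "ation"],
    "typesetter": ["type", "set", "ter"],
}


def segment(text: str) -> List[str]:
    """Split text into word tokens and the separators between them."""
    return [tok for tok in re.split(r"(\W+)", text) if tok]


def hyphenate(word: str) -> List[str]:
    """Return the word's segments; a break is allowed between any two."""
    return TOY_BREAKS.get(word, [word])


def prepare(text: str) -> List[Tuple[str, List[str]]]:
    """Pair every token with its segments, ready for a line breaker."""
    return [(tok, hyphenate(tok)) for tok in segment(text)]


if __name__ == "__main__":
    for token, parts in prepare("a typesetter hyphenation"):
        print(repr(token), parts)
```

Note that this naive segmenter also splits "anti-inflamatório" at the explicit hyphen, which is the spot where a language-specific node maker would decide whether the hyphen is repeated.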

To recap, then:

Omikhleia commented 8 months ago

And @alerque commented while I was typing the above details....

> It actually isn't that different from the TeX algorithm

I beg to disagree (though in the good sense), SILE might be "better" here:

alerque commented 8 months ago

> I beg to disagree (though in the good sense), SILE might be "better" here:

I was also talking about completely the wrong topic. I was thinking about processing the actual hyphenation patterns and applying them to already segmented chunks. Yes, the bigger picture of where and how shaping happens is quite different, and not only more adaptable but also more robust.

alerque commented 8 months ago

Any thoughts on whether we'll know enough to add fixes for other languages soon or should I go ahead with v0.14.15 with the Catalan/Polish/Turkish/French features we have queued up already?

Omikhleia commented 8 months ago

> Any thoughts on whether we'll know enough to add fixes for other languages soon or should I go ahead with v0.14.15 with the Catalan/Polish/Turkish/French features we have queued up already?

It depends on when you want to ship 0.14.15 -- I'm willing to work on the topic, but I don't think it needs to be rushed -- after all, SILE's presence on GitHub just passed 10 years, and no one came asking for these... So I bet we can take some time to think about how to do it properly. In the same vein, #1242 (deriving from a fix where I needed to deactivate the French unicode segmenter) could possibly be addressed in a nicer way too.

In the meantime, we have a "quick workaround" if anyone urgently needs the repeating dashes in any language. Just insert the ugly hack after your first target language change, and voilà!

\begin[papersize=a6]{document}
\set[parameter=document.parindent, value=1.25em]
\nofolios
\language[main=pt]
% BEGIN DOUBLE HYPHEN WORKAROUND
\lua{
-- Switch to Polish temporarily (this loads its language support)
-- and reuse its node maker for the current main language
local current = SILE.settings:get("document.language")
SILE.settings:temporarily(function ()
  SILE.call("language", { main = "pl" })
  SILE.nodeMakers[current] = SILE.nodeMakers.pl
end)
}% END DOUBLE HYPHEN WORKAROUND

\font[size=16]
\kern[width=210pt] anti-inflamatório

\end{document}

(Checked with Portuguese, Czech and Spanish)

alerque commented 8 months ago

> It depends on when you want to ship 0.14.15 -- I'm willing to work on the topic, but I don't think it needs to be rushed -- after all, SILE's presence on GitHub just passed 10 years, and no one came asking for these.

I have an upcoming publication project that wants to use the alternate Turkish apostrophe handling and it is always much nicer to do production work in a shipped stable version of SILE. At this point the release machinery is working pretty well and it isn't too much hassle to make small patch releases with incremental improvements.

Omikhleia commented 8 months ago

Interesting feedback here https://github.com/typst/typst/issues/3235#issuecomment-1924720389 adding (lower) Sorbian and Croatian to the list, and confirming Czech and Slovak.

Sorbian is a minority language (< 50000 people) and doesn't have a 2-letter language code. Unless I'm mistaken, the 3-letter codes are hsb (Upper Sorbian), dsb (Lower Sorbian) and wen (Sorbian or "Wendish" collectively).

Omikhleia commented 8 months ago

Food for thought: I am not sure we should use settings to enable/disable such features, at the cost of checking them many times, when they wouldn't change much normally.

A possible alternative would be to encode this in the BCP47 language name, as an extension.

So far, unless I'm mistaken, BCP47 only has two official extensions, -t- (RFC6497) and -u- (RFC6067), but the "private use" -x- extension could be used here for our own purposes.

For instance:

DavidLRowe commented 8 months ago

A minor point: I believe that the string of characters after the -x- of a private-use tag is limited to 8 characters. So, for example, -x-nohyphens would be too long.
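For reference, RFC 5646 defines the private-use part of a tag as `"x" 1*("-" (1*8alphanum))`: each subtag after the `x` singleton is limited to 8 alphanumeric characters, and several subtags may follow. Below is a minimal sketch in Python of how such a tag could be split and checked. The function name `split_private_use` and the subtag `rephyph` are invented for illustration; nothing here is an existing SILE convention.

```python
import re
from typing import List, Tuple

# RFC 5646: privateuse = "x" 1*("-" (1*8alphanum))
_PRIVATE_SUBTAG = re.compile(r"[A-Za-z0-9]{1,8}")


def split_private_use(tag: str) -> Tuple[str, List[str]]:
    """Split a language tag into its public part and private-use subtags.

    Raises ValueError if a private-use subtag is empty, non-alphanumeric,
    or longer than 8 characters.
    """
    subtags = tag.split("-")
    lowered = [s.lower() for s in subtags]
    if "x" not in lowered:
        return tag, []
    start = lowered.index("x")
    private = subtags[start + 1:]
    if not private:
        raise ValueError("'x' must be followed by at least one subtag")
    for subtag in private:
        if not _PRIVATE_SUBTAG.fullmatch(subtag):
            raise ValueError("invalid private-use subtag: %r" % subtag)
    return "-".join(subtags[:start]), private


if __name__ == "__main__":
    print(split_private_use("pt-BR-x-rephyph"))  # hypothetical subtag
    try:
        split_private_use("pt-x-nohyphens")
    except ValueError as err:
        print("rejected:", err)
```

With this check, `-x-nohyphens` (9 characters) is rejected while a shorter made-up subtag such as `-x-nohyphen` would pass.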

alerque commented 8 months ago

This is not a comment on using BCP-47 private extensions because I haven't looked into that...

But yes @Omikhleia, sometimes the place where we want to use a setting is too hot a loop to actually be checking it, given that settings can change at almost any time. Something we haven't really utilized yet, but could if we need to, is callbacks: there is no reason we can't rig up settings:set() with a callback function that invalidates a cache or private variable used somewhere for efficiency purposes. The end user would not need to be any the wiser. All we would need is a registry to store the callbacks, and they could be registered from almost anywhere. Since they would be Lua functions that act as closures, they could reach into whatever private implementation was used to speed up the hot loops with a cached value while still allowing it to be changed with a setting.
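The callback idea can be sketched in a few lines. The following Python is only an illustration of the pattern (a registry of callbacks run on `set`, and a closure holding the cached value), not SILE's actual settings implementation. The class `Settings`, the method `on_change`, and the setting name `hyphen.repeat` are all hypothetical.

```python
from collections import defaultdict
from typing import Any, Callable, DefaultDict, Dict, List


class Settings:
    """Toy settings store with change callbacks (illustration only)."""

    def __init__(self) -> None:
        self._values: Dict[str, Any] = {}
        self._callbacks: DefaultDict[str, List[Callable[[Any], None]]] = defaultdict(list)

    def on_change(self, name: str, callback: Callable[[Any], None]) -> None:
        """Register a callback run every time `name` is set."""
        self._callbacks[name].append(callback)

    def set(self, name: str, value: Any) -> None:
        self._values[name] = value
        for callback in self._callbacks[name]:
            callback(value)

    def get(self, name: str, default: Any = None) -> Any:
        return self._values.get(name, default)


def make_postbreak(settings: Settings) -> Callable[[], str]:
    """Return a hot-path function that never looks the setting up itself."""
    cache = {"repeat": bool(settings.get("hyphen.repeat", False))}

    def refresh(value: Any) -> None:
        # Closure: reaches into the private cache when the setting changes.
        cache["repeat"] = bool(value)

    settings.on_change("hyphen.repeat", refresh)

    def postbreak() -> str:
        return "-" if cache["repeat"] else ""

    return postbreak


if __name__ == "__main__":
    settings = Settings()
    postbreak = make_postbreak(settings)
    print(repr(postbreak()))
    settings.set("hyphen.repeat", True)
    print(repr(postbreak()))
```

The hot path only reads a local cached value; the cost of keeping it current is paid once per `set` call instead of once per lookup.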

Omikhleia commented 8 months ago

@alerque Yes, active hooks on settings are also a possibility I had in mind. I'm always wary of such hooks/callbacks (because ordering is unclear and side effects are not always intended), but they may have to be considered.

As additional food for thought: I suspect those languages would not repeat the hyphens when breaking URLs (and thus would have to bypass it, as does the current _fr_noSpacingRules hack).

jodros commented 8 months ago

> case 1 (nothing fancy)
>
> ...... anti-
> inflamatório
>
> Or case 2 (repeated hyphens):
>
> ....... anti-
> -inflamatório

@Omikhleia I've just checked for examples in a reference grammar[^*] in the part about hyphens, and indeed all the examples testify in favor of case 2.

[^*]: Gramática da língua portuguesa padrão by Amini Hauy (Grammar of standard Portuguese)

Omikhleia commented 8 months ago

For Basque (which we support, code eu), see this orthotypography manual (p. 53) and this other, more general document (p. 47).

Both seem to contradict the repetition of hyphens (marratxoa) mentioned in some discussions (EDIT: LaTeX Babel does not mention it, my bad; other sources cited above did).

"Lerro-bukaerako marratxoa hitz-elkarketarena izanez gero, ez dago marratxo hori errepikatu beharrik hurrengo lerroaren hasieran." --> Google translated: "If the hyphen at the end of the line belongs to the combination of words, there is no need to repeat that hyphen at the beginning of the next line."

And the second document even illustrates the wrong usage (marked with an asterisk) and the correct one.

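[image: excerpt from the second document showing the incorrect usage (marked with an asterisk) and the correct one]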

--> So no for Basque, in the general case. (I did see various posts on the web from people asking how to do it, but official recommendations seem to disfavor it)