Hyphenation problems in Portuguese

jodros commented 4 months ago

One word I also noticed having trouble being hyphenated is "quando".

Yes, I know the first example isn't the best in terms of readability, but it's what I have right now, since I'm trying out the parallel package. I could give more examples for Russian soon...

jodros commented 4 months ago

Another format of the first example. This time I couldn't see the frames, because -d frames isn't working well when I use parallel...

jodros commented 4 months ago

alerque commented 4 months ago

Let's split PT/RU into different issues, because language-specific stuff doesn't always get resolved at the same time or via the same PR. Let's make this issue the PT one please.

For hyphenation issues, the first thing to check is if we even have break points to work with. Evidently not:

$ ./sile
SILE v0.14.17.r373-g72965ad (LuaJIT 2.1.1700206165) [Rust]
> SILE.showHyphenationPoints("quando", "pt")
quando
> SILE.showHyphenationPoints("apaziguam", "pt")
apa-zi-guam

So at least for "quando", for some reason the patterns are not allowing any hyphenation there. According to PT language rules, where should the points be?

The screenshots are kind of hard to work with for this because I can't tell whether other metrics (like not having any stretch available) might be contributing to poor break choices. Also, in many cases I can't even be sure I'm typing the same text you are entering. Can you post the actual XML/SIL input files you're testing too?

Omikhleia commented 4 months ago

Unless I misunderstood the screenshot, it doesn't look like a hyphenation issue, but rather a justification issue (overfull lines).

These examples have fairly short columns: have you tried loosening the justification constraints?

As in TeX, overfull lines are by default preferred over underfull lines when the constraints (on space stretching/shrinking, etc.) cannot be respected.

You can try tweaking, in order:

1. linebreak.emergencyStretch (e.g. set it to around 1em; it's a delicate setting)
2. linebreak.tolerance (defaults to 500; something around 2000 might be necessary when the width is constrained, or even up to 5000 in very short columns)

There are other settings (pretolerance, and even the space stretchability) that might be changed too, but they are more difficult (IMHO) to tweak "correctly".
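As a rough sketch, the two settings above could be loosened from Lua, for instance inside a \script block. The values are only the starting points suggested in this comment, and passing the measurement as a string assumes that SILE.settings:set casts it the same way the \set command does:

```lua
-- Meant to run inside SILE (e.g. in a \script{...} block), not as standalone Lua.
-- Values are starting points from the discussion; tune them per column width.
-- Assumption: SILE.settings:set casts "1em" to a measurement like \set does.
SILE.settings:set("linebreak.emergencyStretch", "1em")
SILE.settings:set("linebreak.tolerance", 2000)
```

Following the order given above, it is probably worth trying the emergencyStretch line alone first and adding the tolerance line only if overfull lines remain.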

If this is indeed the issue at stake, then it pops up quite regularly, e.g. see https://github.com/sile-typesetter/sile/issues/620#issuecomment-1089217814

I know the documentation mentions that we use TeX paragraph shaping and also briefly explains the settings... But perhaps we could make it clearer for casual readers (it's quite a FAQ, even in the TeX world...), especially when most office solutions nowadays prefer underfull lines (at the risk of bad paragraphing in most cases).

Note that making these settings dynamically adaptable (e.g. depending on font size and target line width) could be an interesting exercise for an experimental package, as a possible helper to minimize the occurrence of these situations. We can easily modify the typesetter to account for such dynamic approaches, which was harder in old TeX (i.e. at least before LuaTeX added hooks in many places, though I don't know how much "hackability" it would now have here).

Omikhleia commented 4 months ago

(BTW, regarding quando, Typst doesn't hyphenate it either at this point (see https://typst.app/tools/hyphenate/). That's quite logical, as it uses the same TeX hyphenation patterns as SILE, but at least it shows that the problem comes from these original patterns and is not SILE-specific.)

jodros commented 4 months ago

I ran showHyphenationPoints on some other words with the same issue, and noticed that some of them are indeed missing rules, e.g.

pri-meiro
re-cordo
to-mado
vai-dade
mal-dito

jodros commented 4 months ago

> You can try tweaking, in order:
>
> 1. linebreak.emergencyStretch (e.g. set it around 1em, it's a delicate setting)
> 2. linebreak.tolerance

I've tested this and can confirm that it sometimes solved the problem, thanks.

Omikhleia commented 4 months ago

> I ran showHyphenationPoints in some other words with the same issue, and noticed that some of them are indeed missing the rules, e.g.
>
> * pri-meiro
> * re-cordo
> * to-mado
> * vai-dade
> * mal-dito

But what should they be? SILE and Typst both use the TeX patterns, and both programs show the same hyphenation points here, don't they?

jodros commented 4 months ago

> But what should they be?

I forgot to say, they should be:

pri-mei-ro
re-cor-do
to-ma-do
vai-da-de
mal-di-to

Omikhleia commented 4 months ago

SILE is using (a Lua port of) https://github.com/hyphenation/tex-hyphen/blob/master/hyph-utf8/tex/generic/hyph-utf8/patterns/tex/hyph-pt.tex

So this is likely an issue for https://github.com/hyphenation/tex-hyphen (though it would be easier then if SILE was able to use TeX patterns directly rather than having its own error-prone re-implementation as a Lua table, or to ship with a conversion script).

Omikhleia commented 4 months ago

(This being said, one can also register exceptions manually, with \hyphenator:add-exceptions)
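For illustration, registering the break points jodros listed above as exceptions might look like this from Lua. This is only a sketch: it assumes the command takes a lang option and a space-separated list of pre-hyphenated words, and it has to run inside SILE:

```lua
-- Meant to run inside SILE (e.g. in a \script{...} block), not as standalone Lua.
-- Assumption: the command accepts a `lang` option and space-separated words
-- whose hyphens mark the allowed break points.
SILE.call("hyphenator:add-exceptions", { lang = "pt" }, {
  "pri-mei-ro re-cor-do to-ma-do vai-da-de mal-di-to"
})
```

Checking the result afterwards with SILE.showHyphenationPoints(word, "pt") is a quick way to see whether the exceptions were picked up.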

Omikhleia commented 2 months ago

Unless there's something clear to do here, I am going to suggest closing/rejecting this issue, which has been inactive for 2+ months.

* Part of it is merely due to tuning configuration options for small columns, which is doable in the existing code base (via emergencyStretch, tolerance, etc.)
* Part of it is due to existing TeX patterns as-they-are = Not Our Bug

alerque commented 1 month ago

Just throwing this out there, we are in no way limited to using the hyphenation rules from tex-hyphen as is. We can correct them locally in our vendored copy when appropriate, submit fixes upstream if needed, and even use different hyphenation code for different languages. Particularly with the Rust wrapper there are several libraries we could surface.

If something is still wrong here, I'd like to actually look into what it is (@jodros, any references to official grammar guides and/or other discussion on implementations anywhere that help confirm this is a bug?). There may always be exceptions not covered by a codifiable rule, but even in that case we can add exceptions by default if they are well known and agreed on.

jodros commented 1 month ago

> We can correct them locally in our vendored copy when appropriate, submit fixes upstream if needed

I'm glad to read this.

Well, I've just taken a look at languages/pt.lua right

\begin{document}
\language[main=pt]

\script{
  local words = { "quando", "econômico", "recordo", "tomado", "vaidade", "maldito", "fonética", "aproveitado" }

  -- Typeset each word with its hyphenation points, one word per paragraph
  for _, word in ipairs(words) do
    SILE.typesetter:typeset(SILE.showHyphenationPoints(word, "pt"))
    SILE.call("par")
  end
}

\end{document}

Which gave me:

The only rule I found missing in the file is 1nô, and after adding it I got eco-nô-mico.

Now, regarding the remaining syllables such as -do, -co, -ca and -to, it seems to me that we do indeed have a bug, because they are all declared in the list of patterns...

Omikhleia commented 1 month ago

@jodros

> it seems to me that we've indeed a bug,

Yes, and I guess I quite understand it now. It partly relates to #2017, with possibly an additional error in our implementation.

* Our code sets the left/right constraints to (2, 2).
* Note that this is also the default for Liang's algorithm in TeX, but many languages set it to (2, 3), as I reported in #2017.
* Note that the recommended settings for Portuguese are likely (2, 3): https://github.com/hyphenation/tex-hyphen/blob/ecf976ab6995acb653d38ab1af0b9b9829ec0c77/hyph-utf8/tex/generic/hyph-utf8/patterns/tex/hyph-pt.tex#L46-L49
  - So technically it would be right to hyphenate, say, "eco-no-mi-co"...
  - But the standard practice is to enforce "eco-no-mico", because a split that leaves only 2 letters after the hyphen is considered "bad typography".

Anyway, since we are using the default hard-coded (2, 2), why don't we get "eco-no-mi-co"?

Ho ho, weird indeed .... but maybe there's an issue somewhere with Lua lists being 1-based and not 0-based?

See:

SILE v0.14.17 (Lua 5.2)
> SILE.showHyphenationPoints("economico", "pt")
eco-no-mico
> 
> SILE._hyphenators["pt"].rightmin = 1
> 
> SILE.showHyphenationPoints("economico", "pt")
eco-no-mi-co
>

I think the issue is here:

https://github.com/sile-typesetter/sile/blob/91cf578a3cb0b0289e5650dd66c48f5cdccf69c0/core/hyphenator-liang.lua#L95-L101

Before applying the constraints, we have

points:  0 1 0 1 0 1 0 1 0 0 
word:     e c o n o m i c o    --> e-co-no-mi-co

After applying the leftmin

points:  0 0 0 1 0 1 0 1 0 0 
word:     e c o n o m i c o    --> eco-no-mi-co

And after applying the rightmin

points:  0 0 0 1 0 1 0 0 0 0 
word:     e c o n o m i c o    --> eco-no-mico

So we think we are using (2, 2), but we actually behave as (2, 3)... Which might be why #2017 failed to be noticed (English also being recommended at (2, 3) for standard typography...): A bug was hiding another.

I think the code should be:

for i = #points-self.rightmin+1, #points do points[i] = 0 end

But then I no longer understand the root problem that prompted me to open #2017. I'll re-investigate it... There might be more than meets the eye here...
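The off-by-one can be reproduced outside SILE with a few lines of plain Lua. This is a standalone sketch, not the code from hyphenator-liang.lua: apply_limits and render are hypothetical helpers, the points table is copied from the illustration above, and the "current" loop bound (#points - rightmin) is inferred from the behaviour described in this comment:

```lua
-- points[i] == 1 means a break is allowed just before letter i of the word.
local raw = { 0, 1, 0, 1, 0, 1, 0, 1, 0, 0 } -- "economico" before any limits

-- Hypothetical helper: returns a copy of points with leftmin/rightmin applied.
-- fixed == false mimics the bound inferred for the current code,
-- fixed == true uses the bound proposed above (#points - rightmin + 1).
local function apply_limits(points, leftmin, rightmin, fixed)
  local out = {}
  for i = 1, #points do out[i] = points[i] end
  for i = 1, leftmin do out[i] = 0 end
  local first = #out - rightmin + (fixed and 1 or 0)
  for i = first, #out do out[i] = 0 end
  return out
end

-- Hypothetical helper: inserts "-" at every allowed break (ASCII words only).
local function render(word, points)
  local parts = {}
  for i = 1, #word do
    if points[i] == 1 then parts[#parts + 1] = "-" end
    parts[#parts + 1] = word:sub(i, i)
  end
  return table.concat(parts)
end

print(render("economico", raw))                            -- e-co-no-mi-co
print(render("economico", apply_limits(raw, 2, 2, false))) -- eco-no-mico
print(render("economico", apply_limits(raw, 2, 2, true)))  -- eco-no-mi-co
```

Under these assumptions the inferred current bound with (2, 2) gives the same result as the proposed bound would with (2, 3), which matches the "we actually behave as (2, 3)" observation.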

Any thoughts and insights?[^1]

[^1]: Besides the fact that not so long ago, SILE didn't know how to properly justify lines. It does know, it seems, how to properly perform hyphenation. It doesn't know how to properly break pages. Erm. :pig:

Omikhleia commented 1 month ago

@jodros

> The only rule I found missing in the file is "1nô"

By the way, it's not missing, unless I am mistaken: it's just that our current hyphenation patterns (coming from TeX) were likely crafted for Portuguese from Portugal, and all dictionaries seem to have "económico"...

But according to some online resources, "econômico" is from Brazil (grafia no Brasil). It would be interesting to confirm. And if so, I still think it would be a good question for https://github.com/hyphenation/tex-hyphen ... Because even if "we are in no way limited to using the hyphenation rules from tex-hyphen as is", the general solution here would be to support BCP47 and possibly have different hyphenation patterns for different language variants. Admittedly, here it is quite possible that introducing this "1nô" in standard Portuguese wouldn't harm it much (I don't know!), but the general picture is that some specificities might need different patterns.[^1]

[^1]: And BCP47 was discussed long ago :pig: Until it happens, SILE doesn't know well how to handle language codes and scripts... TeX has different patterns for German 1901 orthography and 1996 revised orthography (de-1901, de-1996), patterns for Serbian in Latin or Cyrillic, etc.

Omikhleia commented 1 month ago

So let's recap as the issue got long with several things:

* Part of it is merely due to tuning configuration options for small columns, which is doable in the existing code base (via emergencyStretch, tolerance, etc.)
* Part of it is due to existing TeX patterns as-they-are = Not Our Bug
* Part of it is due to bugs in our Liang hyphenation implementation, relating to #2017 but overshadowing another likely bug in the handling of the hyphenation rightmin :lady_beetle:
* Part of it is due to differences between "pt" (canonical Portuguese from Portugal) and "pt-BR" (Portuguese from Brazil)

jodros commented 1 month ago

> Ho ho, weird indeed .... but maybe there's an issue somewhere with Lua lists being 1-based and not 0-based?

Interesting note.

> according to some online resources, "econômico" is from Brazil

Yes, that's the Brazilian spelling. Maybe there are even other minor differences to be found...

> Part of it is due to bugs in our Liang hyphenation implementation, relating to https://github.com/sile-typesetter/sile/issues/2017 but overshadowing another likely bug in the handling of the hyphenation rightmin 🐞

Since most of the issues I had were solved by changing linebreak.emergencyStretch, this is the only remaining point to take care of now.

Omikhleia commented 3 weeks ago

> Yes, that's the Brazilian spelling. Maybe there are even other minor differences to be found...

Noted: https://github.com/hyphenation/tex-hyphen/issues/61

Omikhleia commented 3 weeks ago

> Maybe there are even other minor differences to be found...

Likely: I came across "antónimo" vs. "antônimo" in a translation file.

jodros commented 3 weeks ago

> Likely: I came across "antónimo" vs. "antônimo" in a translation file.

I'm gonna make a list with all major differences soon...

alerque commented 1 week ago

As I understand it, everything this issue needs to track is taken care of, except perhaps documentation on all the things that can be done to cope with narrow text widths as gracefully as possible. Let's open an issue specific to that.

sile-typesetter / sile

Hyphenation problems in Portuguese #2001