Closed by jodros 1 week ago
Another format of the first example. This time I couldn't see the frames because `-d frames` isn't working well when I use `parallel`.
...
Let's split PT/RU into different issues, because language-specific problems don't always get resolved at the same time or via the same PR. Let's make this issue the PT one, please.
For hyphenation issues, the first thing to check is if we even have break points to work with. Evidently not:
$ ./sile
SILE v0.14.17.r373-g72965ad (LuaJIT 2.1.1700206165) [Rust]
> SILE.showHyphenationPoints("quando", "pt")
quando
> SILE.showHyphenationPoints("apaziguam", "pt")
apa-zi-guam
So at least for "quando", for some reason the patterns are not allowing any hyphenation there. According to PT language rules, where should the points be?
The screenshots are hard to work with for this, because I can't tell whether other metrics (like not having any stretch available) might be contributing to poor break choices. In many cases I can't even be sure I'm typing the same text you are entering. Can you also post the actual XML/SIL input files you're testing?
Unless I misunderstood the screenshot, it doesn't look like a hyphenation issue, but rather a justification issue (overfull lines).
These examples have fairly short columns: have you tried loosening the justification constraints?
As in TeX, by default, overfull lines are preferred over underfull lines when the constraints cannot be respected (on space stretching/shrinking, etc.).
You can try tweaking, in order:
* linebreak.emergencyStretch (e.g. set it around 1em, it's a delicate setting)
* linebreak.tolerance (defaults to 500, something around 2000 might be necessary when the width is constrained, or even up to 5000 in very short columns)

There are other settings (pretolerance, and even the space stretchability) that might be changed too, but they are more difficult (IMHO) to tweak "correctly".
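For reference, here is a minimal sketch of how the two settings above could be changed from Lua. It is a fragment meant to run inside SILE (for instance in a `\script` block), not in a standalone interpreter. It assumes the standard `set` command can be invoked through `SILE.call` with `parameter` and `value` options, and the values are only the starting points suggested above, not recommendations for every layout.

```lua
-- Fragment for use inside SILE, e.g. within \script{...}.
-- Assumption: the "set" command accepts the same parameter/value
-- options from Lua as it does from markup.

-- Allow some extra stretch as a last resort (delicate: start small).
SILE.call("set", { parameter = "linebreak.emergencyStretch", value = "1em" })

-- Loosen the tolerance; the default is 500, narrow columns may need more.
SILE.call("set", { parameter = "linebreak.tolerance", value = "2000" })
```

Trying the settings in this order keeps the change as small as possible: raise the emergency stretch first, and only loosen the tolerance if overfull lines remain.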
If this is indeed the issue at stake, then it pops up quite regularly, e.g. see https://github.com/sile-typesetter/sile/issues/620#issuecomment-1089217814
I know the documentation mentions that we use the TeX paragraph shaping and also briefly explains the settings... But perhaps we could make it clearer for casual readers (that's quite a FAQ, even in the TeX world...) -- especially when most office solutions nowadays prefer underfull lines (at the risk of bad paragraphing in most cases).
Note that making these settings dynamically adaptable (e.g. depending on font size and target line width) could be an interesting exercise for an experimental package, as a possible helper to minimize the occurrence of these situations. We can easily modify the typesetter to account for such dynamic approaches, which was harder in old TeX (i.e. at least before LuaTeX added hooks in many places, though I don't know how much "hackability" it would now have here).
(BTW, regarding quando, Typst doesn't hyphenate it either (see https://typst.app/tools/hyphenate/) at this point. That's quite logical, as it uses the same TeX hyphenation patterns as SILE -- but at least it shows that the problem comes from these original patterns, and is not a SILE-specific issue.)
> You can try tweaking, in order:
> * linebreak.emergencyStretch (e.g. set it around 1em, it's a delicate setting)
> * linebreak.tolerance
I've tested this and can confirm that it sometimes solves the problem, thanks.
I ran `showHyphenationPoints` on some other words with the same issue, and noticed that some of them are indeed missing the rules, e.g.
* pri-meiro
* re-cordo
* to-mado
* vai-dade
* mal-dito
But what should they be? SILE and Typst both use the TeX patterns, and both programs show the same hyphenation points here, don't they?
> But what should they be?

I forgot to say, they should be:
SILE is using (a Lua port of) https://github.com/hyphenation/tex-hyphen/blob/master/hyph-utf8/tex/generic/hyph-utf8/patterns/tex/hyph-pt.tex
So this is likely an issue for https://github.com/hyphenation/tex-hyphen (though it would be easier if SILE were able to use TeX patterns directly rather than having its own error-prone re-implementation as a Lua table, or if it shipped with a conversion script).
(That said, one can also register exceptions manually, with `\hyphenator:add-exceptions`.)
Unless there's something clear to do here, I am going to suggest closing/rejecting this issue, which has been inactive for 2+ months.
Just throwing this out there, we are in no way limited to using the hyphenation rules from tex-hyphen as is. We can correct them locally in our vendored copy when appropriate, submit fixes upstream if needed, and even use different hyphenation code for different languages. Particularly with the Rust wrapper there are several libraries we could surface.
If something is still wrong here, I'd like to actually look into what it is (@jodros, do you have any references to official grammar guides and/or other discussions of implementations that would help confirm this is a bug?). There may always be exceptions not covered by a codifiable rule, but even in that case we can add exceptions by default if they are well known and agreed on.
> We can correct them locally in our vendored copy when appropriate, submit fixes upstream if needed
I'm glad to read this.
Well, I've just taken a look at `languages/pt.lua`, and tried this:
\begin{document}
\language[main=pt]
\script{
  local words = { "quando", "econômico", "recordo", "tomado", "vaidade", "maldito", "fonética", "aproveitado" }
  for _, word in ipairs(words) do
    SILE.typesetter:typeset(SILE.showHyphenationPoints(word, "pt"))
    SILE.call("par")
  end
}
\end{document}
Which gave me:
The only rule I found missing in the file is `1nô`, and after adding it I got `eco-nô-mico`.
Now, regarding the remaining syllables such as `-do`, `-co`, `-ca`, `-to`, it seems to me that we do indeed have a bug, because they are all declared in the list of patterns...
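To make the role of a single pattern such as `1nô` concrete, here is a minimal, standalone sketch of Liang-style pattern matching in plain Lua. It is not SILE's implementation: the function names are hypothetical, the pattern list is a tiny illustrative sample rather than the real content of the Portuguese pattern file, strings are handled byte by byte, and no leftmin/rightmin limits are applied.

```lua
-- Split a Liang pattern such as "1no" into its letters and the digit
-- weights attached to the gaps between them.
local function parsePattern(pattern)
  local letters, weights, count = {}, {}, 0
  for ch in pattern:gmatch(".") do
    if ch:match("%d") then
      weights[count] = tonumber(ch) -- gap after `count` letters
    else
      count = count + 1
      letters[count] = ch
    end
  end
  return table.concat(letters), weights, count
end

-- Apply patterns to a word and mark every gap whose highest weight is odd.
local function hyphenate(word, patterns)
  local dotted = "." .. word .. "." -- dots mark the word boundaries
  local gaps = {}
  for k = 0, #dotted do gaps[k] = 0 end
  for _, pattern in ipairs(patterns) do
    local letters, weights, len = parsePattern(pattern)
    local start = 1
    while true do
      local s = dotted:find(letters, start, true) -- plain-text search
      if not s then break end
      for offset = 0, len do
        local w = weights[offset]
        local k = s - 1 + offset
        if w and w > gaps[k] then gaps[k] = w end
      end
      start = s + 1
    end
  end
  local out = {}
  for i = 1, #word do
    -- gaps[i] is the gap just before byte i of the word
    if i > 1 and gaps[i] % 2 == 1 then out[#out + 1] = "-" end
    out[#out + 1] = word:sub(i, i)
  end
  return table.concat(out)
end

local sample = { "1co", "1no", "1mi" } -- illustrative patterns only
print(hyphenate("econômico", sample)) --> e-conô-mi-co
sample[#sample + 1] = "1nô"
print(hyphenate("econômico", sample)) --> e-co-nô-mi-co
```

With only `1no` in the list, nothing matches before the accented syllable, which mirrors the observation that `econômico` needs its own `1nô` pattern. The differences from the output reported here (`eco-nô-mico`) come from the leftmin/rightmin limits that this sketch deliberately leaves out.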
@jodros
> it seems to me that we've indeed a bug,

Yes, and I think I mostly understand it now. It partly relates to #2017, with possibly an additional error in our implementation.
Anyway, since we are using the default hard-coded (2, 2), why indeed don't we get "eco-no-mi-co"?
Ho ho, weird indeed .... but maybe there's an issue somewhere with Lua lists being 1-based and not 0-based?
See:
SILE v0.14.17 (Lua 5.2)
> SILE.showHyphenationPoints("economico", "pt")
eco-no-mico
>
> SILE._hyphenators["pt"].rightmin = 1
>
> SILE.showHyphenationPoints("economico", "pt")
eco-no-mi-co
>
I think the issue is here:
Before applying the constraints, we have:
points: 0 1 0 1 0 1 0 1 0 0
word:    e c o n o m i c o    --> e-co-no-mi-co
After applying the leftmin:
points: 0 0 0 1 0 1 0 1 0 0
word:    e c o n o m i c o    --> eco-no-mi-co
And after applying the rightmin:
points: 0 0 0 1 0 1 0 0 0 0
word:    e c o n o m i c o    --> eco-no-mico
So we think we are using (2, 2), but we actually behave as (2, 3)... Which might be why #2017 failed to be noticed (English also being recommended at (2, 3) for standard typography...): A bug was hiding another.
I think the code should be:
for i = #points-self.rightmin+1, #points do points[i] = 0 end
But then I no longer understand the root problem that prompted me to open #2017; I'll re-investigate it... There might be more than meets the eye here...
Any thoughts and insights?[^1]
[^1]: Besides the fact that not so long ago, SILE didn't know how to properly justify lines. It doesn't know, it seems, how to properly perform hyphenation. It doesn't know how to properly break pages. Erm. :pig:
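The off-by-one suspected in this comment can be reproduced in a few lines of standalone Lua. This is only a sketch: `clamp` and `render` are hypothetical helpers, not SILE's code, and the `offByOne` flag simply assumes that the current loop starts one index too early, which is what the points arrays above suggest.

```lua
-- points[i] == 1 means "a break is allowed just before letter i";
-- the list has one more entry than the word has letters.
local function clamp(points, leftmin, rightmin, offByOne)
  local out = {}
  for i = 1, #points do out[i] = points[i] end
  -- Forbid breaks that would leave fewer than leftmin letters before them.
  for i = 1, math.min(leftmin, #out) do out[i] = 0 end
  -- Forbid breaks that would leave fewer than rightmin letters after them.
  local first = #out - rightmin + 1
  if offByOne then first = first - 1 end -- assumed buggy variant
  for i = math.max(first, 1), #out do out[i] = 0 end
  return out
end

-- Render an ASCII word with hyphens at the allowed break points.
local function render(word, points)
  local parts = {}
  for i = 1, #word do
    if points[i] == 1 then parts[#parts + 1] = "-" end
    parts[#parts + 1] = word:sub(i, i)
  end
  return table.concat(parts)
end

local raw = { 0, 1, 0, 1, 0, 1, 0, 1, 0, 0 } -- "economico" before clamping
print(render("economico", raw))                     --> e-co-no-mi-co
print(render("economico", clamp(raw, 2, 2, true)))  --> eco-no-mico
print(render("economico", clamp(raw, 2, 2, false))) --> eco-no-mi-co
```

With the corrected start index, (2, 2) keeps the final `mi-co` break, whereas the off-by-one variant behaves like (2, 3), as described above.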
@jodros
> The only rule I found missing in the file is "1nô"
By the way, it's not missing, unless I am mistaken: it's just that our current hyphenation patterns (coming from TeX) were likely crafted based on Portuguese from Portugal, and all dictionaries seem to have "económico"...
But according to some online resources, "econômico" is from Brazil (grafia no Brasil). It could be interesting to confirm. And if so, I still think it would be a good question for https://github.com/hyphenation/tex-hyphen ... Because even if "we are in no way limited to using the hyphenation rules from tex-hyphen as is", the general solution here would be to support BCP47 and possibly have different hyphenation patterns for different language variants. Admittedly, it is quite possible that introducing this "1nô" into standard Portuguese wouldn't harm it much (I don't know!), but the general picture is that some language variants might need different patterns.[^1]
[^1]: And BCP47 was discussed long ago :pig: Until it happens, SILE doesn't know well how to handle language codes and scripts... TeX has different patterns for German 1901 orthography and 1996 revised orthography (de-1901, de-1996), patterns for Serbian in Latin or Cyrillic, etc.
So let's recap, as the issue got long and covers several things:
> Ho ho, weird indeed .... but maybe there's an issue somewhere with Lua lists being 1-based and not 0-based?
Interesting note.
> according to some online resources, "econômico" is from Brazil
Yes, that's the Brazilian spelling. Maybe there are even other minor differences to be found...
Part of it is due to bugs in our Liang hyphenation implementation, relating to https://github.com/sile-typesetter/sile/issues/2017 but overshadowing another likely bug in the handling of the hyphenation rightmin 🐞
Since most of the issues I had were solved by changing `linebreak.emergencyStretch`, this is the only remaining point to take care of now.
> Maybe there are even other minor differences to be found...
Likely: I came across "antónimo" vs. "antônimo" in a translation file.
> Likely: I came across "antónimo" vs. "antônimo" in a translation file.
I'm gonna make a list with all major differences soon...
As I understand it, everything this issue needs to track is taken care of, except perhaps documentation on all the things that can be done to cope with narrow text width as gracefully as possible. Let's open an issue specific to that.
One word I noticed that also has some trouble being hyphenated is `quando`.
Yes, I know the first example isn't the best in terms of readability, but it's what I have right now, since I'm trying the `parallel` package for now. I could give more examples for Russian soon...