sile-typesetter / sile

The SILE Typesetter — Simon’s Improved Layout Engine
https://sile-typesetter.org
MIT License

RTL typesetting broken when pagebreak happens #192

Closed deepakjois closed 1 year ago

deepakjois commented 8 years ago

Here is some code to typeset two Urdu poems. The first Urdu poem consists of 6 stanzas of 4 lines each. In order to replicate the problem, the last 3 stanzas are exactly the same. The second Urdu poem is short and just spans a single page.

\begin[papersize=a5,direction=RTL]{document}
\set[parameter=document.parindent,value=0pt]
\nofolios
\font[family=Scheherazade,style=Bold,language=urd,size=36pt]
\center{لکیریں}
\medskip
\set[parameter=document.baselineskip,value=0.75em]
\set[parameter=document.parskip,value=0.5em]
\font[family=Amiri,style=Regular,language=urd,size=14pt]
\begin{raggedright}
لکیریں کھینچیں ہیں کاغز پہ مگر\break
کہانی کویٔ بنتی نہیں\break 
کصّے تمام یہں میرے پاس مگر\break 
حقیقت کویٔ ملتی نہیں

کھڈے ہوۓ ہیں اس موڑ پر \break
کھو جانے کے لۓ تیّار\break 
نقشے سارے جلا دۓ ہیں لیکن\break 
گُمراہی ملتی نہیں

کب سے ہیں اِسی کوشش میں اب\break 
کہ نیند ہی بچاےگی\break 
سپنوں میں بھی نضارے لیکن\break 
سنہرے ملتے نہیں

کب سے ہیں اِسی کوشش میں اب\break 
کہ نیند ہی بچاےگی\break 
سپنوں میں بھی نضارے لیکن\break 
سنہرے ملتے نہیں

کب سے ہیں اِسی کوشش میں اب\break 
کہ نیند ہی بچاےگی\break 
سپنوں میں بھی نضارے لیکن\break 
سنہرے ملتے نہیں

کب سے ہیں اِسی کوشش میں اب\break 
کہ نیند ہی بچاےگی\break 
سپنوں میں بھی نضارے لیکن\break 
سنہرے ملتے نہیں
\eject\par
\font[family=Scheherazade,style=Bold,language=urd,size=36pt]
\center{سفر}
\medskip
\font[family=Amiri,style=Regular,language=urd,size=14pt]
سفر میں خود ہی گھایل ہو گیا ہوں\break 
میں اپنا راستہ روکے کھڑا ہوں

میرا گھر دور ھوتا جا رہا ہے\break
میں پیچھے کی ترف چلتا رہا ہوں

میری منزل میرے اندر ھو شاید\break 
جسے خالِدؔ میں باہر ڈھونڈھتا ہوں
\end{raggedright}
\end{document}

Output PDF here: https://www.dropbox.com/s/1prkct10yfmx9zb/urdu.pdf?dl=0

Here is the screenshot of the buggy portion at the beginning of page 2:

screenshot 2015-11-10 22 41 25

If you notice, the first stanza on page 2 is typeset really weirdly. It should basically look the same as the stanza after and before it.

This seems to happen only when an automatic page break occurs. When I force a page break with \eject\par as shown in the example above, this problem does not manifest itself.

alerque commented 8 years ago

Wow, that's special. It looks like something in the BIDI chain is interrupted and not started again from the right place. Thanks for reporting this.

deepakjois commented 8 years ago

There are some clues to the problem in that this happens only when page breaks are calculated automatically. Looks like some (already shaped) nodes are being re-reordered.

deepakjois commented 8 years ago

I turned off RTL (i.e. removed direction=RTL option) to narrow down the issue.

Then, looking at the debug output just after and before the page break sheds some light on the problem. It seems some nodes are ‘pushed back’, and when they are ‘boxed up’ again after the page break, they have already been reordered. But the second ‘boxup’ re-reorders them and leads to the garbled glyphs in the output.

[pagebuilder]   No page break here
[typesetter]    Leaving hmode
[typesetter]    Boxed up H<0pt>^0-0vG<0pt>N<21.6015625pt>^10.5615234375-4.0537109375v(کب)G<4.1015625pt>N<16.2353515625pt>^5.7490234375-0.3212890625v(سے)G<4.1015625pt>N<16.9873046875pt>^5.76953125-3.212890625v(ہیں)G<4.1015625pt>N<18.1015625pt>^9.98046875-7.068359375v(اِسی)G<4.1015625pt>N<37.1328125pt>^10.5615234375-4.935546875v(کوشش)G<4.1015625pt>N<16.10546875pt>^4.45703125-3.212890625v(میں)G<4.1015625pt>N<16.0166015625pt>^9.98046875-3.7119140625v(اب)P(0|-10000)G<4.1015625pt>N<14.1572265625pt>^12.919921875-1.50390625v(کہ)G<4.1015625pt>N<15.845703125pt>^7.13671875-3.349609375v(نیند)G<4.1015625pt>N<14.4033203125pt>^7.79296875-3.076171875v(ہی)G<4.1015625pt>N<42.19140625pt>^14.9638671875-4.5048828125v(بچاےگی)P(0|-10000)G<4.1015625pt>N<27.6513671875pt>^6.822265625-4.7783203125v(سپنوں)G<4.1015625pt>N<16.10546875pt>^4.45703125-3.212890625v(میں)G<4.1015625pt>N<16.337890625pt>^4.9765625-4.83984375v(بھی)G<4.1015625pt>N<37.830078125pt>^10.3291015625-3.25390625v(نضارے)G<4.1015625pt>N<22.736328125pt>^11.3955078125-4.7509765625v(لیکن)P(0|-10000)G<4.1015625pt>N<35.423828125pt>^7.02734375-4.0263671875v(سنہرے)G<4.1015625pt>N<15.462890625pt>^9.775390625-3.2607421875v(ملتے)G<4.1015625pt>N<18.1015625pt>^7.02734375-4.0263671875v(نہیں)G<0pt>P(1|-10000)
[typesetter]       Considering leading between self two lines
[typesetter]       Depth of previous line was 4.0263671875
[typesetter]       Leading height = 27pt - 10.5615234375 - 4.0263671875 = 12.412109375
[typesetter]    adding penalty of 3000 after VB<10.5615234375|VB[hbox کب سے ہیں اِسی کوشش میں اب(!) hbox]v7.068359375)
[typesetter]       Considering leading between self two lines
[typesetter]       Depth of previous line was 7.068359375
[typesetter]       Leading height = 27pt - 14.9638671875 - 7.068359375 = 4.9677734375
[typesetter]       Considering leading between self two lines
[typesetter]       Depth of previous line was 4.5048828125
[typesetter]       Leading height = 27pt - 11.3955078125 - 4.5048828125 = 11.099609375
[typesetter]    adding penalty of 3000 after VB<11.3955078125|VB[ سپنوں میں بھی نضارے لیکن(!) hbox]v4.83984375)
[typesetter]       Considering leading between self two lines
[typesetter]       Depth of previous line was 4.83984375
[typesetter]       Leading height = 27pt - 9.775390625 - 4.83984375 = 12.384765625
[pagebuilder]   Page builder for frame content called with 57 nodes, 505.98425745
[pagebuilder]   Dealing with VBox VG<12.384765625pt>
[pagebuilder]   I have 3.193241825pts left
[pagebuilder]   Dealing with VBox VB<9.775390625|VB[ نہیں ملتے سنہرے (!) hbox]v4.0263671875)
[pagebuilder]   I have -10.6085159875pts left
[pagebuilder]   Dealing with VBox VG<18pt>
[pagebuilder]   I have -28.6085159875pts left
[pagebuilder]   Badness: 1073741823
[pagebuilder]   outputting
[pagebuilder]   Glues for self page adjusted by -9.5361719958333
[pagebuilder]   OUTPUTTING frame content
[1] [typesetter]        Pushing back 10 nodes
[typesetter]    Leaving hmode
[typesetter]    Boxed up H<0pt>^0-0vG<0pt>N<16.0166015625pt>^9.98046875-3.7119140625v(اب)G<4.1015625pt>N<16.10546875pt>^4.45703125-3.212890625v(میں)G<4.1015625pt>N<37.1328125pt>^10.5615234375-4.935546875v(کوشش)G<4.1015625pt>N<18.1015625pt>^9.98046875-7.068359375v(اِسی)G<4.1015625pt>N<16.9873046875pt>^5.76953125-3.212890625v(ہیں)G<4.1015625pt>N<16.2353515625pt>^5.7490234375-0.3212890625v(سے)G<4.1015625pt>N<21.6015625pt>^10.5615234375-4.0537109375v(کب)P(0|-10000)H<0pt>^0-0vG<4.1015625pt>N<42.19140625pt>^14.9638671875-4.5048828125v(بچاےگی)G<4.1015625pt>N<14.4033203125pt>^7.79296875-3.076171875v(ہی)G<4.1015625pt>N<15.845703125pt>^7.13671875-3.349609375v(نیند)G<4.1015625pt>N<14.1572265625pt>^12.919921875-1.50390625v(کہ)P(0|-10000)H<0pt>^0-0vG<4.1015625pt>N<22.736328125pt>^11.3955078125-4.7509765625v(لیکن)G<4.1015625pt>N<37.830078125pt>^10.3291015625-3.25390625v(نضارے)G<4.1015625pt>N<16.337890625pt>^4.9765625-4.83984375v(بھی)G<4.1015625pt>N<16.10546875pt>^4.45703125-3.212890625v(میں)G<4.1015625pt>N<27.6513671875pt>^6.822265625-4.7783203125v(سپنوں)P(0|-10000)H<0pt>^0-0vG<4.1015625pt>N<18.1015625pt>^7.02734375-4.0263671875v(نہیں)G<4.1015625pt>N<15.462890625pt>^9.775390625-3.2607421875v(ملتے)G<4.1015625pt>N<35.423828125pt>^7.02734375-4.0263671875v(سنہرے)G<0pt>P(1|-10000)
[typesetter]    adding penalty of 3000 after VB<10.5615234375|VB[hbox اب میں کوشش اِسی ہیں سے کب(!) hbox]v7.068359375)
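The second `Boxed up` entry above lists the words of each line in the reverse of the order seen in the first entry, which is consistent with nodes being pushed back after they were already reordered. The effect can be sketched with a toy Python model. This is not SILE code: `reorder_rtl_line` is a hypothetical stand-in that just reverses word order on a pure-RTL line, and `w1` to `w7` are placeholder words.

```python
def reorder_rtl_line(words):
    """Toy stand-in for bidi reordering of a pure-RTL line: reverse the word order."""
    return list(reversed(words))


# Words of one line in logical (reading) order; placeholders, not real text.
logical = ["w1", "w2", "w3", "w4", "w5", "w6", "w7"]

# First boxup: the line is reordered once for display.
visual = reorder_rtl_line(logical)

# Page break: the already-reordered nodes are pushed back unchanged.
pushed_back = visual

# Second boxup: the same reordering runs again on already-reordered input.
visual_again = reorder_rtl_line(pushed_back)

print(visual)        # order after one pass
print(visual_again)  # order after two passes
```

In this model the second pass flips the line back to its input order, so a line that crossed the page break is emitted in a different order from a line that was boxed up only once.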

simoncozens commented 8 years ago

As a test, please add this code to your .sil file:

\begin{script}
-- Diagnostic override, not a fix: skip unboxing on pushback and only run the "newframe" hooks.
SILE.typesetter.pushBack = function (self)
  self:runHooks("newframe")
end
\end{script}

It's not a fix but it will tell me if the problem is with the pushBack routine.

deepakjois commented 8 years ago

On adding the code as you asked, SILE seems to render the document as I expect it to. The garbling of glyphs as shown above does not happen anymore.

What information are you looking for, exactly? Should I upload the PDF? It would also be nice if you could explain why the code above stops the glyphs from being garbled.

simoncozens commented 8 years ago

That's great, thanks.

What is happening is that as SILE builds a page, it turns characters into glyphs, into lines, into pages. But when it determines where a page break should occur in a stream of lines, it may have some lines ("vboxes" in SILE) left over which don't fit on the current page. These are already assembled into vboxes and SILE could just dump them on the next page.

The problem is that the next page is not guaranteed to have exactly the same frame size as the current page. So all that material that was assembled into those lines needs to come out of a vbox again and all the characters within the vbox need to go back onto the queue, so that the line builder can have another look at them with the line size of the next page's frame.

The code that I gave you above says "don't unbox the vboxes, just shove them on the output queue as they are". It's implicitly a guarantee that the next page's frame will have the same size as the current page's frame. That's not a good general-case solution, but it shows me that the problem that you are seeing is happening when SILE unboxes a vbox and sends all that material back to the line builder for reprocessing. (What is probably happening is that when it gets to the line builder, the line builder runs the BIDI algorithm a second time.)
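A minimal Python sketch of the pushback idea described above, under simplifying assumptions: words are `(text, width)` pairs, `break_lines` is a hypothetical greedy line breaker, and `push_back` flattens leftover lines back into a word queue. None of these names are SILE's API. It shows why leftover lines are re-broken when the next frame has a different width.

```python
def break_lines(words, frame_width, space=1):
    """Greedily pack (text, width) words into lines no wider than frame_width."""
    lines, current, used = [], [], 0
    for text, width in words:
        extra = width if not current else width + space
        if current and used + extra > frame_width:
            # The word does not fit: close the current line and start a new one.
            lines.append(current)
            current, used = [], 0
            extra = width
        current.append((text, width))
        used += extra
    if current:
        lines.append(current)
    return lines


def push_back(leftover_lines):
    """Unbox leftover lines into a flat word queue for the line builder."""
    return [word for line in leftover_lines for word in line]


def texts(lines):
    """Strip widths so the line contents are easy to read."""
    return [[text for text, _ in line] for line in lines]


words = [("aa", 2), ("bbb", 3), ("cc", 2), ("dddd", 4), ("e", 1), ("ff", 2)]

# Lines are built for the current frame, which is 6 units wide.
page1_lines = break_lines(words, frame_width=6)

# Suppose only two lines fit on the page; the rest is left over.
leftover = page1_lines[2:]

# The next frame is 9 units wide, so the leftover words are re-broken.
rebroken = break_lines(push_back(leftover), frame_width=9)

print(texts(leftover))  # lines as built for the old frame
print(texts(rebroken))  # lines rebuilt for the new frame
```

Reusing the leftover lines as they are would keep the old line breaks; unboxing and re-breaking lets the same words fill the wider frame.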

deepakjois commented 8 years ago

Thanks for the explanation. It’s definitely tricky. Conceptually, the best thing to do is to throw away the vboxes, time-travel back to the point in the code where they started getting generated, and just reprocess the code itself. But I sense that might be very difficult to do as well.

simoncozens commented 8 years ago

OK, I think I have fixed this although I am not terribly proud of it.

deepakjois commented 8 years ago

Yes, it works well, but that piece of code definitely does look like technical debt :)

FWIW, I was doing some reading about this text direction stuff. There are some interesting things in Chapter 3 of the LuaTeX reference manual about the data structures it uses for nodes. It might be useful to refer to it if you get around to refactoring the nodes. I also found some slides here useful for understanding LuaTeX’s pardir and textdir primitives.

khaledhosny commented 8 years ago

Except that LuaTeX’s (actually Omega’s) text direction model has never been fully exercised, so it looks nice on paper until you actually try it. If one is looking for inspiration, there are tens of applications that handle text direction (and are actually used by users) to draw on.

simoncozens commented 8 years ago

Rethinking it this morning, I think the right thing to do is to replace nodes which are being pushed back with the original unshaped nodes they came from. This will cause the bidi process to be run again in full. I think we keep those original unshaped nodes around as an attribute on the nodes in the output queue, so it should be possible to reconstruct the original sequence of nodes.
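A rough Python sketch of this idea, assuming each shaped node keeps a reference to the unshaped node it came from. The `Unshaped` and `Shaped` classes, the `origin` field, and the helper functions are hypothetical names for illustration, not SILE's node types.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unshaped:
    """Original text run, before shaping and bidi reordering."""
    text: str


@dataclass(frozen=True)
class Shaped:
    """One shaped word that remembers the unshaped node it came from."""
    glyphs: str
    origin: Unshaped  # hypothetical back-reference to the source node


def shape(node):
    """Toy shaper: one shaped node per whitespace-separated word."""
    return [Shaped(word, node) for word in node.text.split()]


def originals_for(pushed_back):
    """Collapse pushed-back shaped nodes into their source nodes, in order."""
    result = []
    for node in pushed_back:
        if not result or result[-1] is not node.origin:
            result.append(node.origin)
    return result


para1 = Unshaped("one two three")
para2 = Unshaped("four five")
shaped = shape(para1) + shape(para2)

# Suppose the page break falls after the first word of para1.
pushed_back = shaped[1:]
restored = originals_for(pushed_back)

print([node.text for node in restored])
```

In this sketch the first paragraph comes back whole even though only part of it was pushed back, because the unshaped node has no notion of how much of its text was already placed.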

deepakjois commented 8 years ago

@khaledhosny Could you give some examples, please? I am definitely no expert on text direction models, so looking to learn more :)

The data structures in chapter 3 of the LuaTeX reference manual look pretty well thought out though. So it is worth taking inspiration from. In fact, SILE seems to implement something very similar already, going by the code in core/nodefactory.lua.

khaledhosny commented 8 years ago

The whole thing was never tested; no one did any complex RTL documents with it. If you want to implement the Unicode BiDi algorithm on top of it (it is the 21st century!), good luck: you have to deal with all the weird and undocumented behaviour of different modes and boxes. I tried, and the result was very fragile; whenever you start using TeX in a slightly advanced way, things start to break apart.

The only lesson to be learned from LuaTeX/Omega is to _not_ do whatever they did.

deepakjois commented 8 years ago

Sorry, I wasn’t clear. My question was directed more towards your other comment. Could you give me some examples of software out there that implements the RTL model well, or well enough? I would like to study and understand it in detail if possible.

While searching on the internet I came across a lot of discussions on the TeX mailing lists, where you were one of the participants. So I understand where you are coming from, regarding Omega/TeX.

Personally, I don’t really care about TeX as much as I care about being able to typeset Urdu (and Devanagari for that matter, but anything that can handle Urdu should be able to handle Devanagari) text nicely on a page. It is really frustrating that even today (it is the 21st century after all! :>) I can’t find a decent open source typesetting engine that can set Urdu well.

SILE actually does pretty well, even though it is really very primitive at the moment (not meant as a slight, it is after all fairly new). Even SILE wasn’t designed to support RTL from the very beginning (as can be seen by bugs like these). But it has a pretty small codebase which I could understand fully in one sitting. I wish to contribute towards improving non-Latin typesetting with it (esp Indic and RTL languages) for purely selfish reasons, but before that I want to understand what existing engines do.

XeTeX (with the bidi package) is the least worst alternative among others. I am using it for now to do serious typesetting. But I find it a bit too much of a challenge to understand the TeX macro language (despite having programmed professionally before), which is needed in many situations.

If LuaTeX had HarfBuzz, I could have considered using it, but I am not sure when that is going to happen. I might try to do it myself if I feel like there is no other alternative :). The existing shaping engine inside LuaTeX is really crap. So much for this being the holy grail, as some people on the internet would like to believe!

For example, here is a screenshot for LuaLaTeX (makes me cry):

screenshot 2015-11-15 17 51 19

vs. XeLaTeX:

screenshot 2015-11-15 17 57 41

Ah well, I am sure you have seen/heard all of this before. I will continue looking into this whenever time and energy permits. But if you have any pointers from your earlier experience, it will be nice to know about it.

khaledhosny commented 8 years ago

I mean software like Pango, major web browsers or office suites, all of which implement RTL support in a very satisfactory way. The TeX world is still struggling to get out of the 80s, so you see lots of discussions and re-discovery of what the rest of the software industry has been done with for decades.

simoncozens commented 8 years ago

From my perspective, the problem is that all this RTL best-practice is not available in ways that are particularly easy to access. Office suites and web browsers are huge systems which (necessarily) wrap everything in many layers of abstraction and interconnected components. At the same time, reading UAX 9 and UAX 14 and trying to implement RTL text at the presentation layer is like learning to drive a car by reading the engine maintenance manual. And none of the Unicode books do a good job of describing what you need to do to support RTL properly.

simoncozens commented 8 years ago

OK, I have now tried the clever solution (going back to the original unshaped nodes) and it doesn't work. This is because (a) if the unshaped node contains a paragraph full of text, you may have already dealt with part of the text when you hit the page break, so you don't want to push back the whole paragraph (just the bit that hasn't been output yet), and (b) if the text contained in the node doesn't fit in the upcoming frame, you end up pushing it back and back and back ad infinitum. So I'm sticking with the stupid solution for the time being until someone demonstrates it to be inadequate.
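Point (b) can be illustrated with a toy Python simulation. It is a sketch under simplified assumptions, not SILE's page builder: a node is reduced to a line count, each frame holds `frame_capacity` lines, and the hypothetical flag `restore_whole_node` selects whether a page break requeues the whole original node or only the leftover lines. A `max_frames` cap stands in for "ad infinitum".

```python
def frames_needed(node_lines, frame_capacity, restore_whole_node, max_frames=20):
    """Count frames used to place one node; return None if the cap is reached."""
    queue = node_lines
    for frame in range(1, max_frames + 1):
        placed = min(queue, frame_capacity)
        leftover = queue - placed
        if leftover == 0:
            return frame
        # Page break: decide what goes back on the queue for the next frame.
        queue = node_lines if restore_whole_node else leftover
    return None


# A five-line node in frames that hold two lines each.
print(frames_needed(5, 2, restore_whole_node=False))
print(frames_needed(5, 2, restore_whole_node=True))
```

When only the leftover lines are requeued the queue shrinks on every frame. When the whole node is restored, the queue is back to its full size after each page break, so a node larger than the frame never finishes.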

alerque commented 5 years ago

I'm working on setting up local tests in #592 and was unable to replicate what I think was the breaking condition for tests/sura-2.sil. That being said, I do have it outputting something that is clearly wrong, but possibly for other reasons. As I'm not entirely sure what the pushback bug is/was, I don't know if it's the same as this, but I'm marking the test as known-bad and moving on. My current output is:

image

khaledhosny commented 1 year ago

The test output looks correct to me. There is a line breaking issue in the first line of the second page, but I think it is just an overflowing box. Running the test now gives slightly different output (I think because of a different version of AmiriQuran), but I think the original issue is fixed.