Closed deepakjois closed 1 year ago
Wow, that's special. It looks like something in the BIDI chain is interrupted and not started again from the right place. Thanks for reporting this.
There are some clues to the problem in that this happens only when page breaks are calculated automatically. Looks like some (already shaped) nodes are being re-reordered.
I turned off RTL (i.e. removed the direction=RTL option) to narrow down the issue.
Then, looking at the debug output just after and before the page break sheds some light on the problem. It seems some nodes are ‘pushed back’, and when they are ‘boxed up’ again after the page break, they have already been reordered. But the second ‘boxup’ re-reorders them and leads to the garbled glyphs in the output.
[pagebuilder] No page break here
[typesetter] Leaving hmode
[typesetter] Boxed up H<0pt>^0-0vG<0pt>N<21.6015625pt>^10.5615234375-4.0537109375v(کب)G<4.1015625pt>N<16.2353515625pt>^5.7490234375-0.3212890625v(سے)G<4.1015625pt>N<16.9873046875pt>^5.76953125-3.212890625v(ہیں)G<4.1015625pt>N<18.1015625pt>^9.98046875-7.068359375v(اِسی)G<4.1015625pt>N<37.1328125pt>^10.5615234375-4.935546875v(کوشش)G<4.1015625pt>N<16.10546875pt>^4.45703125-3.212890625v(میں)G<4.1015625pt>N<16.0166015625pt>^9.98046875-3.7119140625v(اب)P(0|-10000)G<4.1015625pt>N<14.1572265625pt>^12.919921875-1.50390625v(کہ)G<4.1015625pt>N<15.845703125pt>^7.13671875-3.349609375v(نیند)G<4.1015625pt>N<14.4033203125pt>^7.79296875-3.076171875v(ہی)G<4.1015625pt>N<42.19140625pt>^14.9638671875-4.5048828125v(بچاےگی)P(0|-10000)G<4.1015625pt>N<27.6513671875pt>^6.822265625-4.7783203125v(سپنوں)G<4.1015625pt>N<16.10546875pt>^4.45703125-3.212890625v(میں)G<4.1015625pt>N<16.337890625pt>^4.9765625-4.83984375v(بھی)G<4.1015625pt>N<37.830078125pt>^10.3291015625-3.25390625v(نضارے)G<4.1015625pt>N<22.736328125pt>^11.3955078125-4.7509765625v(لیکن)P(0|-10000)G<4.1015625pt>N<35.423828125pt>^7.02734375-4.0263671875v(سنہرے)G<4.1015625pt>N<15.462890625pt>^9.775390625-3.2607421875v(ملتے)G<4.1015625pt>N<18.1015625pt>^7.02734375-4.0263671875v(نہیں)G<0pt>P(1|-10000)
[typesetter] Considering leading between self two lines
[typesetter] Depth of previous line was 4.0263671875
[typesetter] Leading height = 27pt - 10.5615234375 - 4.0263671875 = 12.412109375
[typesetter] adding penalty of 3000 after VB<10.5615234375|VB[hbox کب سے ہیں اِسی کوشش میں اب(!) hbox]v7.068359375)
[typesetter] Considering leading between self two lines
[typesetter] Depth of previous line was 7.068359375
[typesetter] Leading height = 27pt - 14.9638671875 - 7.068359375 = 4.9677734375
[typesetter] Considering leading between self two lines
[typesetter] Depth of previous line was 4.5048828125
[typesetter] Leading height = 27pt - 11.3955078125 - 4.5048828125 = 11.099609375
[typesetter] adding penalty of 3000 after VB<11.3955078125|VB[ سپنوں میں بھی نضارے لیکن(!) hbox]v4.83984375)
[typesetter] Considering leading between self two lines
[typesetter] Depth of previous line was 4.83984375
[typesetter] Leading height = 27pt - 9.775390625 - 4.83984375 = 12.384765625
[pagebuilder] Page builder for frame content called with 57 nodes, 505.98425745
[pagebuilder] Dealing with VBox VG<12.384765625pt>
[pagebuilder] I have 3.193241825pts left
[pagebuilder] Dealing with VBox VB<9.775390625|VB[ نہیں ملتے سنہرے (!) hbox]v4.0263671875)
[pagebuilder] I have -10.6085159875pts left
[pagebuilder] Dealing with VBox VG<18pt>
[pagebuilder] I have -28.6085159875pts left
[pagebuilder] Badness: 1073741823
[pagebuilder] outputting
[pagebuilder] Glues for self page adjusted by -9.5361719958333
[pagebuilder] OUTPUTTING frame content
[1] [typesetter] Pushing back 10 nodes
[typesetter] Leaving hmode
[typesetter] Boxed up H<0pt>^0-0vG<0pt>N<16.0166015625pt>^9.98046875-3.7119140625v(اب)G<4.1015625pt>N<16.10546875pt>^4.45703125-3.212890625v(میں)G<4.1015625pt>N<37.1328125pt>^10.5615234375-4.935546875v(کوشش)G<4.1015625pt>N<18.1015625pt>^9.98046875-7.068359375v(اِسی)G<4.1015625pt>N<16.9873046875pt>^5.76953125-3.212890625v(ہیں)G<4.1015625pt>N<16.2353515625pt>^5.7490234375-0.3212890625v(سے)G<4.1015625pt>N<21.6015625pt>^10.5615234375-4.0537109375v(کب)P(0|-10000)H<0pt>^0-0vG<4.1015625pt>N<42.19140625pt>^14.9638671875-4.5048828125v(بچاےگی)G<4.1015625pt>N<14.4033203125pt>^7.79296875-3.076171875v(ہی)G<4.1015625pt>N<15.845703125pt>^7.13671875-3.349609375v(نیند)G<4.1015625pt>N<14.1572265625pt>^12.919921875-1.50390625v(کہ)P(0|-10000)H<0pt>^0-0vG<4.1015625pt>N<22.736328125pt>^11.3955078125-4.7509765625v(لیکن)G<4.1015625pt>N<37.830078125pt>^10.3291015625-3.25390625v(نضارے)G<4.1015625pt>N<16.337890625pt>^4.9765625-4.83984375v(بھی)G<4.1015625pt>N<16.10546875pt>^4.45703125-3.212890625v(میں)G<4.1015625pt>N<27.6513671875pt>^6.822265625-4.7783203125v(سپنوں)P(0|-10000)H<0pt>^0-0vG<4.1015625pt>N<18.1015625pt>^7.02734375-4.0263671875v(نہیں)G<4.1015625pt>N<15.462890625pt>^9.775390625-3.2607421875v(ملتے)G<4.1015625pt>N<35.423828125pt>^7.02734375-4.0263671875v(سنہرے)G<0pt>P(1|-10000)
[typesetter] adding penalty of 3000 after VB<10.5615234375|VB[hbox اب میں کوشش اِسی ہیں سے کب(!) hbox]v7.068359375)
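To make the symptom in the log concrete, here is a toy model in Python. It is not SILE's code: `reorder_rtl` and the word list are illustrative assumptions. It treats reordering of a purely RTL line as a plain reversal and shows that running the step a second time flips the words, just as the first line of the stanza comes out reversed in the second "Boxed up" entry above.

```python
def reorder_rtl(nodes):
    """Toy stand-in for BIDI reordering of a purely RTL line: reverse it."""
    return list(reversed(nodes))


# First line of the stanza, in the order shown by the first "Boxed up" entry.
first_boxup = ["کب", "سے", "ہیں", "اِسی", "کوشش", "میں", "اب"]

# Pushing the nodes back and boxing them up again runs the step a second time.
second_boxup = reorder_rtl(first_boxup)

# The second pass flips the order of the words on the line...
print(second_boxup == first_boxup[::-1])
# ...and a further pass would flip it back, so the step is not idempotent.
print(reorder_rtl(second_boxup) == first_boxup)
```

Any reordering step with this property has to be run exactly once per node, which is why re-processing already reordered nodes is harmful.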
As a test, please add this code to your .sil file:
\begin{script}
SILE.typesetter.pushBack = function (self)
  self:runHooks("newframe")
end
\end{script}
It's not a fix but it will tell me if the problem is with the pushBack routine.
On adding the code as you asked, SILE seems to render the document as I expect it to. The garbling of glyphs as shown above does not happen anymore.
What information are you looking for, exactly? Should I upload the PDF? It would also be nice to explain why this code above seems to not garble the glyphs.
That's great, thanks.
What is happening is that as SILE builds a page, it turns characters into glyphs, glyphs into lines, and lines into pages. But when it determines where a page break should occur in a stream of lines, it may have some lines ("vboxes" in SILE) left over which don't fit on the current page. These are already assembled into vboxes and SILE could just dump them on the next page.
The problem is that the next page is not guaranteed to have exactly the same frame size as the current page. So all that material that was assembled into those lines needs to come out of a vbox again and all the characters within the vbox need to go back onto the queue, so that the line builder can have another look at them with the line size of the next page's frame.
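A small sketch of why the leftover lines cannot simply be reused, written in Python as a toy model. The greedy `break_lines` function, the widths, and the frame sizes are all assumptions made for illustration; SILE's real line breaker is more sophisticated.

```python
def break_lines(widths, frame_width):
    """Greedy line breaker: pack items onto a line until the next one would overflow."""
    lines, current, used = [], [], 0.0
    for w in widths:
        if current and used + w > frame_width:
            lines.append(current)
            current, used = [], 0.0
        current.append(w)
        used += w
    if current:
        lines.append(current)
    return lines


def unbox(lines):
    """Flatten leftover lines back into a single queue of items."""
    return [w for line in lines for w in line]


words = [30, 20, 40, 25, 35, 30, 20, 45]

# Lines built for the current page's frame, 100 units wide.
page1_lines = break_lines(words, frame_width=100)

# Pretend only the first line fit on the page; the rest is left over.
leftover = page1_lines[1:]

# The next frame is narrower, so the leftover lines are unboxed and re-broken.
next_page = break_lines(unbox(leftover), frame_width=70)

print(leftover)
print(next_page)
```

In this sketch the first leftover line is 90 units wide, so it would overflow a 70-unit frame if it were reused unchanged. Unboxing and re-breaking produces lines that fit.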
The code that I gave you above says "don't unbox the vboxes, just shove them on the output queue as they are". It's implicitly a guarantee that the next page's frame will have the same size as the current page's frame. That's not a good general-case solution, but it shows me that the problem that you are seeing is happening when SILE unboxes a vbox and sends all that material back to the line builder for reprocessing. (What is probably happening is that when it gets to the line builder, the line builder runs the BIDI algorithm a second time.)
Thanks for the explanation. It’s definitely tricky. Conceptually, the best thing to do is to throw away the vboxes, time-travel back to the point in the code where they started getting generated, and just reprocess the code itself. But I sense that might be very difficult to do as well.
OK, I think I have fixed this although I am not terribly proud of it.
Yes, it works well, but that piece of code definitely does look like technical debt :)
FWIW, I was doing some reading about this text direction stuff. There are some interesting things in the LuaTeX reference manual, Chapter 3 about the data structures it uses for nodes. Might be useful to refer to it if you get around to doing some refactoring of nodes. I also found some slides here useful to understand LuaTeX’s pardir and textdir primitives.
Except that LuaTeX’s (actually Omega’s) text direction model has never been fully exercised, so it looks nice on paper until you actually try it. If one is looking for inspiration, there are tens of applications that handle text direction (and are actually used by users) to draw on.
Rethinking it this morning, I think the right thing to do is to replace nodes which are being pushed back with the original unshaped nodes they came from. This will cause the bidi process to be run again in full. I think we keep those original unshaped nodes around as an attribute on the nodes in the output queue, so it should be possible to reconstruct the original sequence of nodes.
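As a rough illustration of that idea, here is a Python sketch. The `Unshaped` and `Shaped` classes, the `original` attribute name, and `originals_for_pushback` are hypothetical stand-ins, not SILE's node types. It assumes each shaped node keeps a reference to the unshaped node it came from, and rebuilds the original sequence by collapsing consecutive shaped nodes that share a source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unshaped:
    text: str


@dataclass
class Shaped:
    glyphs: str
    original: Unshaped  # hypothetical back-reference to the source node


def originals_for_pushback(pushed_back):
    """Collapse consecutive shaped nodes that came from the same unshaped node."""
    result = []
    for node in pushed_back:
        if not result or result[-1] is not node.original:
            result.append(node.original)
    return result


para_a = Unshaped("first paragraph")
para_b = Unshaped("second paragraph")

# Four shaped nodes on the output queue, two from each source node.
queue = [
    Shaped("first", para_a),
    Shaped("paragraph", para_a),
    Shaped("second", para_b),
    Shaped("paragraph", para_b),
]

restored = originals_for_pushback(queue)
print([node.text for node in restored])
```

The restored nodes would then go through shaping and BIDI from scratch, so no node is reordered twice.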
@khaledhosny Could you give some examples, please? I am definitely no expert on text direction models, so looking to learn more :)
The data structures in chapter 3 of the LuaTeX reference manual look pretty well thought out though. So it is worth taking inspiration from. In fact, SILE seems to implement something very similar already, going by the code in core/nodefactory.lua.
The whole thing was never tested; no one did any complex RTL documents with it. If you want to implement the Unicode BiDi algorithm on top of it (it is the 21st century!), good luck: you have to deal with all the weird and undocumented behaviour of different modes and boxes. I tried, and the result was very fragile; whenever you start using TeX in a slightly advanced way, things start to break apart.
The only lesson to be learned from LuaTeX/Omega is to _not_ do whatever they did.
Sorry, I wasn’t clear. My question was directed more towards your other comment. Could you give me some examples of software out there that implements the RTL model well, or well enough? I would like to study and understand it in detail if possible.
While searching on the internet I came across a lot of discussions on the TeX mailing lists, where you were one of the participants. So I understand where you are coming from, regarding Omega/TeX.
Personally, I don’t really care about TeX as much as I care about being able to typeset Urdu (and Devanagari for that matter, but anything that can handle Urdu should be able to handle Devanagari) text nicely on a page. It is really frustrating that even today (it is the 21st century after all! :>) I can’t find a decent open source typesetting engine that can set Urdu well.
SILE actually does pretty well, even though it is really very primitive at the moment (not meant as a slight, it is after all fairly new). Even SILE wasn’t designed to support RTL from the very beginning (as can be seen by bugs like these). But it has a pretty small codebase which I could understand fully in one sitting. I wish to contribute towards improving non-Latin typesetting with it (esp Indic and RTL languages) for purely selfish reasons, but before that I want to understand what existing engines do.
XeTeX (with the bidi package) is the least worst alternative among others. I am using it for now to do serious typesetting. But I find it a bit too much of a challenge to understand the TeX macro language (despite having programmed professionally before), which is needed in many situations.
If LuaTeX had HarfBuzz, I could have considered using it, but I am not sure when that is going to happen. I might try to do it myself if I feel like there is no other alternative :). The existing shaping engine inside LuaTeX is really crap. So much for this being the holy grail, as some people on the internet would like to believe!
For e.g, here is a screenshot for LuaLaTeX (makes me cry):
vs. XeLaTeX:
Ah well, I am sure you have seen/heard all of this before. I will continue looking into this whenever time and energy permits. But if you have any pointers from your earlier experience, it will be nice to know about it.
I mean software like Pango, major web browsers or office suites, all of which implement RTL support in a very satisfactory way. The TeX world is still struggling to get out of the 80s, so you see lots of discussion and rediscovery of what the rest of the software industry has been done with for decades.
From my perspective, the problem is that all this RTL best-practice is not available in ways that are particularly easy to access. Office suites and web browsers are huge systems which (necessarily) wrap everything in many layers of abstraction and interconnected components. At the same time, reading UAX 9 and UAX 14 and trying to implement RTL text at the presentation layer is like learning to drive a car by reading the engine maintenance manual. And none of the Unicode books do a good job of describing what you need to do to support RTL properly.
OK, I have now tried the clever solution (going back to the original unshaped nodes) and it doesn't work. This is because (a) if the unshaped node contains a paragraph full of text, you may have already dealt with part of the text when you hit the page break, so you don't want to push back the whole paragraph (just the bit that hasn't been output yet), and (b) if the text contained in the node doesn't fit in the upcoming frame, you end up pushing it back and back and back ad infinitum. So I'm sticking with the stupid solution for the time being until someone demonstrates it to be inadequate.
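Both failure modes can be seen in a toy model. The Python below is only a sketch under simplifying assumptions: words stand in for nodes, frames hold a fixed number of lines, and `pages_needed` and its parameters are invented for illustration.

```python
# (a) The page break falls in the middle of a paragraph.
paragraph = "one two three four five six".split()
already_output = paragraph[:4]  # words placed on the page before the break
pushed_back = paragraph[4:]     # words that still need a home

clever_queue = list(paragraph)    # swap leftovers for the whole source paragraph
simple_queue = list(pushed_back)  # push back only the leftovers themselves

# Words that would be typeset a second time under each approach.
duplicated_clever = [w for w in clever_queue if w in already_output]
duplicated_simple = [w for w in simple_queue if w in already_output]


# (b) The source paragraph needs more lines than a frame can hold.
def pages_needed(paragraph_lines, frame_capacity, restore_whole, max_pages=10):
    """Count pages until the paragraph is placed, or return None if it never is."""
    remaining = paragraph_lines
    for page in range(1, max_pages + 1):
        remaining -= frame_capacity
        if remaining <= 0:
            return page
        if restore_whole:
            # The leftover is replaced by the full source node: no progress.
            remaining = paragraph_lines
    return None


print(duplicated_clever, duplicated_simple)
print(pages_needed(5, 2, restore_whole=False))
print(pages_needed(5, 2, restore_whole=True))
```

With `restore_whole=True` the queue never shrinks whenever the paragraph is longer than one frame, which is the "back and back and back" behaviour described above.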
I'm working on setting up local tests in #592 and was unable to replicate what I think was the breaking condition for tests/sura-2.sil. That being said I do have it outputting something that is clearly wrong, but possibly for other reasons. As I'm not entirely sure what the pushback bug is/was I don't know if it's the same as this, but I'm marking the test as known-bad and moving on. My current output is:
The test output looks correct to me. There is a line breaking issue in the first line of the second page, but I think it is just an overflowing box. Running the test now gives slightly different output (I think because of a different version of AmiriQuran), but I think the original issue is fixed.
Here is some code to typeset two Urdu poems. The first Urdu poem consists of 6 stanzas of 4 lines each. In order to replicate the problem, the last 3 stanzas are exactly the same. The second Urdu poem is short and just spans a single page.
Output PDF here: https://www.dropbox.com/s/1prkct10yfmx9zb/urdu.pdf?dl=0
Here is the screenshot of the buggy portion at the beginning of page 2:
If you notice, the first stanza on page 2 is typeset really weirdly. It should basically look the same as the stanza after and before it.
This seems to happen only when an automatic page break occurs. When I force a page break with \eject\par as shown in the example above, this problem does not manifest itself.