Closed joaquimcampos closed 2 years ago
I believe the issue is that the text extraction is identifying different lines as belonging to different blocks, and TEXT_DEHYPHENATE only joins lines and spans within the same block.
Ah, have you confirmed this is the case here? I have started studying the file, but I didn't look at that detail yet. If the lines are indeed in different blocks, then you are quite right ...
Just tested it: you are right! Every line is in its own block. So indeed dehyphenation cannot work. The algorithm behind bringing text into the block/line/span hierarchy (located within MuPDF) takes a bunch of criteria into account like inter-line distance, font size, font characteristics (ascender, descender) and more ... but no interpretation of the text itself.
In this case, each line height is 12.74. The distance from a line's bottom to the next line's top is 4.3.
Also - as a preliminary analysis shows - each line is coded in its own PDF text object, i.e. wrapped in its own BT/ET operator pair.
Obviously, taken together this was too much for MuPDF to put the lines in the same blocks.
So you had the right idea - this example is not suitable for dehyphenation.
Based on the insight presented by your example, we will insert a comment in the documentation.
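The layout described above (one line per block, with a constant gap between lines) can be checked with a small helper. This is only a sketch: it assumes the input has the structure that PyMuPDF's page.get_text("dict") returns (a "blocks" list whose text blocks carry "lines" with a "bbox" each), and the function name line_layout is made up for illustration.

```python
def line_layout(page_dict):
    """Summarise how text lines are distributed over blocks.

    page_dict is assumed to look like the result of page.get_text("dict"):
    {"blocks": [{"type": 0, "lines": [{"bbox": (x0, y0, x1, y1), ...}]}]}
    Returns (lines_per_block, gaps), where gaps holds the vertical distance
    from each line's bottom to the next line's top, in extraction order.
    """
    lines_per_block = []
    line_boxes = []
    for block in page_dict["blocks"]:
        if block.get("type", 0) != 0:  # skip image blocks
            continue
        lines = block["lines"]
        lines_per_block.append(len(lines))
        line_boxes.extend(line["bbox"] for line in lines)
    gaps = [
        round(nxt[1] - cur[3], 2)  # next top minus current bottom
        for cur, nxt in zip(line_boxes, line_boxes[1:])
    ]
    return lines_per_block, gaps
```

If every entry of lines_per_block is 1, each line sits in its own block and TEXT_DEHYPHENATE has nothing to join. With a real document the input would come from something like fitz.open("issue_one_page.pdf")[0].get_text("dict").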
I'll be sure to update https://pymupdf.readthedocs.io/en/latest/vars.html?highlight=dehyphenate#TEXT_DEHYPHENATE with some notes soon. Going forward, maybe we could parameterise line-height or something alongside this flag so that lines are considered to be part of the same block? No idea if that is something which is feasible or not.
I am afraid this would have to happen inside MuPDF's text page logic. Any change we may want to introduce there also affects things like text search, not to mention that subsequent lines may not have the same inclination angle. Also, if text is not coded in reading sequence, the whole thing breaks down anyway. We might think about increasing the threshold WRT inter-line distances - which in this case seems to be the one reason why each line lives in its own block. As of today, there are no attempts inside PyMuPDF to interfere here - PyMuPDF just passes the text flags bit field on to MuPDF's text page creation.
I think this issue has now turned into a discussion item, so let me transfer it to there.
"We might think about increasing the threshold WRT inter-line distances - which in this case seems to be the one reason why each line lives in its own block."
I think this is a wise choice since visually the lines do seem to belong in the same block.
I have written my own Python code to merge blocks where the last line of the first and the first line of the next fit some criteria (relative vertical distance, horizontal position, etc.). This solved the issue.
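The poster's code is not shown, so the following is only a guess at what such a merge could look like, not their implementation. It assumes blocks in the tuple form that page.get_text("blocks") returns, (x0, y0, x1, y1, text, block_no, block_type), and uses two illustrative thresholds (max_gap_ratio, max_x_shift) and a naive dehyphenate helper, all of which are assumptions made for this sketch.

```python
def merge_blocks(blocks, max_gap_ratio=0.5, max_x_shift=5.0):
    """Merge vertically adjacent text blocks that look like one paragraph.

    A block is appended to the previous one when the vertical gap is at most
    max_gap_ratio times its (estimated) line height and the left edges are
    within max_x_shift points. Returns a list of [x0, y0, x1, y1, lines].
    """
    merged = []
    for x0, y0, x1, y1, text, _block_no, block_type in blocks:
        if block_type != 0:  # ignore image blocks
            continue
        lines = text.splitlines()
        if not lines:
            continue
        line_height = (y1 - y0) / len(lines)  # rough estimate
        if merged:
            prev = merged[-1]
            gap = y0 - prev[3]  # this top minus previous bottom
            close = -line_height < gap <= max_gap_ratio * line_height
            aligned = abs(x0 - prev[0]) <= max_x_shift
            if close and aligned:
                prev[2] = max(prev[2], x1)
                prev[3] = y1
                prev[4].extend(lines)
                continue
        merged.append([x0, y0, x1, y1, lines])
    return merged


def dehyphenate(lines):
    """Join lines into one string, gluing words split by a trailing hyphen.

    Naive by design: every trailing "-" is treated as a line-break hyphen.
    """
    out = ""
    for line in lines:
        line = line.strip()
        if out.endswith("-"):
            out = out[:-1] + line
        elif out:
            out += " " + line
        else:
            out = line
    return out
```

With the figures from this thread (line height 12.74, gap 4.3), the gap is about a third of the line height, so a ratio of 0.5 merges those lines while still leaving paragraphs separated by a larger gap apart. The blocks must already be in reading order for this to work.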
Bug report
Running text extraction with TEXT_DEHYPHENATE does not produce the expected behaviour for the following PDF: issue_one_page.pdf. (But it does work correctly on other pages...) To reproduce, run the following code on the PDF issue_one_page.pdf.
This gives