pymupdf / PyMuPDF

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
https://pymupdf.readthedocs.io
GNU Affero General Public License v3.0
5.35k stars, 509 forks

TEXT_DEHYPHENATE not working properly #1920

Closed joaquimcampos closed 2 years ago

joaquimcampos commented 2 years ago

Bug report

Running text extraction with TEXT_DEHYPHENATE does not produce the expected behaviour for the following PDF: issue_one_page.pdf. (It does work correctly on other pages.)

To reproduce, run the following code on issue_one_page.pdf.

import click
import fitz  # PyMuPDF


def main(pdf_file):
    """Print the text of the first page, requesting dehyphenation."""
    doc = fitz.open(pdf_file)
    page = doc[0]

    # Combine the default text flags with TEXT_DEHYPHENATE.
    text = page.get_text(flags=fitz.TEXTFLAGS_TEXT | fitz.TEXT_DEHYPHENATE)
    print(text)


@click.command()
@click.argument('pdf-file', type=click.Path(exists=True))
def cli(pdf_file):
    main(pdf_file)


if __name__ == '__main__':
    cli()

This gives the following output, with the end-of-line hyphens still in place:

$ python3 issue.py issue_one_page.pdf
42
Ένα τίποτα μπορεί ν’ αλλάξει τα πάντα
Όταν επιτέλους περάσετε την Καμπή της Λανθάνουσας Δυ-
νατότητας, οι περισσότεροι θα θεωρήσουν ότι τα καταφέρατε εν 
μία νυκτί! Ο κόσμος που μας περιβάλλει, βλέπει μόνο την κο-
ρύφωση της δράσης μας και όχι όσα προηγήθηκαν. Εσείς όμως, 
γνωρίζετε ότι η επιτυχία σας έγινε εφικτή χάρη στην προσπάθεια 
που καταβάλατε για πολύ καιρό, όταν πιστεύατε ότι δεν σημειώ-
νατε πρόοδο. 
Είναι το ανθρώπινο ισοδύναμο της γεωλογικής πίεσης. Δύο 
τεκτονικές πλάκες μπορεί να συγκλίνουν μεταξύ τους για εκατομ-
μύρια χρόνια και η πίεση σταδιακά να συσσωρεύεται. Κι έπειτα 
κάποια μέρα, τρίβονται μεταξύ τους και πάλι με τον ίδιο τρόπο 
που το έκαναν όλα αυτά τα χρόνια, αλλά αυτή τη φορά η πίεση 
είναι μεγάλη. Γίνεται σεισμός. H αλλαγή μπορεί να συντελείται 
χρόνια, μέχρι να φτάσει στο σημείο της ορατής της εκτόνωσης.
Η επιδεξιότητα απαιτεί υπομονή. Οι Σαν Αντόνιο Σπερς (23), 
μια από τις πιο επιτυχημένες ομάδες στην ιστορία του NBA, 
έχουν μια φράση του κοινωνικού μεταρρυθμιστή Τζέικομπ Ρίις 
στα αποδυτήριά τους: «Όταν απελπίζομαι, κάθομαι και κοιτάζω 
κάποιον λιθοξόο να σφυροκοπάει την πέτρα του. Τη σφυροκοπά-
ει ίσως και εκατό φορές, χωρίς να σχηματίζεται ούτε μια ρωγμή 
στην επιφάνειά της. Κι όμως στο εκατοστό πρώτο χτύπημα η πέ-
τρα θα κοπεί στα δύο και ξέρω ότι αυτό δεν οφείλεται στο τελευ-
ταίο χτύπημα, αλλά σε όλα όσα είχαν προηγηθεί».
ΑΠΟΤΕΛΕΣΜΑΤΑ

joaquimcampos commented 2 years ago

I believe the issue is that the text extraction is identifying different lines as belonging to different blocks, and TEXT_DEHYPHENATE only joins lines and spans within the same block.
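One way to check this hypothesis is to count how many lines each text block holds. The following is a minimal sketch, assuming the dictionary layout returned by `page.get_text("dict")` (text blocks have `"type": 0` and a `"lines"` list). The helper names `load_page_dict`, `lines_per_block` and `every_line_in_own_block` are made up for illustration and are not part of PyMuPDF.

```python
def load_page_dict(pdf_file, page_number=0):
    """Return the block/line/span dictionary of one page (needs PyMuPDF)."""
    import fitz  # imported here so the helpers below work without PyMuPDF

    with fitz.open(pdf_file) as doc:
        return doc[page_number].get_text("dict")


def lines_per_block(page_dict):
    """Number of lines in each text block; image blocks are skipped."""
    return [
        len(block["lines"])
        for block in page_dict["blocks"]
        if block.get("type", 0) == 0  # 0 = text block, 1 = image block
    ]


def every_line_in_own_block(page_dict):
    """True if no text block holds more than one line."""
    return all(count == 1 for count in lines_per_block(page_dict))
```

Calling `every_line_in_own_block(load_page_dict("issue_one_page.pdf"))` would then indicate whether dehyphenation has anything to join on that page.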

JorjMcKie commented 2 years ago

Ah, have you confirmed this is the case here? I have started studying the file, but I haven't looked at that detail yet. If the lines are indeed in different blocks, then you are quite right ...

JorjMcKie commented 2 years ago

Just tested it: you are right! Every line is in its own block, so dehyphenation indeed cannot work. The algorithm that brings text into the block/line/span hierarchy (located within MuPDF) takes a number of criteria into account, such as inter-line distance, font size, font characteristics (ascender, descender) and more, but it does not interpret the text itself.

In this case, each line has a height of 12.74, and the distance from a line's bottom to the next line's top is 4.3. Also, as a preliminary analysis shows, each line is coded in its own PDF text object, i.e. wrapped in its own BT/ET operator pair. Taken together, this was evidently too much for MuPDF to put the lines in the same block.

So you had the right idea: this example is not suitable for dehyphenation.
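The figures quoted above (line height 12.74, gap 4.3, so the gap is roughly a third of the line height) can be measured from line bounding boxes. This is a rough sketch, assuming bounding boxes in the `(x0, y0, x1, y1)` form that PyMuPDF reports for lines in `page.get_text("dict")`; the helpers `line_bboxes` and `line_metrics` are hypothetical names, and the coordinates in the usage note are invented to match the quoted figures.

```python
def line_bboxes(page_dict):
    """Collect the bounding box of every text line on the page."""
    return [
        tuple(line["bbox"])
        for block in page_dict["blocks"]
        for line in block.get("lines", [])  # image blocks have no lines
    ]


def line_metrics(bboxes):
    """Return (heights, gaps) for lines sorted top to bottom.

    A gap is the distance from one line's bottom edge to the next
    line's top edge.
    """
    ordered = sorted(bboxes, key=lambda box: box[1])
    heights = [y1 - y0 for _, y0, _, y1 in ordered]
    gaps = [nxt[1] - cur[3] for cur, nxt in zip(ordered, ordered[1:])]
    return heights, gaps
```

For example, `line_metrics([(50, 100, 400, 112.74), (50, 117.04, 400, 129.78)])` yields heights of about 12.74 and a single gap of about 4.3.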

JorjMcKie commented 2 years ago

Based on the insight presented by your example, we will insert a comment in the documentation.

jamie-lemon commented 2 years ago

I'll be sure to update https://pymupdf.readthedocs.io/en/latest/vars.html?highlight=dehyphenate#TEXT_DEHYPHENATE with some notes soon. Going forward, maybe we could parameterise line-height or something alongside this flag so that lines are considered to be part of the same block? No idea if that is something which is feasible or not.

JorjMcKie commented 2 years ago

"I'll be sure to update https://pymupdf.readthedocs.io/en/latest/vars.html?highlight=dehyphenate#TEXT_DEHYPHENATE with some notes soon. Going forward, maybe we could parameterise line-height or something alongside this flag so that lines are considered to be part of the same block? No idea if that is something which is feasible or not."

I am afraid this would have to happen inside MuPDF's text page logic. Any change we might introduce has consequences that also apply to things like text search, not to mention that subsequent lines may not have the same inclination angle. Also, if text is not coded in reading sequence, the whole thing breaks down anyway. We might think about increasing the threshold for inter-line distances, which in this case seems to be the one reason why each line lives in its own block. As of today, PyMuPDF makes no attempt to interfere here: it just passes the text flags bit field on to MuPDF's text page creation.

JorjMcKie commented 2 years ago

I think this issue has now turned into a discussion item, so let me transfer it there.

joaquimcampos commented 2 years ago

"We might think about increasing the threshold WRT inter-line distances - which in this case seems to be the one reason why each line lives in its own block."

I think this is a wise choice since visually the lines do seem to belong in the same block.

I have written my own Python code to merge blocks where the last line of the first block and the first line of the next fit some criteria (relative vertical distance, horizontal position, etc.). This solved the issue.
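The code itself was not posted, so the following is only a sketch of what such a merge could look like, not the commenter's implementation. It assumes blocks in the `page.get_text("dict")` layout, and the thresholds `max_gap_ratio` and `max_indent` are arbitrary illustrative defaults. The hyphen handling is deliberately naive: it drops every trailing hyphen, including genuine ones in compound words.

```python
def line_text(line):
    """Concatenate the span texts of one line."""
    return "".join(span["text"] for span in line["spans"])


def should_merge(prev_line, next_line, max_gap_ratio=0.5, max_indent=20.0):
    """Decide whether next_line continues the paragraph of prev_line."""
    px0, py0, _, py1 = prev_line["bbox"]
    nx0, ny0, _, _ = next_line["bbox"]
    height = py1 - py0
    gap = ny0 - py1  # bottom of previous line to top of next line
    below = ny0 > py0  # next line must start lower on the page
    return below and gap <= max_gap_ratio * height and abs(nx0 - px0) <= max_indent


def merge_blocks(blocks, **criteria):
    """Group the lines of adjacent blocks into paragraphs (lists of lines)."""
    groups = []
    for block in blocks:
        lines = block.get("lines", [])
        if not lines:
            continue  # image blocks carry no lines
        if groups and should_merge(groups[-1][-1], lines[0], **criteria):
            groups[-1].extend(lines)
        else:
            groups.append(list(lines))
    return groups


def join_lines(lines):
    """Join lines into one string, gluing words split by a trailing hyphen."""
    out = ""
    for line in lines:
        text = line_text(line).strip()
        if not text:
            continue
        if out.endswith("-"):
            out = out[:-1] + text  # naive: assumes the hyphen is a line-break hyphen
        elif out:
            out += " " + text
        else:
            out = text
    return out
```

With a real page, `merge_blocks(page.get_text("dict")["blocks"])` would return the merged groups, and `join_lines` can be applied to each group in turn.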