Closed dgrunspan closed 1 year ago
The scripts in this repository are examples - with the intention to help develop solutions and apps based on PyMuPDF. They are not part of PyMuPDF itself.
They may well run into problems in a specific situation. We therefore do not accept issues in this repository, and certainly not in the main PyMuPDF repository.
Instead, we would welcome your improvements via a PR. In your case, you will probably find a more promising approach using text spans instead of blocks.
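To illustrate the span-based idea, here is a minimal sketch, not part of PyMuPDF or of multi_column.py. It assumes the input has the nested blocks / lines / spans layout returned by `page.get_text("dict")`, and it assumes that columns are separated by an empty vertical strip wider than a hypothetical `min_gap` threshold. The helper names are made up for this example.

```python
def spans_from_page_dict(page_dict):
    """Flatten a page dict into (x0, y0, x1, y1, text) tuples.

    Assumption: page_dict looks like the result of page.get_text("dict").
    """
    spans = []
    for block in page_dict.get("blocks", []):
        for line in block.get("lines", []):  # image blocks carry no "lines"
            for span in line.get("spans", []):
                text = span["text"].strip()
                if text:
                    spans.append((*span["bbox"], text))
    return spans


def split_columns(spans, min_gap=20.0):
    """Group spans into columns, scanning left to right.

    A new column starts when a span begins more than min_gap to the right
    of everything collected so far (min_gap is a guessed threshold).
    """
    columns = []
    right_edge = None
    for span in sorted(spans, key=lambda s: s[0]):
        x0, x1 = span[0], span[2]
        if right_edge is None or x0 - right_edge > min_gap:
            columns.append([])
            right_edge = x1
        else:
            right_edge = max(right_edge, x1)
        columns[-1].append(span)
    return columns


def reading_order_text(page_dict, min_gap=20.0):
    """Return one string per column, each read top to bottom."""
    result = []
    for column in split_columns(spans_from_page_dict(page_dict), min_gap):
        column.sort(key=lambda s: (round(s[1]), s[0]))
        result.append(" ".join(s[4] for s in column))
    return result
```

With a real document, `page_dict` would come from `page.get_text("dict")`. A span that crosses the gap, such as a full-width heading, would merge the columns in this sketch, so such spans would need to be filtered out or handled separately.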
OK, thanks. BTW: I still think it's also a platform issue, as page.get_text("text") returns text mixed from both columns, in the wrong order.
It is not a platform issue. Characters may come in arbitrary order within a PDF. This is normal.
Anyone can make up to n! different PDFs which look exactly identical (n = number of characters on the page). Here are two PDFs which I created to demonstrate this: PDF1, PDF2.
Extract text from both of them ...
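The point about storage order can be shown without any PDF at all. The following toy sketch is not PyMuPDF code: it models a page as a list of `(x, y, char)` glyphs, with y growing downward, and compares reading the glyphs in stored order with reading them sorted by position. All names are made up for the illustration.

```python
import math

# Nine glyphs on one line, stored left to right.
glyphs = [(72.0 + 6.0 * i, 100.0, ch) for i, ch in enumerate("Hello PDF")]

# The same glyphs at the same positions, stored in reverse order.
# A renderer would draw both pages identically.
reordered = glyphs[::-1]


def text_in_stored_order(items):
    """Concatenate characters in the order they are stored."""
    return "".join(ch for _, _, ch in items)


def text_in_position_order(items):
    """Concatenate characters top to bottom, then left to right."""
    return "".join(ch for _, _, ch in sorted(items, key=lambda g: (g[1], g[0])))


print(text_in_stored_order(glyphs))       # Hello PDF
print(text_in_stored_order(reordered))    # FDP olleH
print(text_in_position_order(reordered))  # Hello PDF
print(math.factorial(len(glyphs)))        # 362880 possible storage orders
```

Sorting by position recovers the text here only because the toy page has a single line in a single column. On a multi-column page, a plain top-to-bottom, left-to-right sort would interleave the columns, which is why column detection is a separate step.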
multi_column.py does not identify multiple columns in some cases 2464.pdf
Check the attached file: page 1 has clearly separated columns, but the utility mixes them into one continuous stream. page.get_text("text") has the same problem, and its output is mixed as well.
Does "lines in the same block" mean "same column"? It's easy to see that the lines jump along the x-axis from the end of one line to the start of the next.
Your configuration (mandatory)
Latest pymupdf