multi_column.py does not identify multiple columns in some cases

pymupdf / PyMuPDF-Utilities

Demos, examples and utilities using PyMuPDF

GNU Affero General Public License v3.0


multi_column.py does not identify multiple columns in some cases #113

Closed by dgrunspan 1 year ago

dgrunspan commented 1 year ago

Attached file: 2464.pdf

See page 1 of the attached file, which has clearly separated columns. The utility mixes them into one continuous stream. page.get_text("text") has the same problem: its output stream is mixed as well.

Do lines in the same block belong to the same column? It is easy to see that the lines jump along the x-axis from the end of one line to the start of the next.

Your configuration (mandatory)

Latest pymupdf

JorjMcKie commented 1 year ago

The scripts in this repository are examples - with the intention to help develop solutions and apps based on PyMuPDF. They are not part of PyMuPDF itself.

They are clearly prone to problems in concrete situations. We therefore do not accept issues about them in this repository, and certainly not in the main PyMuPDF repository.

Instead, we would welcome your improvements via a PR. In your case, you will probably find a more promising approach using text spans instead of blocks.
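
As an illustration of the span-based idea, here is a minimal sketch, not the repository's method. It assumes spans are read via page.get_text("dict") and that columns are separated by a horizontal gap that no span crosses. The helper names (spans_from_page, column_bands, text_by_column) and the min_gap value are hypothetical. A full-width heading would merge all columns into one band, so real pages will likely need extra handling.

```python
from typing import List, Tuple

# One text span: (x0, y0, x1, y1, text). Hypothetical layout for this sketch.
Span = Tuple[float, float, float, float, str]


def spans_from_page(page) -> List[Span]:
    """Collect non-empty text spans from a PyMuPDF page object."""
    spans = []
    for block in page.get_text("dict")["blocks"]:
        for line in block.get("lines", []):  # image blocks have no "lines"
            for span in line["spans"]:
                if span["text"].strip():
                    x0, y0, x1, y1 = span["bbox"]
                    spans.append((x0, y0, x1, y1, span["text"]))
    return spans


def column_bands(spans: List[Span], min_gap: float = 10.0) -> List[Tuple[float, float]]:
    """Merge the horizontal extents of all spans into column bands.

    Two extents end up in the same band unless they are separated by
    at least min_gap points (an assumed threshold, tune per document).
    """
    bands: List[List[float]] = []
    for x0, x1 in sorted((s[0], s[2]) for s in spans):
        if bands and x0 - bands[-1][1] < min_gap:
            bands[-1][1] = max(bands[-1][1], x1)
        else:
            bands.append([x0, x1])
    return [(a, b) for a, b in bands]


def text_by_column(spans: List[Span], min_gap: float = 10.0) -> List[str]:
    """Return one string per column, left to right, each read top to bottom."""
    bands = column_bands(spans, min_gap)
    columns: List[List[Span]] = [[] for _ in bands]
    for s in spans:
        for i, (a, b) in enumerate(bands):
            if a <= s[0] <= b:  # assign by the span's left edge
                columns[i].append(s)
                break
    return [
        " ".join(s[4] for s in sorted(col, key=lambda s: (round(s[1]), s[0])))
        for col in columns
    ]


# Synthetic two-column example, stored in interleaved order.
demo = [
    (50, 100, 250, 112, "left one"),
    (300, 100, 500, 112, "right one"),
    (50, 120, 240, 132, "left two"),
    (300, 120, 480, 132, "right two"),
]
print(text_by_column(demo))
```

With a real document, the input would come from something like spans_from_page(doc[0]) after opening the file with PyMuPDF.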

dgrunspan commented 1 year ago

OK, thanks. BTW, I still think it is also a platform issue, since page.get_text("text") returns text from both columns mixed in the wrong order.

JorjMcKie commented 1 year ago

> Ok, Thx BTW: I still think its a platform issue also, as page.get_text("text") gets a mixed text from both columns in the wrong order

It is not a platform issue. Characters may be stored in arbitrary order within a PDF. This is normal. Anyone can create up to n! different PDFs that look exactly identical (n = number of characters on the page). Here are two PDFs which I created to demonstrate this: PDF1, PDF2. Extract text from both of them ...
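
The n! argument can be illustrated without any PDF at all. The following is a toy model, assuming a "page" is just a list of (x, y, character) glyphs; the names stream_text, rendered and positional_text are made up for this sketch. Every storage order renders identically, while a reader that follows storage order sees a different string each time.

```python
from itertools import permutations
from math import factorial

# Hypothetical page content: each glyph is (x, y, character).
glyphs = [(0, 0, "a"), (10, 0, "b"), (20, 0, "c")]


def stream_text(glyph_seq):
    """Text as a naive extractor sees it: in storage order."""
    return "".join(ch for _, _, ch in glyph_seq)


def rendered(glyph_seq):
    """What a viewer shows: depends only on positions, not on order."""
    return frozenset(glyph_seq)


def positional_text(glyph_seq):
    """Text recovered by sorting glyphs top to bottom, then left to right."""
    return "".join(ch for _, _, ch in sorted(glyph_seq, key=lambda g: (g[1], g[0])))


orderings = list(permutations(glyphs))

# n distinct glyphs can be stored in n! different orders.
print(len(orderings) == factorial(len(glyphs)))
# All orderings look the same when rendered.
print(len({rendered(o) for o in orderings}))
# Storage-order extraction yields a different string per ordering.
print(sorted(stream_text(o) for o in orderings))
# Position-based extraction yields the same string for every ordering.
print({positional_text(o) for o in orderings})
```

Sorting by position is enough for this single line of text. A multi-column page additionally needs the columns to be separated before sorting.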