pdfminer / pdfminer.six

Community maintained fork of pdfminer - we fathom PDF
https://pdfminersix.readthedocs.io
MIT License

Y coordinates wrong for certain PDFs leading to highlighting wrong areas in the PDF #619

Open sreeni5493 opened 3 years ago

sreeni5493 commented 3 years ago

I am using pdfplumber, which is built on top of pdfminer.six, but the issue is with the coordinates coming from pdfminer.six.

Here is the PDF: v2.pdf

When highlighted, the characters appear like this. There is a huge offset in the y direction.

[image: chars]

pietermarsman commented 2 years ago

Can you share the code that you are using to create this image?

johandebeurs commented 1 year ago

I'm also seeing this issue when using pdfplumber, which apparently depends on this library for extraction: https://github.com/jsvine/pdfplumber/issues/538. The y coordinates are way off, while the x coordinates are correct. I'm extracting from bank statements, so unfortunately the files are not directly shareable. A couple of anecdotal observations

pettzilla1 commented 1 year ago

In my experience, these types of documents are sometimes put together in unusual ways, which can cause issues for extraction. There could also be issues if parts of the document use relative sizing: those parts would be scaled to the screen size, so a fixed pixel size wouldn't work.
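To illustrate the scaling point, here is a minimal sketch of mapping a pdfminer-style bounding box (in PDF points, origin at the bottom-left of the page) onto a rendered page image (in pixels, origin at the top-left). The function name `pdf_bbox_to_image_pixels` and its parameters are hypothetical, not part of pdfminer.six or pdfplumber. It assumes the page is rendered without rotation and that the image covers exactly the MediaBox. Whether a missed flip or page offset is what causes the problem in this issue is not confirmed.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]

# The default PDF user-space unit is 1/72 inch.
POINTS_PER_INCH = 72.0


def pdf_bbox_to_image_pixels(bbox: Box, mediabox: Box, dpi: float) -> Box:
    """Map a bottom-left-origin PDF bbox to top-left-origin image pixels.

    Hypothetical helper for illustration only.
    bbox:     (x0, y0, x1, y1) in PDF points, as pdfminer layout objects report it.
    mediabox: (x0, y0, x1, y1) of the page in PDF points; the origin may be non-zero.
    dpi:      resolution the page image was rendered at.
    """
    x0, y0, x1, y1 = bbox
    page_x0, _page_y0, _page_x1, page_y1 = mediabox
    scale = dpi / POINTS_PER_INCH

    # x only needs shifting by the page origin and scaling.
    left = (x0 - page_x0) * scale
    right = (x1 - page_x0) * scale

    # y must be flipped: measure down from the top edge of the page.
    top = (page_y1 - y1) * scale
    bottom = (page_y1 - y0) * scale
    return (left, top, right, bottom)


# A box on a US Letter page, drawn on an image rendered at 144 dpi.
pixels = pdf_bbox_to_image_pixels(
    bbox=(72.0, 700.0, 144.0, 720.0),
    mediabox=(0.0, 0.0, 612.0, 792.0),
    dpi=144.0,
)
```

If the y flip is skipped, or the page height used for the flip does not match the box the image was rendered from, the x values stay correct while the y values are shifted, which is at least consistent with the symptom described above.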

johandebeurs commented 1 year ago

Ack. My solution here was to convert the source PDFs to images, rerun OCR with appropriate Tesseract settings, and then process the result with pdfminer. This resolved the two issues identified in my previous comment.