Open sreeni5493 opened 3 years ago
Can you share the code that you are using to create this image?
Also seeing this issue when using pdfplumber (https://github.com/jsvine/pdfplumber/issues/538), which apparently depends on this library for extraction. The y coordinates are way off, while the x coordinates are correct. I'm extracting from bank statements, so unfortunately the files are not directly shareable. A couple of anecdotal observations:
In my experience, these types of documents are sometimes put together in odd ways, which can cause issues for extraction. There could also be issues if parts of the document are positioned relatively: those parts would be scaled to the screen size, so a fixed pixel size wouldn't work.
Ack. My solution here was to convert the source PDFs to images, rerun OCR with appropriate Tesseract settings, and then process the result with pdfminer. This resolved the two issues identified in my previous comment.
I am using pdfplumber, which is built on top of pdfminer.six, but the incorrect coordinates are coming from pdfminer.six itself.
Here is the PDF: v2.pdf
When the characters are highlighted, they appear like this. There is a huge offset in the y direction.
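
For anyone comparing numbers while debugging this: pdfminer.six reports bounding boxes in PDF space, with the origin at the bottom-left of the page and y increasing upward, whereas most image and drawing tools measure from the top-left. Below is a minimal sketch of that conversion in plain Python. The names `Box`, `to_top_left`, and `page_bottom` are hypothetical and not part of pdfminer.six or pdfplumber, and this is not a claim about what causes the offset in v2.pdf. It only makes it easier to check whether a page whose box does not start at y = 0 could account for a constant vertical shift.

```python
from typing import NamedTuple, Tuple


class Box(NamedTuple):
    """Bounding box measured from the top-left corner (hypothetical helper type)."""
    x0: float
    top: float
    x1: float
    bottom: float


def to_top_left(
    bbox: Tuple[float, float, float, float],
    page_height: float,
    page_bottom: float = 0.0,
) -> Box:
    """Convert a bottom-left-origin bbox (x0, y0, x1, y1) to top-left-origin.

    page_bottom is the y coordinate of the page's lower edge in PDF space.
    It is assumed to be 0 by default; if the page box starts elsewhere and
    this is ignored, every box ends up shifted vertically by that amount.
    """
    x0, y0, x1, y1 = bbox
    page_top = page_bottom + page_height
    # The upper edge (y1) becomes the distance from the top of the page,
    # and the lower edge (y0) becomes the bottom of the flipped box.
    return Box(x0, page_top - y1, x1, page_top - y0)


# Illustrative values only: a US Letter page (792 pt tall) and one character box.
char_bbox = (72.0, 700.0, 200.0, 712.0)
flipped = to_top_left(char_bbox, page_height=792.0)

# Same page height, but the page's lower edge sits at y = 50 in PDF space.
shifted = to_top_left((72.0, 750.0, 200.0, 762.0), page_height=792.0, page_bottom=50.0)
```

The x coordinates pass through unchanged, so an error in this step would show up only in y, which is at least consistent with the symptom described above.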