Closed datdao1998 closed 1 year ago
Hi @datdao1998, could you provide the PDF that you're using? Without it, it will be very difficult to diagnose your issue.
Hi @datdao1998, just checking back on this. Are you able to provide the PDF? You might also try repairing the PDF and seeing if that fixes the problem you've encountered.
This is definitely happening, and it's not limited to the extract_words() function.
The .chars property itself reports wrong coordinates for the characters.
Some of the coordinates for these words even fall outside the page's bounding box.
I can email you the PDF.
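For anyone who wants to check whether their own file shows the same symptom, here is a minimal sketch of the out-of-bounds test described in this comment. It is not code from the thread: the helper name `chars_outside_bbox` and the `tolerance` parameter are made up for illustration, and the sample data is hand-written. It assumes each character is a dict with `x0`, `x1`, `top` and `bottom` keys and that the page box is an `(x0, top, x1, bottom)` tuple, which is the shape pdfplumber uses for `page.chars` and `page.bbox`.

```python
def chars_outside_bbox(chars, bbox, tolerance=0.0):
    """Return the chars whose box extends beyond bbox = (x0, top, x1, bottom).

    Hypothetical helper: `chars` is any list of dicts carrying
    x0 / x1 / top / bottom keys, such as pdfplumber's page.chars.
    """
    left, top, right, bottom = bbox
    return [
        c
        for c in chars
        if c["x0"] < left - tolerance
        or c["x1"] > right + tolerance
        or c["top"] < top - tolerance
        or c["bottom"] > bottom + tolerance
    ]


# Hand-made sample data standing in for page.chars on a 612 x 792 pt page.
sample_bbox = (0, 0, 612, 792)
sample_chars = [
    {"text": "A", "x0": 72.0, "x1": 80.0, "top": 100.0, "bottom": 112.0},
    {"text": "B", "x0": 650.0, "x1": 658.0, "top": 100.0, "bottom": 112.0},
]

suspects = chars_outside_bbox(sample_chars, sample_bbox)
print([c["text"] for c in suspects])
```

With a real file, the same helper could be called as `chars_outside_bbox(page.chars, page.bbox)` on a page obtained from `pdfplumber.open(...)`. A small non-zero `tolerance` may be useful, since glyphs that legitimately touch the page edge can overshoot by a fraction of a point.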
Thanks, @sandzone. Please do email me the PDF; my email address is in my profile. And have you tried repairing the PDF?
Thanks. You are correct. Repairing the PDF resolved the issue. However, Ghostscript couldn't repair it; I had to use the Poppler command-line utilities instead.
Is there a way to integrate PDF repair into pdfplumber's extraction features?
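Until something like that exists, the repair step can be scripted ahead of extraction. The sketch below is an assumption-laden illustration, not the commenter's actual procedure: the thread does not say which Poppler utility was used, so `pdftocairo -pdf` (which rewrites a PDF through Poppler's Cairo backend) is chosen here as one plausible option. The function names `build_repair_command` and `repair_pdf` are hypothetical.

```python
import shutil
import subprocess


def build_repair_command(src, dst, tool="pdftocairo"):
    """Build the argv list for rewriting `src` into `dst` with Poppler.

    Assumption: `pdftocairo -pdf IN OUT` is used as the repair step;
    the thread does not name the exact utility.
    """
    return [tool, "-pdf", str(src), str(dst)]


def repair_pdf(src, dst):
    """Rewrite `src` to `dst` via pdftocairo and return `dst`."""
    if shutil.which("pdftocairo") is None:
        raise RuntimeError("pdftocairo (poppler-utils) was not found on PATH")
    # check=True raises CalledProcessError if Poppler cannot process the file.
    subprocess.run(build_repair_command(src, dst), check=True)
    return dst
```

The repaired file returned by `repair_pdf("broken.pdf", "repaired.pdf")` (placeholder names) can then be passed to `pdfplumber.open()` as usual. Keeping the command construction separate from the subprocess call makes it easy to swap in a different tool if `pdftocairo` does not help for a particular file.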
Thanks for confirming, @sandzone. And that's an interesting idea. I've opened a separate issue for that here: #824
Description
When using the extract_words() function, the coordinates of some extracted words are wrong. In my case, word['x0'] == word['x1'], although word['text'] is still correct.
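The symptom described above (words whose left and right edges coincide) can be screened for with a few lines of code. This is a minimal sketch rather than the reporter's original reproduction code: the helper name `zero_width_words` and the `eps` threshold are illustrative, and the sample list is hand-written in the shape of the dicts that `extract_words()` returns.

```python
def zero_width_words(words, eps=1e-6):
    """Return the words whose x0 and x1 are (almost) identical.

    Hypothetical helper: `words` is a list of dicts with at least
    'text', 'x0' and 'x1' keys, as produced by page.extract_words().
    """
    return [w for w in words if abs(w["x1"] - w["x0"]) <= eps]


# Hand-made sample data standing in for page.extract_words() output.
sample_words = [
    {"text": "Invoice", "x0": 72.0, "x1": 110.5, "top": 90.0, "bottom": 102.0},
    {"text": "Total", "x0": 300.0, "x1": 300.0, "top": 90.0, "bottom": 102.0},
]

broken = zero_width_words(sample_words)
print([w["text"] for w in broken])
```

On a real document, calling `zero_width_words(page.extract_words())` for each page gives a quick indication of whether the file is affected and might benefit from being repaired first.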
Code to reproduce the problem
Screenshots
Output
Visualize text box
Environment