Closed ManuelFay closed 2 years ago
Hi @ManuelFay, I appreciate your interest in the library. Yes, word coordinates can sometimes fall outside the page's dimensions. See https://github.com/jsvine/pdfplumber/issues/308 for an example. In such cases, assuming you don't need the content that is outside the page, you can also crop the page like so:
page = page.crop((0, 0, page.width, page.height))
so that you get only the words that are within the page's bounds.
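For anyone who would rather keep the page as-is and filter the extracted words, here is a minimal sketch of the same idea in plain Python. It assumes each word is a dict with "x0", "x1", "top" and "bottom" keys, which is the shape extract_words() returns. The helper name words_within_page and the sample values are hypothetical, and the check is stricter than a crop may be, since it drops any word that crosses a page edge.

```python
def words_within_page(words, page_width, page_height):
    """Return the words whose bounding box lies entirely inside the page.

    Each word is expected to be a dict with "x0", "x1", "top" and "bottom"
    keys. This helper is illustrative and not part of pdfplumber.
    """
    return [
        w for w in words
        if w["x0"] >= 0 and w["top"] >= 0
        and w["x1"] <= page_width and w["bottom"] <= page_height
    ]


# Hand-made sample data (not taken from a real PDF).
sample_words = [
    {"text": "inside", "x0": 10, "x1": 50, "top": 100, "bottom": 112},
    {"text": "outside", "x0": 10, "x1": 55, "top": 1500, "bottom": 1512},
]

kept = words_within_page(sample_words, page_width=600, page_height=850)
print([w["text"] for w in kept])
```

With a real page, the call would presumably look like words_within_page(page.extract_words(), page.width, page.height).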
Yep, that's what I'd suggest too, @samkit-jain. But, @ManuelFay, are you suggesting the coordinates are incorrect? Or merely that it's surprising to have words outside the page bounds?
If you believe the coordinates have been incorrectly extracted, that's probably a better issue for pdfminer.six, the library this project uses to get those coordinates.
Closing this issue for now, but feel free to continue the discussion.
Thanks a lot to both of you! It was surprising to have words outside the page, that is all ;) I will crop!
FYI, having the same issue when trying to extract content from various bank statements. Page dimensions are ~850px high, text coords are ~1500px. x-coords are correct, it is just the y-coords that are strange. Will search on pdfminer but posting here in case others are also challenged by this behaviour and want to know they aren't alone!
Describe the bug
Using extract_words on some pages leads to getting top and bottom coordinates that are outside the page dimensions. Closer inspection shows that some char coordinates also exceed the page dimensions.
Code to reproduce the problem
PDF file
Very sensitive data, I will redact if absolutely necessary. At first, I'm just asking to see if anyone has an opinion on this.
Expected behavior
Coordinates should fall within the page dimensions.
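One way to check this expectation on a given page is to list every word or char whose box crosses a page edge. The sketch below is only an illustration: the function name out_of_bounds_report and the sample values are made up, and it assumes each object is a dict with "x0", "x1", "top" and "bottom" keys (plus an optional "text"), as words and chars are in pdfplumber.

```python
def out_of_bounds_report(objects, page_width, page_height):
    """List (text, exceeded_edges) for every object that crosses a page edge.

    `objects` can be a list of word or char dicts. This helper is
    illustrative and not part of pdfplumber.
    """
    report = []
    for obj in objects:
        exceeded = []
        if obj["x0"] < 0:
            exceeded.append("x0")
        if obj["x1"] > page_width:
            exceeded.append("x1")
        if obj["top"] < 0:
            exceeded.append("top")
        if obj["bottom"] > page_height:
            exceeded.append("bottom")
        if exceeded:
            report.append((obj.get("text", ""), exceeded))
    return report


# Hand-made sample data (not taken from a real PDF).
sample_chars = [
    {"text": "a", "x0": 10, "x1": 16, "top": 100, "bottom": 112},
    {"text": "b", "x0": 20, "x1": 26, "top": 1500, "bottom": 1512},
]

report = out_of_bounds_report(sample_chars, page_width=600, page_height=850)
for text, edges in report:
    print(text, edges)
```

An empty report means every object satisfies the expectation. A non-empty one shows which edges are exceeded, which may help when deciding whether to crop or to raise the issue with pdfminer.six.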
Environment
Additional context
Very rare problem (happens once in a blue moon)