Word out of page dimension in extract words

jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.

MIT License

Word out of page dimension in extract words #538

Closed ManuelFay closed 2 years ago

ManuelFay commented 2 years ago

Describe the bug

Calling extract_words on some pages returns top and bottom coordinates that fall outside the page dimensions. Closer inspection shows that some char coordinates also exceed the page dimensions.

Code to reproduce the problem

    import pdfplumber
    from tqdm import tqdm

    with pdfplumber.open(pdf_path) as pdf:  # pdf_path: path to the affected PDF
        for page in tqdm(pdf.pages):
            page.extract_words()
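
To pin down which words are affected, a check along the following lines may help. It is only a sketch: `find_out_of_bounds` is a hypothetical helper, not part of pdfplumber, and it assumes the page's bounding box starts at (0, 0), so that the `x0`, `x1`, `top` and `bottom` values of each word can be compared directly against `page.width` and `page.height`.

```python
def find_out_of_bounds(words, page_width, page_height):
    """Return (word, violated_sides) pairs for words extending past the page.

    `words` is a list of dicts carrying the x0, x1, top and bottom keys that
    page.extract_words() returns. Assumption: the page's bounding box starts
    at (0, 0), so 0..page_width and 0..page_height are the valid ranges.
    """
    flagged = []
    for word in words:
        sides = []
        if word["x0"] < 0:
            sides.append("left")
        if word["x1"] > page_width:
            sides.append("right")
        if word["top"] < 0:
            sides.append("top")
        if word["bottom"] > page_height:
            sides.append("bottom")
        if sides:
            flagged.append((word, sides))
    return flagged


# Usage with pdfplumber (pdf_path is a placeholder):
# with pdfplumber.open(pdf_path) as pdf:
#     for page in pdf.pages:
#         hits = find_out_of_bounds(page.extract_words(), page.width, page.height)
#         for word, sides in hits:
#             print(page.page_number, word["text"], sides)
```

Reporting the violated sides separately shows whether only the vertical coordinates are off or the horizontal ones as well.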

PDF file

The PDF contains very sensitive data; I will redact and share it if absolutely necessary. For now, I'm just asking whether anyone has an opinion on this.

Expected behavior

Coordinates should be bounded by the page dimensions.

Environment

pdfminer.six            20200517           
pdfplumber              0.5.28

Python version: 3.8
OS: Ubuntu 20

Additional context

Very rare problem (happens once in a blue moon)

samkit-jain commented 2 years ago

Hi @ManuelFay, I appreciate your interest in the library. Yes, word coordinates can sometimes fall outside the page's dimensions. See https://github.com/jsvine/pdfplumber/issues/308 for an example. In such cases, assuming you don't need the content that lies outside the page, you can crop the page with page = page.crop((0, 0, page.width, page.height)) so that you get only the words that are within the page's bounds.
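
Applied to every page of a document, the crop workaround might look like the sketch below. The two function names are hypothetical and chosen for illustration; only `pdfplumber.open`, `pdf.pages`, `page.crop`, `page.width`, `page.height` and `extract_words` come from the library. Like the crop call above, it assumes the page's bounding box starts at (0, 0).

```python
def extract_words_within_page(page):
    """Crop to the page's nominal bounds, then extract words.

    Assumption: the page's bounding box starts at (0, 0), so the box
    (0, 0, width, height) covers exactly the visible page.
    """
    cropped = page.crop((0, 0, page.width, page.height))
    return cropped.extract_words()


def words_per_page(pdf_path):
    """Hypothetical helper: one list of in-bounds words per page of the PDF."""
    import pdfplumber  # imported here so the function above works on any page-like object

    with pdfplumber.open(pdf_path) as pdf:
        return [extract_words_within_page(page) for page in pdf.pages]
```

Cropping returns a new page object and leaves the original untouched, so the uncropped page is still available if the out-of-bounds content turns out to matter.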

jsvine commented 2 years ago

Yep, that's what I'd suggest too, @samkit-jain. But, @ManuelFay, are you suggesting the coordinates are incorrect? Or merely that it's surprising to have words outside the page bounds?

If you believe the coordinates have been incorrectly extracted, that's probably a better issue for pdfminer.six, the library this project uses to get those coordinates.

Closing this issue for now, but feel free to continue the discussion.

ManuelFay commented 2 years ago

Thanks a lot to both of you! It was surprising to have words outside the page, that is all ;) I will crop!

johandebeurs commented 1 year ago

FYI, I'm having the same issue when trying to extract content from various bank statements. Page dimensions are ~850px high, but text coords are ~1500px. The x-coords are correct; it is just the y-coords that are strange. I will search the pdfminer issues, but I'm posting here in case others are also challenged by this behaviour and want to know they aren't alone!