Wrong coordinates of words when using function extract_words()

jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.

MIT License

6.57k stars, 659 forks

Wrong coordinates of words when using function extract_words() #799

Closed datdao1998 closed 1 year ago

datdao1998 commented 1 year ago

Description

When using the extract_words() function, the coordinates of some extracted words are wrong. In my case, word['x0'] == word['x1'] (although word['text'] is still correct).

Code to reproduce the problem

import pdfplumber

pdf_path = 'test.pdf'

with pdfplumber.open(pdf_path) as pdf:
    for page in pdf.pages:
        # Each word is a dict holding its text and bounding-box coordinates
        for word in page.extract_words():
            print(word['x0'], word['x1'], word['text'])

Screenshots

[Screenshot: output]

[Screenshot: visualized text boxes]

Environment

pdfplumber version: 0.6.0
Python version: 3.9.12
OS: Linux

jsvine commented 1 year ago

Hi @datdao1998, could you provide the PDF that you're using? Without it, it will be very difficult to diagnose your issue.

jsvine commented 1 year ago

Hi @datdao1998, just checking back on this. Are you able to provide the PDF? You might also try repairing the PDF and seeing if that fixes the problem you've encountered.

sandzone commented 1 year ago

This is definitely happening. It's not just the extract_words() function.

.chars itself has wrong coordinates for the characters.

Some of the coordinates for these words are even outside the page's BoundingBox.

I can email you the PDF.
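For anyone trying to confirm the same symptom, here is a minimal sketch of a sanity check for the two problems described in this thread: boxes with zero width (x0 equal to x1) and boxes that fall outside the page's bounding box. The helper name find_suspect_boxes and its min_width parameter are hypothetical and not part of pdfplumber. It only assumes that each item is a dict with x0, x1, top and bottom keys, as the dicts from extract_words() and .chars are, and that the page bounding box is an (x0, top, x1, bottom) tuple.

```python
from typing import Any, List, Mapping, Sequence, Tuple


def find_suspect_boxes(
    items: Sequence[Mapping[str, Any]],
    page_bbox: Tuple[float, float, float, float],
    min_width: float = 0.0,
) -> List[Tuple[Mapping[str, Any], str]]:
    """Return (item, reason) pairs for boxes that look wrong.

    Hypothetical helper, not a pdfplumber API. `items` are dicts with
    x0, x1, top and bottom keys; `page_bbox` is (x0, top, x1, bottom).
    """
    page_x0, page_top, page_x1, page_bottom = page_bbox
    suspects = []
    for item in items:
        if item["x1"] - item["x0"] <= min_width:
            # The symptom from the original report: x0 == x1
            suspects.append((item, "zero or negative width"))
        elif (
            item["x0"] < page_x0
            or item["x1"] > page_x1
            or item["top"] < page_top
            or item["bottom"] > page_bottom
        ):
            # The symptom from this comment: box outside the page
            suspects.append((item, "outside page bbox"))
    return suspects
```

With pdfplumber this could be called as find_suspect_boxes(page.extract_words(), page.bbox) or find_suspect_boxes(page.chars, page.bbox). A non-empty result does not prove that the PDF is damaged, but it is a cheap hint that repairing the file may be worth trying.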

jsvine commented 1 year ago

Thanks, @sandzone. Please do email me the PDF; my email address is in my profile. And have you tried repairing the PDF?

sandzone commented 1 year ago

Thanks. You are correct. Repairing the PDF resolved the issue. However, Ghostscript couldn't repair it; I had to use the Poppler command-line utilities for that.

Is there a way to integrate pdf repair as a part of pdfplumber's extraction features?
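Until something like that exists, a repair step can be run before handing the file to pdfplumber. The sketch below is one possible approach under stated assumptions: this thread does not say which Poppler utility was used, so the choice of pdftocairo with its -pdf flag (which rewrites a PDF as a new PDF) is an assumption, and the function names build_repair_command and repair_pdf are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List, Union

PathLike = Union[str, Path]


def build_repair_command(src: PathLike, dst: PathLike) -> List[str]:
    """Build the Poppler command that rewrites `src` as a new PDF at `dst`.

    Assumption: pdftocairo is the utility used; the thread does not say.
    """
    return ["pdftocairo", "-pdf", str(src), str(dst)]


def repair_pdf(src: PathLike, dst: PathLike) -> Path:
    """Rewrite a PDF through Poppler and return the path of the new file."""
    if shutil.which("pdftocairo") is None:
        raise RuntimeError("pdftocairo (Poppler) was not found on PATH")
    # check=True raises CalledProcessError if Poppler cannot process the file
    subprocess.run(build_repair_command(src, dst), check=True)
    return Path(dst)
```

The returned path can then be passed to pdfplumber.open() in place of the original file. Rewriting a PDF this way may change details of the file, so it is worth comparing the extracted text before and after.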

jsvine commented 1 year ago

Thanks for confirming, @sandzone. And that's an interesting idea. I've opened a separate issue for that here: #824