Plumbing pdf results in mixed characters of neighbouring words

jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.

MIT License


Plumbing pdf results in mixed characters of neighbouring words #764

Closed. XuShanJiang closed this issue 1 year ago.

XuShanJiang commented 1 year ago

Describe the bug

The text of the PDF is not extracted correctly. Words are incomplete, and characters of neighbouring words are mixed together.

Code to reproduce the problem

import pdfplumber

# Print the extracted text of every page in the PDF.
with pdfplumber.open("woo-besluit-contacten-rabo-pveu.pdf") as pdf:
    for page in pdf.pages:
        print(page.extract_text())

PDF file

woo-besluit-contacten-rabo-pveu.pdf


Expected behavior

I expected the text to be extracted correctly. I have imported several similar documents whose words come out normally, just as they appear in the PDF file.

Actual behavior

Some similar PDF files (including this one) are extracted very oddly: the characters of neighbouring words are mixed. For example, Pagina 7 is read as Pag7iv naa7n.

Screenshots

[image: pdfpu]

Environment

pdfplumber version: 0.7.5
Python version: 3.8
OS: Linux

Additional context

I tried to copy the whole pdf and paste it in a text editor manually, which works totally fine...

jsvine commented 1 year ago

Hi @XuShanJiang, have you tried adjusting the y_tolerance setting described in the .extract_text(...) documentation?
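A minimal sketch of what that experiment could look like, assuming only that the page object has pdfplumber's .extract_text(y_tolerance=...) method. The helper names extract_with_tolerances and main, and the particular tolerance values, are made up for illustration.

```python
def extract_with_tolerances(page, y_tolerances=(1, 3, 5, 8)):
    """Return {y_tolerance: extracted text} for a single page.

    `page` is expected to be a pdfplumber page; the tolerance values
    are arbitrary starting points, not recommendations.
    """
    return {tol: page.extract_text(y_tolerance=tol) for tol in y_tolerances}


def main(pdf_path):
    # Imported here so the helper above can be used without pdfplumber installed.
    import pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        results = extract_with_tolerances(pdf.pages[0])

    # Show the start of each result so the variants can be compared by eye.
    for tol, text in results.items():
        print(f"--- y_tolerance={tol} ---")
        print((text or "")[:200])
```

Calling main("woo-besluit-contacten-rabo-pveu.pdf") would print the first 200 characters of page 1 for each tolerance, which makes it easy to see whether any value helps.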

jsvine commented 1 year ago

Hi @XuShanJiang, just checking back on this.

XuShanJiang commented 1 year ago

Hi @jsvine, I tried adjusting y_tolerance and experimented with different values. The text does change, but it is still not correct.

jsvine commented 1 year ago

Thank you for letting me know. A few observations:

This is a rasterized PDF whose text has been OCR'ed. That is: the text is not the original digital text, but rather another piece of software's attempt to recreate it. These types of PDFs are generally harder to work with in pdfplumber because they lack a lot of the important original information.
Moreover, an examination of the character positioning via pdfplumber's visual debugging indicates that the OCR software has positioned the text in an unusual way, one that creates overlaps that explain the results you're getting. E.g.: [image]

That said, if you use these settings, I believe you'll get what you're looking for: page.extract_text(layout=True, use_text_flow=True). (use_text_flow tells the layout engine to use the characters in the sequence they are provided in the file, rather than their x/y position.) Does that work for you?
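To illustrate why file order can beat x/y order on a PDF like this, here is a toy model in plain Python. It is not pdfplumber's actual algorithm, and the coordinates are invented: the point is only that when one character's box is misplaced, sorting by position interleaves neighbouring words, while the sequence in the file stays readable.

```python
# Toy character records: text plus left edge (x0) and vertical position (top).
# All coordinates are invented to mimic an OCR layer with overlapping boxes.
chars = [
    {"text": "P", "x0": 0, "top": 100},
    {"text": "a", "x0": 6, "top": 100},
    {"text": "g", "x0": 12, "top": 100},
    {"text": "i", "x0": 18, "top": 100},
    {"text": "n", "x0": 24, "top": 100},
    {"text": "a", "x0": 30, "top": 100},
    {"text": " ", "x0": 36, "top": 100},
    {"text": "7", "x0": 15, "top": 100},  # hypothetical misplaced character
]


def text_by_position(chars):
    """Order characters top-to-bottom, then left-to-right (position-based)."""
    ordered = sorted(chars, key=lambda c: (c["top"], c["x0"]))
    return "".join(c["text"] for c in ordered)


def text_by_flow(chars):
    """Keep characters in the order they were stored (flow-based)."""
    return "".join(c["text"] for c in chars)


print(repr(text_by_position(chars)))  # 'Pag7ina '
print(repr(text_by_flow(chars)))      # 'Pagina 7'
```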

Rustemhak commented 1 year ago

Hello @jsvine, thank you for the solution. Adding these options works for extract_text. How can I use the same options with extract_tables?

jsvine commented 1 year ago

Right now, that's not possible with pdfplumber, but adding that feature sounds like a good idea.

For the specific PDF discussed above, however, I don't think it'd work, due to the character-positioning issues. (I.e., many characters that should be inside a particular table cell are not.)

jsvine commented 1 year ago

@Rustemhak, in the latest version of pdfplumber (v0.8.0), you can now pass all .extract_text(...) arguments to .extract_tables(...), prefixing them with text_. So { "text_use_text_flow": True }.
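A short sketch of how that might look in practice, assuming pdfplumber v0.8.0 or later and that the settings dict is passed through the table_settings argument of .extract_tables(...). The helper names as_table_text_settings and extract_tables_with_text_flow are hypothetical, not part of pdfplumber.

```python
def as_table_text_settings(text_kwargs):
    """Prefix .extract_text(...) keyword names with "text_" for table settings."""
    return {f"text_{name}": value for name, value in text_kwargs.items()}


def extract_tables_with_text_flow(pdf_path, page_index=0):
    # Imported here so the helper above can be used without pdfplumber installed.
    import pdfplumber

    # Builds {"text_use_text_flow": True}, as in the comment above.
    settings = as_table_text_settings({"use_text_flow": True})
    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_index]
        return page.extract_tables(table_settings=settings)
```

As noted earlier in the thread, this may still not give usable tables for the specific PDF discussed here, because many characters are positioned outside the cells they belong to.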