jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License

Plumbing pdf results in mixed characters of neighbouring words #764

Closed: XuShanJiang closed this issue 1 year ago

XuShanJiang commented 1 year ago

Describe the bug

The text of the PDF is not extracted correctly. Words are incomplete, and characters from neighbouring words are mixed together.

Code to reproduce the problem

import pdfplumber

with pdfplumber.open("woo-besluit-contacten-rabo-pveu.pdf") as pdf:
    # Print the extracted text of every page
    for page in pdf.pages:
        print(page.extract_text())

PDF file

woo-besluit-contacten-rabo-pveu.pdf


Expected behavior

I expected the text to be extracted correctly. I have imported several similar documents whose words came out normally, as they appear in the PDF file.

Actual behavior

Some similar PDF files (including this one) are extracted very strangely: the characters of neighbouring words are mixed. For example, "Pagina 7" is read as "Pag7iv naa7n".

Screenshots

[Screenshot: pdfpu]

Environment

Additional context

I also tried copying the whole PDF and pasting it into a text editor manually, which works totally fine.

jsvine commented 1 year ago

Hi @XuShanJiang, have you tried adjusting the y_tolerance setting described in the .extract_text(...) documentation?
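For anyone following along, here is a minimal sketch of what trying different y_tolerance values could look like. The helper names (extract_with_tolerances, compare_first_page) and the candidate values 1, 3 and 5 are made up for illustration; only .extract_text(y_tolerance=...) comes from pdfplumber itself.

```python
def extract_with_tolerances(page, y_tolerances=(1, 3, 5)):
    """Return {y_tolerance: extracted text} for a single pdfplumber page.

    `page` is expected to be a pdfplumber Page; the tolerance values are
    illustrative guesses, not recommendations.
    """
    return {tol: page.extract_text(y_tolerance=tol) for tol in y_tolerances}


def compare_first_page(pdf_path, y_tolerances=(1, 3, 5)):
    """Open a PDF and run the comparison on its first page (hypothetical helper)."""
    import pdfplumber  # imported lazily so the helper above has no hard dependency

    with pdfplumber.open(pdf_path) as pdf:
        return extract_with_tolerances(pdf.pages[0], y_tolerances)


# Example call (needs the PDF from this issue on disk):
# results = compare_first_page("woo-besluit-contacten-rabo-pveu.pdf")
# for tol, text in results.items():
#     print(f"--- y_tolerance={tol} ---")
#     print(text)
```

Printing the variants side by side makes it easier to see whether any tolerance value actually untangles the words, or whether the output only gets scrambled in a different way.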

jsvine commented 1 year ago

Hi @XuShanJiang, just checking back on this.

XuShanJiang commented 1 year ago

Hi @jsvine, I tried adjusting y_tolerance and experimented with different values. The text does indeed change, but not in the correct way.

jsvine commented 1 year ago

Thank you for letting me know. A few observations:

[Screenshot]

Rustemhak commented 1 year ago

Hello @jsvine, thank you for the solution. Adding these options works for extract_text. How can I use the same options with extract_tables?

jsvine commented 1 year ago

Right now, that's not possible with pdfplumber, but adding that feature sounds like a good idea.

For the specific PDF discussed above, however, I don't think it'd work, due to the character-positioning issues. (I.e., many characters that should be inside a particular table cell are not.)

jsvine commented 1 year ago

@Rustemhak, in the latest version of pdfplumber (v0.8.0), you can now pass all .extract_text(...) arguments to .extract_tables(...), prefixing them with text_. For example: { "text_use_text_flow": True }.
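A rough sketch of how the text_ prefixing might be used in practice. It assumes the prefixed options go into the table_settings dict accepted by .extract_tables(...); the helper names (text_settings_for_tables, extract_tables_with_text_options) are hypothetical and not part of pdfplumber.

```python
def text_settings_for_tables(text_kwargs):
    """Prefix .extract_text(...) keyword names with "text_" for table extraction."""
    return {f"text_{name}": value for name, value in text_kwargs.items()}


def extract_tables_with_text_options(pdf_path, **text_kwargs):
    """Extract tables from every page, forwarding text options (hypothetical helper).

    Assumes pdfplumber >= 0.8.0 and that the prefixed options belong in
    `table_settings`.
    """
    import pdfplumber  # imported lazily so the pure helper above stays dependency-free

    settings = text_settings_for_tables(text_kwargs)
    with pdfplumber.open(pdf_path) as pdf:
        return [page.extract_tables(table_settings=settings) for page in pdf.pages]


# Example call (needs a PDF on disk):
# tables_per_page = extract_tables_with_text_options(
#     "woo-besluit-contacten-rabo-pveu.pdf", use_text_flow=True
# )
```

Keeping the prefixing in its own small function means the same keyword arguments can be reused for both .extract_text(...) and table extraction without retyping them.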