jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License

Reading different words as one continuous word. #752

Closed ameymn closed 1 year ago

ameymn commented 1 year ago

Describe the bug

When reading the PDF with pdfplumber, some individual words are read together as a single long word.

Code to reproduce the problem

image

PDF file

pdfplumber bug.pdf

If you need to redact text in a sensitive PDF, you can run it through JoshData/pdf-redactor.

Expected behavior

I was expecting the words to be read individually, as they have spaces between them.

Actual behavior

It ignores the spaces between words and reads many individual words as one continuous long word.

Screenshots

image image

Environment

Additional context

jsvine commented 1 year ago

Hi @ameymn, have you tried using the x_tolerance= parameter of .extract_text(...)? For example:

Screen Shot 2022-11-03 at 4 34 25 PM
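
The code in the screenshot is not preserved in this thread, so the following is only a minimal sketch of the suggestion, not the maintainer's original snippet. It assumes that lowering x_tolerance below pdfplumber's default of 3 makes .extract_text(...) insert spaces at smaller horizontal gaps between characters. The helper name compare_x_tolerances, the tolerance values tried, and the choice of the first page are all assumptions made for illustration.

```python
def compare_x_tolerances(page, tolerances=(3, 2, 1)):
    """Return {x_tolerance: extracted text} for each tolerance value.

    `page` is expected to be a pdfplumber page (anything exposing
    .extract_text(x_tolerance=...)). The helper name and the default
    tolerance values are hypothetical, chosen only for this sketch.
    """
    return {tol: page.extract_text(x_tolerance=tol) for tol in tolerances}


def main(pdf_path):
    # Imported here so the helper above has no hard dependency;
    # pdfplumber is third-party (pip install pdfplumber).
    import pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[0]  # assumption: the affected text is on page 1
        for tol, text in compare_x_tolerances(page).items():
            print(f"--- x_tolerance={tol} ---")
            print(text)


if __name__ == "__main__":
    import sys

    if len(sys.argv) > 1:
        main(sys.argv[1])
```

Running the script with the path to the affected PDF prints the extracted text once per tolerance, which should make it easier to see which value, if any, separates the merged words.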

jsvine commented 1 year ago

Closing this issue due to likely resolution posted, lack of response, and lack of original PDF to test against. Feel free, however, to continue the discussion.