py-pdf / pypdf

A pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files
https://pypdf.readthedocs.io/en/latest/

Missing spaces in extract_text() method #1328

Closed Sunguru closed 1 month ago

Sunguru commented 2 years ago

Spaces are missing in the output of the extract_text() method. See attached PDFs. The text itself is extracted correctly, but almost all fields come out with no spaces.

Environment

$ python -c "import pypdf;print(pypdf.__version__)"
pypdf==3.14.0

Code + PDF

PDF: 0004.pdf

from pypdf import PdfReader, __version__

print(f"pypdf=={__version__}")

reader = PdfReader("0004.pdf")

page = reader.pages[0]
extracted = page.extract_text().split("Description:")[1].split("8/11/22")[0]
print(extracted)

gives:

 Reportingcrudeoilleak.
Leakwasisolatedtowell
pad.Segmentoflinewas
immediatelyisolated,now
estimatedat5barrelsofoil
spilt.Rootcausestill
unknownatthistime.

expected (copy-pasted with Google Chrome):

Reporting crude oil leak.
Leak was isolated to well
pad. Segment of line was
immediately isolated, now
estimated at 5 barrels of oil
spilt. Root cause still
unknown at this time.

0000.pdf

Yes, you may add to the tests. It is public data from here: https://northdakota.hazconnect.com/ListIncidentPublic.aspx

P.S. Thank you for the great package!

tpcgold commented 1 year ago

Is there any workaround for this so far? I ran into the exact same issue with pypdf.

ssjkamei commented 1 month ago

Hi.

The problem seems to be a mismatch between the space size retrieved from the font and the actual spacing in the PDF. In the area in question, the text pieces were -277.75 apart, whereas the size retrieved from the font was 278.0. In the larger-font area I checked, the spacing was also -277.75, but the size I got from the font was 361.0.

If I round up the actual values, I think it will work. However, I am not familiar with how PDFs and fonts work, so I cannot tell whether that is the correct approach.

Is this helpful?

Try rounding up abs(float(op)) here:

https://github.com/py-pdf/pypdf/blob/8dd9fcb8d0ed06fa5230bd9a5ce5ffea80d04245/pypdf/_page.py#L1992
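To make the rounding suggestion concrete, here is a minimal sketch of the comparison using the numbers quoted above. It is not pypdf's actual code: `is_word_gap` and its `round_up` flag are hypothetical names, and the sketch assumes the check is simply "absolute offset >= space width".

```python
import math


def is_word_gap(op: float, space_width: float, round_up: bool = False) -> bool:
    """Hypothetical helper: decide whether a positioning offset counts as a space.

    `op` is the offset found in the PDF (e.g. -277.75) and `space_width`
    is the space size retrieved from the font (e.g. 278.0).
    """
    gap = abs(float(op))
    if round_up:
        # Suggested tweak: round the measured gap up before comparing.
        gap = math.ceil(gap)
    return gap >= space_width


print(is_word_gap(-277.75, 278.0))                 # False: 277.75 < 278.0
print(is_word_gap(-277.75, 278.0, round_up=True))  # True: ceil(277.75) == 278
print(is_word_gap(-277.75, 361.0, round_up=True))  # False: still below 361.0
```

Under these assumptions, rounding up recovers the space for the 278.0 case but not for the larger font, where the retrieved size is 361.0.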

ssjkamei commented 1 month ago

Sorry, one addition. It seems the code is not extracting space characters as such, but judging whether pieces of text are separated by more than the size of a space. I think that makes it difficult to detect spaces set in a small font size between text in larger font sizes.
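The threshold-based behaviour described here can be sketched as follows. This is a simplified illustration, not pypdf's implementation: `join_with_gaps` is a hypothetical function, and the way the sample line is split into strings and offsets is assumed for the example (only the values -277.75, 278.0 and 361.0 come from the comments above).

```python
import math
from typing import Sequence, Union


def join_with_gaps(
    items: Sequence[Union[str, float]],
    space_width: float,
    round_up: bool = False,
) -> str:
    """Hypothetical sketch: join text pieces, inserting a space only where
    the numeric offset between them is at least `space_width`."""
    text = ""
    for item in items:
        if isinstance(item, str):
            text += item
            continue
        gap = abs(float(item))
        if round_up:
            gap = math.ceil(gap)
        # Insert a space only for a large enough gap, and never twice in a row.
        if gap >= space_width and text and not text.endswith(" "):
            text += " "
    return text


# Assumed layout: words separated by the -277.75 offset reported above.
items = ["Reporting", -277.75, "crude", -277.75, "oil", -277.75, "leak."]

print(join_with_gaps(items, 278.0))                 # Reportingcrudeoilleak.
print(join_with_gaps(items, 278.0, round_up=True))  # Reporting crude oil leak.
print(join_with_gaps(items, 361.0, round_up=True))  # Reportingcrudeoilleak.
```

The last call shows the difficulty raised in this comment: when the space size comes from a larger font (361.0), the same 277.75 gap stays below the threshold even after rounding up.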