Closed Sunguru closed 1 month ago
Any workaround for this so far? I ran into the exact same issue with pypdf.
Hi.
The problem seems to be a mismatch between the space size retrieved from the font and the actual spacing in the PDF. In the area I checked, the gaps in the PDF were -277.75, whereas the size retrieved from the font was 278.0. For the larger font size, the gap in the area I checked was also -277.75, but the size I got from the font was 361.0.
If I round up the actual values, I think it will work. However, I am not familiar with how PDFs and fonts work, so I cannot tell whether that is the correct approach.
Is this helpful?
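To make the numbers concrete, here is a minimal sketch of the comparison using only the two values reported above. The helper name is_space and its round_up flag are hypothetical and not part of pypdf; the sketch only assumes that a gap counts as a space when its absolute value is at least the space size taken from the font.

```python
import math

# Values reported in this thread.
space_width = 278.0   # size retrieved from the font
adjustment = -277.75  # actual gap found in the PDF


def is_space(adjustment, space_width, round_up=False):
    """Hypothetical helper: does this gap count as a space?"""
    gap = abs(float(adjustment))
    if round_up:
        # Proposed workaround: round the actual gap up before comparing.
        gap = math.ceil(gap)
    return gap >= space_width


plain = is_space(adjustment, space_width)                    # 277.75 vs 278.0
rounded = is_space(adjustment, space_width, round_up=True)   # 278 vs 278.0
```

Without rounding, the gap falls just short of the threshold; with rounding up, it reaches it. Whether rounding is the right fix in general is a separate question.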
Try rounding up abs(float(op)) here:
https://github.com/py-pdf/pypdf/blob/8dd9fcb8d0ed06fa5230bd9a5ce5ffea80d04245/pypdf/_page.py#L1992
Sorry, I have an addition. It seems that the code does not extract spaces as such; it judges whether two pieces of text are separated by more than the size of a space. So I think it is difficult to detect spaces set in a small font size between text in larger font sizes.
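A simplified sketch of that threshold behaviour, to show why the same gap can produce a space under one font and none under another. The function join_tj and the sample strings are hypothetical; this is loosely modelled on the comparison at the linked line and is not pypdf's actual code.

```python
def join_tj(operands, space_width):
    """Hypothetical stand-in: join strings, adding a space only when a
    numeric gap is at least as large as the given space size."""
    text = ""
    for op in operands:
        if isinstance(op, str):
            text += op
        elif abs(float(op)) >= space_width and text and not text.endswith(" "):
            text += " "
    return text


# Same gap as reported in this thread, between two made-up strings.
operands = ["Hello", -277.75, "World"]

small = join_tj(operands, 277.0)   # threshold below the gap
normal = join_tj(operands, 278.0)  # size reported for the normal font
large = join_tj(operands, 361.0)   # size reported for the larger font
```

Only the first call inserts a space. With the sizes 278.0 and 361.0 reported above, the gap of -277.75 stays under the threshold, so the two strings are joined without a space.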
Missing spaces in the output of the extract_text() method. See the attached PDFs. Text is extracted nicely, but it comes out with no spaces in almost all fields.
Environment
Code + PDF
PDF: 0004.pdf
gives:
expected (copy-pasted from Google Chrome):
0000.pdf
Yes, you may add it to the tests. It is public data from here: https://northdakota.hazconnect.com/ListIncidentPublic.aspx
P.S. Thank you for the great package!