Spaces (that do not exist in the original PDF) appear in the output of extract_text()

py-pdf / pypdf

A pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files

https://pypdf.readthedocs.io/en/latest/

Spaces (that do not exist in the original PDF) appear in the output of extract_text() #2336

Closed: renanbirck closed this issue 1 hour ago

renanbirck commented 9 months ago

I am trying to parse this PDF. However, the output of extract_text() contains a bunch of spaces that are not in the original PDF.

See the screenshot for what I mean: the original PDF is on the left, with the output of extract_text() next to it (e.g. "Av. Beir a Rio" should be "Av. Beira Rio", "Cen tro" should be "Centro").

If I copy/paste from Okular or other PDF reader to a text document, it is copied correctly, so I know the PDF file is not broken.

Environment

I am using Python 3.12 in Fedora 39.

$ python -m platform
Linux-6.6.4-200.fc39.x86_64-x86_64-with-glibc2.38

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==3.17.1, crypt_provider=('pycryptodome', '3.19.0'), PIL=10.1.0

Code + PDF

This is a minimal, complete example that shows the issue:

from pypdf import PdfReader
reader = PdfReader('Pesquisa-de-Precos-Combustiveis-novembro-2023.pdf')
text = reader.pages[0].extract_text()

stefan6419846 commented 9 months ago

This is a known limitation with multiple similar issues already being reported and is explained inside the docs as well: https://pypdf.readthedocs.io/en/latest/user/extract-text.html#whitespaces

TL;DR: How a text layer is being retrieved depends on the actual library implementation - each tends to have its own advantages and limits. In this specific case, the pdftotext layout mode (based upon poppler, one of the standard PDF libraries for Linux systems) seems to provide "correct" results, as well as mutool convert.
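For readers who want to try the pdftotext route from Python, here is a minimal sketch. It assumes the poppler command-line tools are installed and on PATH; the helper names `pdftotext_command` and `extract_with_pdftotext` are made up for this illustration and are not part of pypdf or poppler.

```python
import shutil
import subprocess


def pdftotext_command(pdf_path, first_page=1, last_page=1):
    """Build the pdftotext command line for layout-mode extraction."""
    return [
        "pdftotext",
        "-layout",              # keep the physical layout of the page
        "-f", str(first_page),  # first page to convert (1-based)
        "-l", str(last_page),   # last page to convert (1-based)
        pdf_path,
        "-",                    # write the text to stdout instead of a file
    ]


def extract_with_pdftotext(pdf_path, first_page=1, last_page=1):
    """Return the text of the page range; raise if pdftotext is unavailable."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext (poppler) is not installed")
    result = subprocess.run(
        pdftotext_command(pdf_path, first_page, last_page),
        capture_output=True,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return result.stdout.decode("utf-8")
```

Calling `extract_with_pdftotext('Pesquisa-de-Precos-Combustiveis-novembro-2023.pdf')` would then return the first page as a string, which can be compared against the pypdf output.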

renanbirck commented 9 months ago

> This is a known limitation with multiple similar issues already being reported and is explained inside the docs as well: https://pypdf.readthedocs.io/en/latest/user/extract-text.html#whitespaces

I understand. Is there any way I can work around it in pypdf? Other PDF libraries (like pymupdf, based on mupdf) don't have that problem.
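For comparison, this is roughly what the pymupdf route looks like. It is only a sketch, assuming PyMuPDF is installed and importable as `fitz`; the function name `extract_with_pymupdf` is invented for this example.

```python
def extract_with_pymupdf(pdf_path, page_index=0):
    """Return the plain text of one page using PyMuPDF (0-based page index)."""
    # Imported lazily so PyMuPDF stays an optional dependency of the script.
    import fitz

    with fitz.open(pdf_path) as doc:
        return doc[page_index].get_text()


# Usage (needs the PDF from this issue in the working directory):
# print(extract_with_pymupdf('Pesquisa-de-Precos-Combustiveis-novembro-2023.pdf'))
```

This sidesteps the problem rather than working around it inside pypdf, so it only helps if adding a second PDF library is acceptable.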

stefan6419846 commented 9 months ago

You might want to have a look at the code from https://github.com/py-pdf/pypdf/discussions/2038#discussioncomment-7736074.

pubpub-zz commented 6 months ago

@renanbirck the extra spaces are the output of the "tt" special character conversion. I don't know how to get the correct output: the translation is not part of the ToUnicode field. I also don't know how other programs do the translation.

ssjkamei commented 19 hours ago

[Screenshot: ActualText_ti]

I think #2882 will fix many of the whitespace issues. I think the ligatures are the same problem as #1351.
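On the ligature side, a small post-processing sketch may help when the extracted string contains Unicode ligature code points such as U+FB01 ("fi"). It relies on compatibility normalization from the standard library. The helper name `expand_ligatures` is hypothetical, and this does not remove spurious spaces. It also cannot help with a "tt" glyph, since Unicode has no presentation form for that ligature.

```python
import unicodedata


def expand_ligatures(text):
    """Replace Latin ligature presentation forms (U+FB00..U+FB06) with plain letters."""
    return "".join(
        # NFKD applies the compatibility decomposition, e.g. U+FB01 -> "fi".
        unicodedata.normalize("NFKD", ch) if "\ufb00" <= ch <= "\ufb06" else ch
        for ch in text
    )
```

Only the ligature block is normalized, so accented characters such as "é" in the rest of the text are left exactly as extracted.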

stefan6419846 commented 1 hour ago

According to https://github.com/py-pdf/pypdf/pull/2882#issuecomment-2388783234, this has just been fixed.