Closed: renanbirck closed this issue 1 hour ago
This is a known limitation with multiple similar issues already being reported and is explained inside the docs as well: https://pypdf.readthedocs.io/en/latest/user/extract-text.html#whitespaces
TL;DR: How a text layer is retrieved depends on the actual library implementation; each tends to have its own advantages and limits. In this specific case, the pdftotext layout mode (based upon poppler, one of the standard PDF libraries for Linux systems) seems to provide "correct" results, as does mutool convert.
I understand. Is there any way I can work around it in pypdf? Other PDF libraries (like pymupdf, based on mupdf) don't have that problem.
You might want to have a look at the code from https://github.com/py-pdf/pypdf/discussions/2038#discussioncomment-7736074.
@renanbirck the extra spaces come from the conversion of the "tt" ligature special character. I don't know how to produce the correct output: the translation is not part of the ToUnicode field. I also don't know how other programs do the translation.
I think #2882 will fix many of the whitespace issues. I think the ligatures are the same problem as #1351.
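As a partial stopgap for the ligature side of this (not pypdf-specific, and it only helps when a ligature survives extraction as its single Unicode codepoint rather than being mis-translated), the standard library's compatibility normalization folds ligature codepoints back into plain letters:

```python
import unicodedata

def fold_ligatures(text: str) -> str:
    # NFKC compatibility normalization expands ligature codepoints such as
    # U+FB01 "fi" and U+FB06 "st" into their constituent letters.
    return unicodedata.normalize("NFKC", text)
```

For example, `fold_ligatures("\ufb01le")` yields `"file"`.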
According to https://github.com/py-pdf/pypdf/pull/2882#issuecomment-2388783234, this has just been fixed.
I am trying to parse this PDF. However, the output of extract_text() contains a bunch of spaces that are not in the original PDF.
See the screenshot for what I mean - the original PDF on the left, the output of extract_text() on the right (e.g. "Av. Beir a Rio" should be "Av. Beira Rio", "Cen tro" should be "Centro"):
If I copy/paste from Okular or other PDF reader to a text document, it is copied correctly, so I know the PDF file is not broken.
Environment
I am using Python 3.12 in Fedora 39.
Code + PDF
This is a minimal, complete example that shows the issue: