Closed mrbahrani closed 1 year ago
See https://pypdf.readthedocs.io/en/stable/user/extract-text.html#why-text-extraction-is-hard. The Tm operator changes the scale/translation of the text/images. Without heavy page reconstruction, a Tm can only be interpreted as a section break, emitted as " " or "\n", so no modification can be proposed at the moment.
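To illustrate why a Tm on its own is ambiguous, here is a minimal sketch of the kind of reconstruction that would be needed. It is not pypdf's implementation. `Run`, `join_runs`, `font_size` and `gap_ratio` are hypothetical names and thresholds, and the sketch assumes the rendered width of each run is already known (which in practice requires font metrics) and ignores the scale part of the matrix and the CTM.

```python
from dataclasses import dataclass


@dataclass
class Run:
    tm: tuple     # the six Tm operands (a, b, c, d, e, f)
    width: float  # rendered width of the run; assumed to be known here
    text: str


def join_runs(runs, font_size=10.0, gap_ratio=0.2):
    """Join text runs, adding a separator only when the Tm jump is large."""
    parts = []
    prev_end_x = prev_y = None
    for run in runs:
        x, y = run.tm[4], run.tm[5]  # e, f: the translation part of Tm
        if prev_y is not None:
            if abs(y - prev_y) > 0.5 * font_size:
                parts.append("\n")  # baseline moved: treat as a new line
            elif x - prev_end_x > gap_ratio * font_size:
                parts.append(" ")   # visible horizontal gap: treat as a space
            # otherwise the run continues the previous word: no separator
        parts.append(run.text)
        prev_end_x, prev_y = x + run.width, y
    return "".join(parts)
```

Without the `width` of the previous run, the gap cannot be computed, which is why a plain extractor has to fall back to always emitting a separator.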
I used the library to extract text from a PDF file. Some words are broken into two parts by an unwanted space.
Text from PDF: Text from text file:
Environment
OS: Windows 10
Python: 3.11
PyPDF: the latest
Code + PDF
This is the way I used the library
The PDF is confidential, so I cannot share it. I reproduced the problem on multiple confidential PDFs.
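Since the file cannot be shared, a sketch like the following may help others reproduce the symptom on their own documents. It is not the exact code from this report: `extract_all_text` is a hypothetical wrapper around the usual `PdfReader` / `extract_text()` calls, and `find_split_words` is a hypothetical stdlib-only helper that compares the extracted text with a reference text and lists reference words that only appear split in two.

```python
def extract_all_text(path: str) -> str:
    """Extract the text of every page (requires pypdf to be installed)."""
    from pypdf import PdfReader  # imported lazily; only needed for real PDFs

    reader = PdfReader(path)
    return "\n".join(page.extract_text() for page in reader.pages)


def find_split_words(extracted: str, reference: str) -> list[str]:
    """Return reference words that show up in `extracted` only as two tokens."""
    ref_words = set(reference.split())
    tokens = extracted.split()
    seen = set(tokens)
    found = []
    for left, right in zip(tokens, tokens[1:]):
        joined = left + right
        # Heuristic: the joined form is a known word and never appears whole.
        if joined in ref_words and joined not in seen:
            found.append(joined)
    return found
```

For example, `find_split_words(extract_all_text("report.pdf"), open("report.txt").read())` would list the affected words; both file names are placeholders. The comparison is a heuristic and can produce false positives for short words.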
Traceback
This is the subset of the (operation, operator) tuples that I printed while tracing the problem. The Tm operator triggers a call to the orientation function, which adds the unwanted space. I have briefly reviewed the PDF 1.7 specification, but I still do not know exactly what the Tm operator does. With the space-adding sections of the orientation function removed, the text was extracted perfectly.
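For reference, the PDF specification defines `a b c d e f Tm` as setting the text matrix (and text line matrix), where `a`..`d` carry scale/rotation and `e`, `f` the translation. One way to inspect the Tm operators on a page without patching the library is the `visitor_operand_before` hook of `extract_text()`. The sketch below is only a tracing aid; `make_tm_logger` and `dump_tm` are hypothetical helper names, and the log format is an arbitrary choice.

```python
def make_tm_logger():
    """Build a visitor that records every Tm operator it is shown."""
    log = []

    def visitor(operator, operands, cm, tm):
        # pypdf passes the operator as bytes, e.g. b"Tm" or b"Tj".
        if operator == b"Tm":
            a, b, c, d, e, f = (float(v) for v in operands)
            log.append({"scale": (a, d), "translation": (e, f)})

    return visitor, log


def dump_tm(path: str, page_index: int = 0):
    """Return the Tm operators seen on one page (requires pypdf)."""
    from pypdf import PdfReader  # imported lazily; only needed for real PDFs

    visitor, log = make_tm_logger()
    page = PdfReader(path).pages[page_index]
    page.extract_text(visitor_operand_before=visitor)
    return log
```

Comparing the logged translations around a broken word shows whether the Tm only moves the text position by a small amount inside a word, which is the case where the added space is unwanted.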