py-pdf / pypdf

A pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files
https://pypdf.readthedocs.io/en/latest/

Unwanted Space between the letters of a word #1993

Closed mrbahrani closed 1 year ago

mrbahrani commented 1 year ago

I used the library to extract the text from a PDF file. Some words are split into two parts by an unwanted space.

Text from PDF: [image]

Text from text file: [image]

Environment

OS: Windows 10
Python: 3.11
PyPDF: the latest

Code + PDF

This is how I used the library:

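import PyPDF2

def convert_pdf_to_text(file_name):
    out = ""
    # Open in binary mode; the with-block closes the file when done.
    with open(file_name, 'rb') as pdf_file_obj:
        pdf_reader = PyPDF2.PdfReader(pdf_file_obj, strict=True)
        for page in pdf_reader.pages:
            # extract_text() returns the text of one page as a string.
            out += page.extract_text()
    return out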

The PDF contains confidential data, so I cannot share it. I reproduced the problem on multiple confidential PDFs.

Traceback

This is the subset of the (operation, operator) tuples that I printed to trace the extraction: [image]

The Tm operator triggers a call to the orientation function, which adds the unwanted space. I have briefly reviewed the PDF 1.7 specification, but I still do not know exactly what the Tm operator does.

After I removed the space-adding sections of the orientation function, the text was extracted perfectly.

pubpub-zz commented 1 year ago

https://pypdf.readthedocs.io/en/stable/user/extract-text.html#why-text-extraction-is-hard

Tm changes the scale/translation of the text / images. Without heavy page reconstruction, a Tm can only be interpreted as the start of a new section, separated by " " or "\n". Currently no modification can be proposed.
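To illustrate the point about Tm, here is a minimal sketch in plain Python. It is not pypdf's code. The operand order `a b c d e f` and the way a point is mapped through the matrix follow the PDF specification; the names `apply_tm` and `guess_separator`, the `line_tolerance` parameter, and the sample coordinates are all hypothetical and exist only for this example. The sketch shows that, knowing only where each Tm places the text origin and not the widths of the glyphs drawn before it, an extractor can tell a line change from a same-line move but cannot tell a word gap from a repositioning inside a word.

```python
from typing import Tuple

# Tm operands in the order they appear in a content stream: a b c d e f
Matrix = Tuple[float, float, float, float, float, float]


def apply_tm(tm: Matrix, point: Tuple[float, float]) -> Tuple[float, float]:
    """Map a point through the matrix set by Tm (a, d scale; e, f translate)."""
    a, b, c, d, e, f = tm
    x, y = point
    return (a * x + c * y + e, b * x + d * y + f)


def guess_separator(prev_tm: Matrix, next_tm: Matrix,
                    line_tolerance: float = 1.0) -> str:
    """Hypothetical heuristic that compares only the two text origins.

    Without glyph widths we do not know where the previous run ended,
    so a same-line Tm is treated as a section break and emitted as " ".
    """
    _, prev_y = apply_tm(prev_tm, (0.0, 0.0))
    _, next_y = apply_tm(next_tm, (0.0, 0.0))
    # A vertical jump larger than the tolerance is treated as a new line.
    if abs(next_y - prev_y) > line_tolerance:
        return "\n"
    return " "


# Made-up sample values: two runs on one baseline, then a run on a lower line.
first = (1.0, 0.0, 0.0, 1.0, 72.0, 700.0)
second = (1.0, 0.0, 0.0, 1.0, 90.5, 700.0)
third = (1.0, 0.0, 0.0, 1.0, 72.0, 686.0)

print(apply_tm(second, (0.0, 0.0)))
print(repr(guess_separator(first, second)))
print(repr(guess_separator(second, third)))
```

Deciding that `second` continues the same word as `first` would require knowing the width of every glyph drawn after `first`, which is the kind of page reconstruction the comment above refers to.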