py-pdf / pypdf

A pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files
https://pypdf.readthedocs.io/en/latest/

Ligature issue when converting PDF to text #1351

Open gargarvin opened 2 years ago

gargarvin commented 2 years ago

I am having a ligature issue with this PDF: the 'fi', 'fl' and 'ff' ligatures are returned as NULL characters.

#598 is similar to this issue.

MVCE: Code + PDF

from PyPDF2 import PdfReader

reader = PdfReader("Inspection_redacted.pdf")
for page in reader.pages:
    print(page.extract_text())

PDF

pubpub-zz commented 2 years ago

I did a quick analysis of the first page. Using some debug traces, I analysed the line starting with PLUMBING SYSTEM - FAUCETS, VALVES AND CONNECTED FIXTURES, looking at the sequence "ut off ha". The font I identified is F1. Its transcoding table is the following:

8 beginbfchar
<03> <0020>
<05> <0022>
<18> <0035>
<1B> <0038>
<1D> <003A>
<62> <00A0>
<E9> <0000>
<EA> <0000>
endbfchar
6 beginbfrange
<09> <16> <0026>
<24> <2C> <0041>
<2E> <3D> <004B>
<44> <4C> <0061>
<4E> <53> <006B>
<55> <5C> <0072>
endbfrange

The following codes are transcoded and appended for this sequence:

b'\x00X' -> u
b'\x00W' -> t
b'\x00\x03' -> (space)
b'\x00R' -> o
b'\x00\xe9' -> (\x00)
b'\x00\x03' -> (space)
b'\x00K' -> h
b'\x00D' -> a
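
As a rough check of the trace above, the sketch below hard-codes the bfchar and bfrange entries quoted for font F1 and decodes the same codes. The names `BFCHAR`, `BFRANGE` and `decode_code` are made up for illustration; this is not how pypdf handles CMaps internally. It assumes the high byte of every two-byte code is zero, as in the trace.

```python
# Entries copied from the transcoding table quoted above (font F1).
BFCHAR = {
    0x03: 0x0020, 0x05: 0x0022, 0x18: 0x0035, 0x1B: 0x0038,
    0x1D: 0x003A, 0x62: 0x00A0, 0xE9: 0x0000, 0xEA: 0x0000,
}
# (first code, last code, Unicode value of the first code)
BFRANGE = [
    (0x09, 0x16, 0x0026), (0x24, 0x2C, 0x0041), (0x2E, 0x3D, 0x004B),
    (0x44, 0x4C, 0x0061), (0x4E, 0x53, 0x006B), (0x55, 0x5C, 0x0072),
]


def decode_code(code):
    """Hypothetical helper: map one character code to text, or None if unmapped."""
    if code in BFCHAR:
        return chr(BFCHAR[code])
    for first, last, start in BFRANGE:
        if first <= code <= last:
            # Codes inside a range map to consecutive Unicode values.
            return chr(start + (code - first))
    return None


# Low bytes of the codes from the trace: X, W, 0x03, R, 0xE9, 0x03, K, D
codes = [0x58, 0x57, 0x03, 0x52, 0xE9, 0x03, 0x4B, 0x44]
text = "".join(decode_code(c) for c in codes)
print(repr(text))
```

If the table really is the only source of text for F1, the position where "ff" is expected comes out as a NULL character, because <E9> is mapped to <0000>.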

When using SumatraPDF and pdfminer.six, I get the same results, with '\x00'. The only tool that seems to report the text properly (using copy-paste) is Acrobat Reader, but I don't know where it gets the result from.

Help analysing this case would be welcome. (@MartinThoma, can you set the labels accordingly?)

gargarvin commented 2 years ago

Also of note: this tool seems to be able to convert the PDF successfully without using any sort of OCR.

PavelHightTower commented 9 months ago

I resolved it like this. The 'ff' case does not work like the others, which is why I map chr(0) to 'ff'.

page.extract_text().translate(str.maketrans({chr(0): 'ff', 0xFB01: 'fi', 0xFB02: 'fl', 0xFB03: 'ffi', 0xFB04: 'ffl'}))

gargarvin commented 9 months ago

The above method seems to replace every ligature with 'ff'. I also noticed my original PDF does not load so here it is again. Inspection_redacted.pdf
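
One possible explanation, assuming the table quoted earlier for font F1 is representative: the codes <E9> and <EA> both map to <0000>, so different ligatures are already indistinguishable by the time the text is extracted, and mapping chr(0) to a single string cannot tell them apart. A minimal sketch with made-up strings (not taken from the PDF) and a hypothetical `LIGATURE_FIX` name:

```python
# Same translation table as in the workaround above.
LIGATURE_FIX = str.maketrans({
    chr(0): "ff",
    0xFB01: "fi",
    0xFB02: "fl",
    0xFB03: "ffi",
    0xFB04: "ffl",
})

# Hypothetical extraction results: every ligature arrived as NULL,
# whatever it was in the original document.
extracted_off = "shut o\x00"       # intended word: "off"
extracted_fixture = "\x00xture"    # intended word: "fixture"

print(extracted_off.translate(LIGATURE_FIX))
print(extracted_fixture.translate(LIGATURE_FIX))
```

The first string comes out as intended, while the second becomes "ffxture", which would match the behaviour described above.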