Open gargarvin opened 2 years ago
I did a quick analysis of the first page. Using some debug traces, I analysed the line starting with PLUMBING SYSTEM - FAUCETS, VALVES AND CONNECTED FIXTURES:, looking at the sequence: ut off ha. The font I've identified is F1. Its transcoding table is the following:
8 beginbfchar
<03> <0020>
<05> <0022>
<18> <0035>
<1B> <0038>
<1D> <003A>
<62> <00A0>
<E9> <0000>
<EA> <0000>
endbfchar
6 beginbfrange
<09> <16> <0026>
<24> <2C> <0041>
<2E> <3D> <004B>
<44> <4C> <0061>
<4E> <53> <006B>
<55> <5C> <0072>
endbfrange
The following codes are transcoded and added for that sequence:
b'\x00X' -> u
b'\x00W' -> t
b'\x00\x03' -> (space)
b'\x00R' -> o
b'\x00\xe9' -> (\x00)
b'\x00\x03' -> (space)
b'\x00K' -> h
b'\x00D' -> a
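The lookup traced above can be reproduced with a small standalone sketch. It is not pypdf's implementation: the table is copied by hand from the bfchar/bfrange entries above, the helper names `build_to_unicode` and `decode` are made up for illustration, and two-byte big-endian codes are assumed because that is how they appear in the debug traces.

```python
# ToUnicode entries copied from the table above (font F1).
BFCHAR = [
    (0x03, 0x0020), (0x05, 0x0022), (0x18, 0x0035), (0x1B, 0x0038),
    (0x1D, 0x003A), (0x62, 0x00A0), (0xE9, 0x0000), (0xEA, 0x0000),
]
BFRANGE = [
    (0x09, 0x16, 0x0026), (0x24, 0x2C, 0x0041), (0x2E, 0x3D, 0x004B),
    (0x44, 0x4C, 0x0061), (0x4E, 0x53, 0x006B), (0x55, 0x5C, 0x0072),
]


def build_to_unicode(bfchar, bfrange):
    """Expand bfchar and bfrange entries into a code -> character dict."""
    table = {code: chr(uni) for code, uni in bfchar}
    for first, last, start in bfrange:
        for offset in range(last - first + 1):
            table[first + offset] = chr(start + offset)
    return table


def decode(raw, table):
    """Decode a byte string of two-byte big-endian codes (assumed width)."""
    codes = [int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2)]
    # Codes missing from the table become U+FFFD so they stay visible.
    return "".join(table.get(code, "\ufffd") for code in codes)


TABLE = build_to_unicode(BFCHAR, BFRANGE)
raw = b"\x00X\x00W\x00\x03\x00R\x00\xe9\x00\x03\x00K\x00D"
decoded = decode(raw, TABLE)
print(repr(decoded))  # 'ut o\x00 ha'
```

The table itself maps both <E9> and <EA> to U+0000, so any extractor that relies only on this ToUnicode map will emit '\x00' for those glyphs.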
When using SumatraPDF and pdfminer.six, I get the same result, with '\x00'. The only tool that seems to extract the text properly (using copy-paste) is Acrobat Reader, but I don't know where it gets the result from.
Help analysing this case would be welcome (@MartinThoma, can you set the labels accordingly?)
Also of note - this tool seems to be able to convert the PDF successfully without using any sort of OCR.
I resolved it like this. The 'ff' case does not work like the others, which is why I map chr(0) to 'ff':
page.extract_text().translate(str.maketrans({chr(0): 'ff', 0xFB01: 'fi', 0xFB02: 'fl', 0xFB03: 'ffi', 0xFB04: 'ffl'}))
The above method seems to replace every ligature with 'ff'. I also noticed that my original PDF does not load, so here it is again: Inspection_redacted.pdf
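A minimal sketch of why that happens, using a made-up sample string rather than real output from the PDF: if every ligature is extracted as the same '\x00' character, a translate table cannot tell them apart, so each one becomes 'ff'.

```python
# Same translation table as in the workaround above.
LIGATURE_FIX = str.maketrans(
    {chr(0): "ff", 0xFB01: "fi", 0xFB02: "fl", 0xFB03: "ffi", 0xFB04: "ffl"}
)

# Hypothetical extracted text: "shut off fixture", with both the 'ff' and
# the 'fi' ligature returned as '\x00' (an assumption for illustration).
extracted = "shut o\x00 \x00xture"

fixed = extracted.translate(LIGATURE_FIX)
print(fixed)  # shut off ffxture
```

The U+FB01 to U+FB04 entries only help when the extractor returns the actual Unicode ligature code points; they never fire on text where the ligatures have already collapsed to '\x00'.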
I am having a ligature issue with this PDF. The 'fi', 'fl' and 'ff' ligatures are returned as NULL characters.
598 is similar to this issue.
MVCE: Code + PDF
PDF