Open MartinThoma opened 1 year ago
https://superuser.com/q/278562/64857 might be worth a try as well to fix the PDF
I've analyzed the PDF and I'm puzzled. The font dictionary is:
{'/Name': '/F11', '/Subtype': '/TrueType', '/FirstChar': 32, '/Type': '/Font', '/BaseFont': '/IMZSPX+CourierNew,Bold', '/FontDescriptor': IndirectObject(459, 0, 1920817586256), '/ToUnicode': IndirectObject(462, 0, 1920817586256), '/LastChar': 255, '/Widths': IndirectObject(463, 0, 1920817586256)}
and the content of ToUnicode is:
/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CIDSystemInfo 3 dict dup begin
/Registry (Adobe) def
/Ordering (UCS) def
/Supplement 0 def
end def
/CMapName /Adobe-Identity-UCS def
/CMapType 2 def
/WMode 0 def
1 begincodespacerange
<0000> <FFFF>
endcodespacerange
3 beginbfchar
<0000> <0000>
<0001> <0000>
<0002> <0000>
endbfchar
endcmap
CMapName currentdict /CMap
defineresource pop
end end
The codespacerange declares a 2-byte encoding, as described in https://adobe-type-tools.github.io/font-tech-notes/pdfs/5014.CIDFont_Spec.pdf (pages 49-50).
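To make that reading of the CMap concrete, here is a rough sketch that pulls the code length and the bfchar entries out of the ToUnicode text quoted above. The helper names (`section`, `code_lengths`, `bfchar_map`) are made up for illustration and are not pypdf functions; the regex only handles the simple `<src> <dst>` form seen in this file.

```python
from __future__ import annotations

import re

# ToUnicode CMap excerpt copied from the font above (header lines omitted).
CMAP = """\
1 begincodespacerange
<0000> <FFFF>
endcodespacerange
3 beginbfchar
<0000> <0000>
<0001> <0000>
<0002> <0000>
endbfchar
"""

# Matches one "<hex> <hex>" pair, e.g. "<0000> <FFFF>".
HEX_PAIR = re.compile(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>")


def section(cmap: str, name: str) -> str:
    """Return the text between 'begin<name>' and 'end<name>' ('' if absent)."""
    match = re.search(rf"begin{name}(.*?)end{name}", cmap, re.DOTALL)
    return match.group(1) if match else ""


def code_lengths(cmap: str) -> set[int]:
    """Byte lengths of the character codes declared by the codespace ranges."""
    pairs = HEX_PAIR.findall(section(cmap, "codespacerange"))
    return {len(low) // 2 for low, _high in pairs}


def bfchar_map(cmap: str) -> dict[int, str]:
    """Map each source code to its target string (targets are UTF-16BE)."""
    pairs = HEX_PAIR.findall(section(cmap, "bfchar"))
    return {int(src, 16): bytes.fromhex(dst).decode("utf-16-be") for src, dst in pairs}


print(code_lengths(CMAP))
print(bfchar_map(CMAP))
```

Run against this excerpt, the declared code length is 2 bytes, and the only three mapped codes (0, 1 and 2) all point to U+0000, so the map says nothing useful about the codes 32 to 255 that the font dictionary advertises.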
When you decode the byte sequence as UTF-16BE, as expected for 2-byte codes, you get Chinese characters, which is why the extracted text is wrong.
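The effect is easy to reproduce in isolation. In the snippet below the byte string `raw` is a made-up stand-in for a string from the content stream (the real bytes in this PDF will differ), and `latin-1` is only a placeholder for whatever single-byte encoding the font actually uses.

```python
# Hypothetical content-stream string; the real bytes in the PDF will differ.
raw = b"Form"

# Read as 2-byte codes, which is what the codespacerange <0000> <FFFF> suggests.
as_two_byte = raw.decode("utf-16-be")

# Read as 1-byte codes; latin-1 is a stand-in for the font's actual encoding.
as_one_byte = raw.decode("latin-1")

# Show the code points rather than the glyphs so the result is unambiguous.
print([f"U+{ord(c):04X}" for c in as_two_byte])
print(as_one_byte)
```

Pairing the bytes turns four Latin letters into two code points in the CJK ranges (U+466F and U+726D), while reading them one byte at a time gives back the original text.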
Adobe, pdfminer and pdf.js all extract the text successfully, but I do not understand how they work out that the codes should be read one byte at a time.
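One possible explanation, offered as an assumption rather than something checked against the source of those tools: the PDF specification treats strings shown with a simple font (Type1, TrueType, Type3) as one character code per byte, and only Type0 composite fonts use multi-byte codes. The font here has /Subtype /TrueType with /FirstChar 32 and /LastChar 255, so a reader could ignore the 2-byte codespacerange of the ToUnicode CMap altogether. A minimal sketch of that rule, using a hypothetical helper `code_width` on a plain dict:

```python
# Font subtypes the PDF specification classes as "simple" fonts.
SIMPLE_FONT_SUBTYPES = {"/Type1", "/MMType1", "/TrueType", "/Type3"}


def code_width(font: dict, cmap_width: int) -> int:
    """Guess how many bytes make up one character code for this font.

    Hypothetical helper, not a pypdf API. For simple fonts it returns 1
    regardless of the CMap; otherwise it trusts the caller-supplied width.
    """
    if font.get("/Subtype") in SIMPLE_FONT_SUBTYPES:
        return 1
    return cmap_width


# Relevant keys of the /F11 font dictionary shown earlier in the thread.
font_f11 = {"/Subtype": "/TrueType", "/FirstChar": 32, "/LastChar": 255}

print(code_width(font_f11, cmap_width=2))
```

With that rule, /F11 is read one byte at a time even though its ToUnicode CMap declares a 2-byte codespace.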
Any help is welcome! 😣😫
@MasterOdin, any ideas?
@MasterOdin, any chance you could have a look?
To be analysed: see the PDF 1.7 specification, page 432.
I'm trying to extract text (see https://stackoverflow.com/q/75587416/562769).
Environment
Which environment were you using when you encountered the problem?
Code + PDF
This is a minimal, complete example that shows the issue:
The PDF: https://efast2-filings-public.s3.amazonaws.com/prd/2013/09/13/20130913143132P030383431491001.pdf
The extracted output
The expected output
Other interesting stuff
pdftotext gives:
But the 3Heights PDF validator says it's ok:
PyMuPDF (fitz) manages to get the right text (although the whitespace and text positions are not correct). I tried to clean the PDF with
mutool clean -daf 20130913143132P030383431491001.pdf in.pdf
and then fed the result into pypdf. Still the same issue. Also, using
qpdf --linearize 20130913143132P030383431491001.pdf in.pdf
leads to the same result in pypdf.