extract_text leads to Chinese characters instead of ASCII

MartinThoma commented 1 year ago

I'm trying to extract text (see https://stackoverflow.com/q/75587416/562769 )

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
Linux-5.4.0-139-generic-x86_64-with-glibc2.31

$ python -c "import pypdf;print(pypdf.__version__)"
3.5.0

Code + PDF

This is a minimal, complete example that shows the issue:

from io import BytesIO

from pypdf import PdfReader

def get_pdf_from_url(url: str, name: str) -> bytes:
    """Download the file to `name` (up to 3 attempts) and return its bytes."""
    import ssl
    import urllib.request
    from pathlib import Path
    from urllib.error import HTTPError

    cache_path = Path(name)
    # Disables certificate verification; only acceptable for this reproduction.
    ssl._create_default_https_context = ssl._create_unverified_context
    attempts = 3
    while attempts > 0:
        try:
            with urllib.request.urlopen(url) as response, cache_path.open(
                "wb",
            ) as out_file:
                out_file.write(response.read())
            break
        except HTTPError:
            attempts -= 1
            if attempts == 0:
                # Out of retries: re-raise instead of reading an empty file.
                raise
    return cache_path.read_bytes()

url = "https://efast2-filings-public.s3.amazonaws.com/prd/2013/09/13/20130913143132P030383431491001.pdf"
reader = PdfReader(BytesIO(get_pdf_from_url(url, "20130913143132P030383431491001.pdf")))
page_41 = reader.pages[40].extract_text()
print(page_41)

The PDF: https://efast2-filings-public.s3.amazonaws.com/prd/2013/09/13/20130913143132P030383431491001.pdf

The extracted output

Schedule H, line 4i
Schedule of A ssets (Held A t End of Year)
For the plan year beginning and ending
Name of plan
Employer Identification Number Three-digit
plan number
(a) (b) Identity of issue, borrower, lessor, or similar party(c) Description of investment including maturity date,
rate of interest, collateral, par, or maturity value(d) Cost (e) Current value〱⼰ㄯ㈰ㄲ ㄲ⼳ㄯ㈰ㄲ
1-800LOANMART 401(k) Plan
㤵ⴴ㠶㌳㠹 001
䩯桮⁈慮捯捫⁕十 䱩晥獴祬攠䅧杲敳獩癥 102,734
䩯桮⁈慮捯捫⁕十 䱩晥獴祬攠䝲潷瑨 159,791
䩯桮⁈慮捯捫⁕十 䱩晥獴祬攠䉡污湣敤 285,623
䩯桮⁈慮捯捫⁕十 䱩晥獴祬攠䵯摥牡瑥 9,130
䩯桮⁈慮捯捫⁕十 䱩晥獴祬攠䍯湳敲癡瑩癥 ㌰ⰶ㔰
䩯桮⁈慮捯捫⁕十 Real Est. Securities Fund ㄵⰷ㔹
䩯桮⁈慮捯捫⁕十 䑆䄠䕭敲杩湧⁍慲步瑳⁖慬略 3,957
䩯桮⁈慮捯捫⁕十 佰灥湨敩浥爠䑥癥汯灩湧⁍歴 ㈷ⰶ㤳
䩯桮⁈慮捯捫⁕十 䵩搠䍡瀠却潣欠䙵湤 156
䩯桮⁈慮捯捫⁕十 DFA U.S. Small Cap Fund ㄳⰵ㠲
䩯桮⁈慮捯捫⁕十 卭慬氠䍡瀠䝲潷瑨⁉湤數 7,444
䩯桮⁈慮捯捫⁕十 䥮瑬⁅煵楴礠䥮摥砠䙵湤 3,980
䩯桮⁈慮捯捫⁕十 EuroPacific Growth Fund 6,866
䩯桮⁈慮捯捫⁕十 International Growth Fund 138
䩯桮⁈慮捯捫⁕十 SSgA Mid Value Index Fund 1,739
䩯桮⁈慮捯捫⁕十 Small Cap Value Index 1,617
䩯桮⁈慮捯捫⁕十 噡汵攠䙵湤 5,106
䩯桮⁈慮捯捫⁕十 T. Rowe Price Sml Cap Val 5,510
䩯桮⁈慮捯捫⁕十 Fidelity ContraFund ㄴⰴ㔶
䩯桮⁈慮捯捫⁕十 噡汵攠䥮摥砠䙵湤 ㄴⰸ〸
䩯桮⁈慮捯捫⁕十 㔰〠䥮摥砠䙵湤 5,983
䩯桮⁈慮捯捫⁕十 䍡灩瑡氠䥮捯浥⁂畩汤敲 ㄰ⰵ㐰
䩯桮⁈慮捯捫⁕十 䅭敲楣慮⁂慬慮捥搠䙵湤 9,078
䩯桮⁈慮捯捫⁕十 PIMCO Global Bond ㄲⰸ㈶
䩯桮⁈慮捯捫⁕十 偉䵃传呯瑡氠剥瑵牮 ㄲⰸ㈶
䩯桮⁈慮捯捫⁕十 Money Market Fund 1,041
䩯桮⁈慮捯捫⁕十 卨潲琠呥牭⁆敤敲慬 0

The expected output

Schedule H, line 4i
Schedule of Assets (Held At End of Year)
For the plan year beginning   01/01/2012
and ending 12/31/2012
Name of plan        1-800LOANMART 401(k) Plan
Employer Identification Number   95-4863389
Three-digit
plan number    001
(a)       (b) Identity of issue, borrower, lessor, or similar party (c) Description of investment including maturity date,  rate of interest, collateral, par, or maturity value   (d) Cost     (e) Current value

John Hancock USA              Lifestyle Aggressive         102,734
John Hancock USA              Lifestyle Growth               159,791
John Hancock USA              Lifestyle Balanced            285,623
John Hancock USA              Lifestyle Moderate           9,130
John Hancock USA              Lifestyle Conservative     30,650
John Hancock USA              Real Est. Securities Fund  15,759
John Hancock USA              DFA Emerging Markets Value      3,957
John Hancock USA              Oppenheimer Developing Mkt        27,693
John Hancock USA              Mid Cap Stock Fund                      156
John Hancock USA             DFA U.S. Small Cap Fund          13,582
John Hancock USA              Small Cap Growth Index           7,444
John Hancock USA              Intl Equity Index Fund              3,980
John Hancock USA             EuroPacific Growth Fund            6,866
John Hancock USA             International Growth Fund         138
John Hancock USA             SSgA Mid Value Index Fund          1,739
John Hancock USA             Small Cap Value Index                     1,617
John Hancock USA            Value Fund                                   5,106
John Hancock USA            T. Rowe Price Sml Cap Val           5,510
John Hancock USA            Fidelity ContraFund                  14,456
John Hancock USA            Value Index Fund                   14,808
John Hancock USA           500 Index Fund                         5,983
John Hancock USA           Capital Income Builder            10,540
John Hancock USA           American Balanced Fund          9,078
John Hancock USA           PIMCO Global Bond                12,826
John Hancock USA          PIMCO Total Return                12,826
John Hancock USA          Money Market Fund             1,041
John Hancock USA          Short Term Federal               0
...

Other interesting stuff

pdftotext gives:

Internal Error: xref num 403 not found but needed, try to reconstruct<0a>

But the 3Heights PDF validator says it's ok:

The document does conform to the PDF 1.4 standard.

PyMuPDF (fitz) extracts the right text, although the whitespace and text positions are not correct. I tried cleaning the file with mutool clean -daf 20130913143132P030383431491001.pdf in.pdf and feeding the result into pypdf, but the issue remains.

Also using qpdf --linearize 20130913143132P030383431491001.pdf in.pdf leads to the same result in pypdf.

MartinThoma commented 1 year ago

https://superuser.com/q/278562/64857 might be worth a try as well to fix the PDF

pubpub-zz commented 1 year ago

I've analyzed the PDF and I'm puzzled:

The content stream contains the text in fully readable form: it is encoded with one byte per character.

The font referenced for this text is /F11. Its dictionary is:

{'/Name': '/F11', '/Subtype': '/TrueType', '/FirstChar': 32, '/Type': '/Font', '/BaseFont': '/IMZSPX+CourierNew,Bold', '/FontDescriptor': IndirectObject(459, 0, 1920817586256), '/ToUnicode': IndirectObject(462, 0, 1920817586256), '/LastChar': 255, '/Widths': IndirectObject(463, 0, 1920817586256)}

and the content of ToUnicode is:

/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CIDSystemInfo 3 dict dup begin
/Registry (Adobe) def
/Ordering (UCS) def
/Supplement 0 def
end def
/CMapName /Adobe-Identity-UCS def
/CMapType 2 def
/WMode 0 def
1 begincodespacerange
<0000> <FFFF>
endcodespacerange
3 beginbfchar
<0000> <0000>
<0001> <0000>
<0002> <0000>
endbfchar
endcmap
CMapName currentdict /CMap
defineresource pop
end end

The codespacerange declares a 2-byte encoding, as described in https://adobe-type-tools.github.io/font-tech-notes/pdfs/5014.CIDFont_Spec.pdf (pages 49-50).

When you decode the byte sequence as utf-16-be, as expected for 2-byte encoded glyphs, you get Chinese characters. This is why the extracted text is wrong.
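The effect is easy to reproduce without the PDF. The sketch below pairs up single-byte characters and decodes them as utf-16-be, then reverses the operation. The function names are hypothetical and chosen only for illustration. Using latin-1 for the reverse step is an assumption that happens to work for this plain ASCII text.

```python
def misdecode(raw: bytes) -> str:
    """Decode one-byte-per-character text as if it were UTF-16-BE."""
    return raw.decode("utf-16-be")


def recover(garbled: str) -> str:
    """Undo the mis-decoding: back to bytes, then one byte per character."""
    return garbled.encode("utf-16-be").decode("latin-1")


# Every pair of ASCII bytes collapses into one (mostly CJK) code point.
print(misdecode(b"SA"))                          # 十
print(recover(misdecode(b"John Hancock USA")))   # John Hancock USA

# A string with an odd number of bytes cannot be decoded this way at all.
try:
    misdecode(b"102,734")
except UnicodeDecodeError as exc:
    print("odd length:", exc.reason)
```

Applied to "John Hancock USA" (16 bytes), `misdecode` yields the eight characters that open each row of the extracted output. Odd-length strings raise an error instead, which would be consistent with values such as 102,734 surviving intact in that output, although the thread does not confirm this.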

Adobe, pdfminer and pdf.js all extract the text successfully, but I do not understand how they determine that the text should be decoded one byte at a time.

Help is welcome! 😣😫

pubpub-zz commented 1 year ago

@MasterOdin, Any ideas ?

pubpub-zz commented 1 year ago

@MasterOdin any chance for you to have a look ?

pubpub-zz commented 1 year ago

Note: page 432 of the PDF 1.7 specification still needs to be analysed for this issue.
