py-pdf / pypdf

A pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files
https://pypdf.readthedocs.io/en/latest/

Inconsistency in parsing similar files #2632

Closed cppt closed 5 months ago

cppt commented 5 months ago

I'm looking to parse a collection of PDFs that share a similar format, but I'm noticing inconsistent results.

For instance, this file is parsed by pypdf in a way I can make sense of: https://disclosures-clerk.house.gov/public_disc/ptr-pdfs/2022/20022132.pdf

while this file is parsed resulting in text that's formatted wildly differently: https://disclosures-clerk.house.gov/public_disc/ptr-pdfs/2022/20020448.pdf

For example, capitalization is erratic despite the file taking a format very similar to the first.

There does appear to be consistency within a given year, but not over time (i.e., 2023 files are parsed consistently, but differently from 2021 files). Any guidance on what might be causing this, or what could be improved?

I'm using the Python code below for reference. The output for the two files referenced above follows it.

    import io

    import pypdf
    import requests

    # y (year) and i (filing ID) are not defined in the original snippet;
    # the values here match the first example file.
    y, i = 2022, 20022132
    url = 'https://disclosures-clerk.house.gov/public_disc/ptr-pdfs/' + str(y) + '/' + str(i) + '.pdf'

    # Download the PDF into memory and open it with pypdf
    c = io.BytesIO(requests.get(url=url).content)
    pdf = pypdf.PdfReader(c)

    text = ""

    # Concatenate the extracted text of every page
    for page in pdf.pages:
        text += page.extract_text() + "\n"

    # Strip NUL characters from the extracted text
    text = text.replace('\x00', '')


url = 'https://disclosures-clerk.house.gov/public_disc/ptr-pdfs/2022/20022132.pdf'

text
Out[268]: 'P T R\nClerk of the House of Representatives • Legislative Resource Center • 135 Cannon Building • Washington, DC 20515\nF I\nName: Hon. Robert B. Aderholt\nStatus: Member\nState/District: AL04\nT\nID OwnerAsset Transaction\nTypeDate Notification\nDateAmount Cap .\nGains >\n$200?\nDC Tesla, Inc. (TSLA)  [ST] S 12/05/2022 12/05/2022 $1,001 - $15,000\nF S: New\n* For the complete list of asset type abbreviations, please visit https://fd.house.gov/reference/asset-type-codes.aspx .\nI P O\n Yes  No\nC  S\n I CERTIFY that the statements I have made on the attached Periodic Transaction Report are true, complete, and correct to the best of\nmy knowledge and belief. Further, I CERTIFY that I have disclosed all transactions as required by the STOCK Act.\nDigitally Signed: Hon. Robert B. Aderholt , 12/13/2022Filing ID #20022132\n'
url
Out[264]: 'https://disclosures-clerk.house.gov/public_disc/ptr-pdfs/2022/20020448.pdf'

text
Out[265]: 'PerioDic t ranSaction r ePortClerk of the House of Re\npresentatives • legislative Resource Center • 135 Cannon building • Washington, DC 20515f\niler informationname:\nHon. Richard W. Allen Status:\nMember State/District:\ngA12 t\nranSactionSiD\nowner asset transaction type\nDatenotification Date\namountcap. gains >\n$200?\nSP\nPutnam ultra Short Duration Income A (PSDTX)\n [Ab] S01/25/2022 01/28/2022 $50,001 - $100,000\ngfedcF\nIlINg S TATuS: NewS\nubHOlDINg O F: Charles Schwab IRASP\nStarbucks Corporation (SbuX) [ST]\nS01/25/2022 01/28/2022 $15,001 - $50,000\ngfedcF\nIlINg S TATuS: NewS\nubHOlDINg O F: Charles Schwab IRASP\nWalt Disney Company (DIS)  [ST] S 01/25/2022 01/28/2022 $1,001 - $15,000 gfedcbF\nIlINg S TATuS: NewS\nubHOlDINg O F: Charles Schwab IRA* For the complete list of asset type abbreviations, pl\nease visit https://fd.house.gov/reference/ asset-type-codes.aspx. a\nSSet claSS DetailSCharles Schwab IRA (Owner: SP)\ni\nnitial P ublic o fferingSn\nmlkj Yes nmlkji Noc\nertification anD Signatureg\nfedcb I CERTIFY that the statements I have made on the attached Periodic Transaction Report are true, complete, and correct to theFiling ID #20020448\nbest of my knowledge and belief. Further, I CERTIFY that I have disclosed all transactions as required by the STOCK Act.Digitally Signed: \nHon. Richard W. Allen , 02/13/2022 \n'

System Details:

chris@chris-X1C6:~$ python -m platform
Linux-6.5.0-28-generic-x86_64-with-glibc2.35
chris@chris-X1C6:~$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==4.2.0, crypt_provider=('cryptography', '37.0.1'), PIL=9.2.0

stefan6419846 commented 5 months ago

Poppler/pdftotext and pdf.js show the same pattern, thus it seems to be related to how the text layer has been generated - I do not think that there is much we can do about this. 20020448.pdf reports no generator, while 20022132.pdf states EO.Pdf 21.3.18.0.
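
For anyone who wants to check which tool generated their own files, here is a minimal sketch. It assumes that "generator" refers to the `/Producer` and `/Creator` entries of the document information dictionary, which pypdf exposes as `reader.metadata.producer` and `reader.metadata.creator`. The helper names `summarize_generator` and `read_generator` are made up for illustration.

```python
import io


def summarize_generator(metadata):
    """Return the producer/creator strings from a pypdf metadata object.

    `metadata` may be None when the file has no document information.
    """
    if metadata is None:
        return {"producer": None, "creator": None}
    return {
        "producer": getattr(metadata, "producer", None),
        "creator": getattr(metadata, "creator", None),
    }


def read_generator(pdf_bytes):
    """Open in-memory PDF bytes with pypdf and report what generated them."""
    import pypdf  # imported lazily so the helper above works without pypdf

    reader = pypdf.PdfReader(io.BytesIO(pdf_bytes))
    return summarize_generator(reader.metadata)
```

Grouping a batch of files by the reported producer may show whether the per-year differences in extraction line up with a change of generating software.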

cppt commented 5 months ago

@stefan6419846, thanks. Is there any reason to believe a different PDF parsing module would give different results in this situation, based on your understanding of the implementation and its limitations?

stefan6419846 commented 5 months ago

I have tested this with Poppler/pdftotext, pdf.js and MuPDF. Each of them uses a different parser, but the output is basically the same as for pypdf. Thus I would argue that this is related to how the PDF files and their text layers have been generated, and that it is rather unlikely to be easily fixable.
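
A cross-check like this can be scripted. The sketch below is an illustration, not the procedure actually used for the comparison above. It shells out to Poppler's `pdftotext` if it is installed, and scores how close two extractions are once whitespace is normalized. The function names are hypothetical.

```python
import difflib
import shutil
import subprocess


def normalize(text):
    """Collapse all whitespace runs so layout differences do not dominate."""
    return " ".join(text.split())


def similarity(text_a, text_b):
    """Return a ratio in [0, 1]; 1.0 means identical after normalization."""
    matcher = difflib.SequenceMatcher(None, normalize(text_a), normalize(text_b))
    return matcher.ratio()


def pdftotext_extract(path):
    """Extract text with Poppler's pdftotext, or return None if it is missing."""
    if shutil.which("pdftotext") is None:
        return None
    result = subprocess.run(
        ["pdftotext", path, "-"],  # "-" writes the text to stdout
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

If the text from pypdf and the text from `pdftotext` score close to 1.0 for a problematic file, that points at the file's text layer rather than at either parser.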

pubpub-zz commented 5 months ago

Text extraction uses the /ToUnicode entry, which provides the conversion from character codes (not always ASCII/UTF codes) to Unicode. This is entirely independent of the "printing" rendering. Scrambling or modifying this entry will disrupt most text extraction and copy-paste capabilities.
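
To make that separation concrete, here is a toy model in plain Python. It is not pypdf's implementation, and real fonts and CMaps are far more involved. All the mappings are invented for illustration. Whether the House PDFs are affected in exactly this way is not established in this thread.

```python
# Toy model only: character codes as they might appear in a content stream.
codes = [0x01, 0x02, 0x03, 0x04, 0x05]

# What the embedded font draws for each code (this is what you see on screen).
glyphs = {0x01: "F", 0x02: "I", 0x03: "L", 0x04: "E", 0x05: "R"}

# Two hypothetical /ToUnicode-style mappings for the same codes.
to_unicode_faithful = {0x01: "F", 0x02: "I", 0x03: "L", 0x04: "E", 0x05: "R"}
to_unicode_scrambled = {0x01: "f", 0x02: "I", 0x03: "l", 0x04: "e", 0x05: "r"}


def render(char_codes):
    """Simulate display: only the font's glyphs matter."""
    return "".join(glyphs[c] for c in char_codes)


def extract(char_codes, to_unicode):
    """Simulate text extraction: only the code-to-Unicode mapping matters."""
    # Codes missing from the mapping become U+FFFD (replacement character).
    return "".join(to_unicode.get(c, "\ufffd") for c in char_codes)
```

Here `render(codes)` gives `FILER` regardless of which mapping is present, while `extract(codes, to_unicode_scrambled)` gives `fIler`. The page looks right, but the extracted text has erratic capitalization.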

pubpub-zz commented 5 months ago

Should we close this issue?

stefan6419846 commented 5 months ago

As an additional data point: these PDF files use an owner password and, according to pdfinfo, disallow everything except printing. Thus the only way to get around this might be OCR, but that is out of scope for pypdf, so I am going to close this issue.
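
For reference, the permission flags that pdfinfo reports can also be inspected from Python. The sketch below decodes the `/P` integer of the encryption dictionary, using the bit positions defined for the standard security handler in the PDF specification. Reading the value via `reader.trailer["/Encrypt"]["/P"]` is an assumption about how the file is laid out, and the names in `PERMISSION_BITS` are labels chosen for this example.

```python
import io

# Permission bits of the /P entry, per the PDF specification.
# Bit positions are 1-based, so bit 3 has the value 1 << 2.
PERMISSION_BITS = {
    "print": 3,
    "modify": 4,
    "extract": 5,
    "annotate_or_fill_forms": 6,
    "fill_forms": 9,
    "extract_for_accessibility": 10,
    "assemble": 11,
    "print_high_quality": 12,
}


def decode_permissions(p):
    """Return the names of the operations that the /P value allows."""
    return [name for name, bit in PERMISSION_BITS.items() if p & (1 << (bit - 1))]


def read_permissions(pdf_bytes):
    """Return the allowed operations, or None if the file is not encrypted."""
    import pypdf  # imported lazily so decode_permissions works without pypdf

    reader = pypdf.PdfReader(io.BytesIO(pdf_bytes))
    if not reader.is_encrypted:
        return None
    # Assumption: the trailer's /Encrypt dictionary carries the /P entry.
    return decode_permissions(int(reader.trailer["/Encrypt"]["/P"]))
```

For example, `decode_permissions(-3900)` returns `["print"]`, the kind of print-only setting described above.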