Closed cppt closed 5 months ago
Poppler/pdftotext and pdf.js show the same pattern, thus it seems to be related to how the text layer has been generated - I do not think that there is much we can do about this. 20020448.pdf reports no generator, while 20022132.pdf states EO.Pdf 21.3.18.0.
@stefan6419846, thanks. Any reason to believe a different PDF parsing module would give different results in this situation, based on your understanding of the implementation and its limitations?
I have tested this with Poppler/pdftotext, pdf.js and MuPDF. Each of them uses a different parser, but the output is basically the same as for pypdf. Thus I would argue that this is related to how the PDF files and their text layers have been generated, and it is rather unlikely to be fixable in an easy manner.
Text extraction uses the /ToUnicode entry, which provides the conversion from character codes (not always ASCII/UTF codes) to Unicode. This is entirely independent of the rendering used for printing. Scrambling or modifying this entry will break most text extraction and copy-paste capabilities.
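To illustrate the point above, here is a minimal sketch of how a ToUnicode mapping drives extraction. It is not pypdf's implementation, and it is not taken from the files in this issue: the two CMap fragments, the one-byte character codes, and the helper names `parse_bfchar` and `decode` are all assumptions made up for the example. Only simple `bfchar` entries are handled.

```python
import re

# Hypothetical, heavily simplified ToUnicode CMap fragments.
# Real CMaps also contain bfrange sections, codespace ranges, etc.
GOOD_CMAP = """
3 beginbfchar
<01> <0048>
<02> <0069>
<03> <0021>
endbfchar
"""

# Same character codes (so the same glyphs are drawn), but the
# Unicode targets have been altered.
SCRAMBLED_CMAP = """
3 beginbfchar
<01> <0068>
<02> <0049>
<03> <0021>
endbfchar
"""


def parse_bfchar(cmap_text: str) -> dict[int, str]:
    """Build a character-code -> Unicode string map from bfchar entries."""
    mapping: dict[int, str] = {}
    for block in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, flags=re.S):
        for src, dst in re.findall(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>", block):
            # Destination values in a ToUnicode CMap are UTF-16BE encoded.
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    return mapping


def decode(codes: bytes, mapping: dict[int, str]) -> str:
    """Translate one-byte character codes; unknown codes become U+FFFD."""
    return "".join(mapping.get(code, "\ufffd") for code in codes)


# The content stream would show these three codes in both cases.
content_codes = bytes([0x01, 0x02, 0x03])

print(decode(content_codes, parse_bfchar(GOOD_CMAP)))       # Hi!
print(decode(content_codes, parse_bfchar(SCRAMBLED_CMAP)))  # hI!
```

The page would look identical in both cases because the glyphs are selected by the character codes, while any extractor that trusts the mapping returns the altered text. That would be consistent with every parser producing the same output, although this sketch does not show what actually happened in the two files.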
Should we close this issue?
As an additional data point: these PDF files use an owner password and discourage everything except printing when inspected with pdfinfo. Thus the only way to get around this might be OCR, but this is out of scope for pypdf and therefore I am going to close this issue.
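For readers wondering what "everything except printing" means at the file level: the standard security handler stores the permissions as a signed 32-bit integer (the /P entry of the encryption dictionary), with individual bits defined in the PDF specification. The following is a rough sketch of decoding such a value. The `Permission` names and the `allowed` helper are made up for this example, and the value -3900 is a hypothetical one chosen for illustration, not read from the files in this issue.

```python
from enum import IntFlag


class Permission(IntFlag):
    """User access permission bits of the standard security handler.

    The PDF specification numbers bits from 1, so "bit 3" is 1 << 2.
    """

    PRINT = 1 << 2                   # bit 3
    MODIFY = 1 << 3                  # bit 4
    EXTRACT = 1 << 4                 # bit 5: copy or extract text/graphics
    ANNOTATE = 1 << 5                # bit 6
    FILL_FORMS = 1 << 8              # bit 9
    EXTRACT_ACCESSIBILITY = 1 << 9   # bit 10
    ASSEMBLE = 1 << 10               # bit 11
    PRINT_HIGH_QUALITY = 1 << 11     # bit 12


def allowed(p: int) -> list[str]:
    """Return the names of the permissions granted by a /P value."""
    bits = p & 0xFFFFFFFF  # interpret the signed value as 32 unsigned bits
    return [perm.name for perm in Permission if bits & perm.value]


# Hypothetical /P value that grants printing only.
print(allowed(-3900))  # ['PRINT']
```

These flags are a request to conforming viewers rather than a technical barrier, and clearing the extraction bit does not by itself change the text layer.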
I'm looking to parse a collection of PDFs with similar format but notice results that are inconsistent.
For instance, this file is parsed by pypdf in a way I can make sense of: https://disclosures-clerk.house.gov/public_disc/ptr-pdfs/2022/20022132.pdf while this file is parsed resulting in text that's formatted wildly differently: https://disclosures-clerk.house.gov/public_disc/ptr-pdfs/2022/20020448.pdf
For example, capitalization is erratic despite the file taking a format very similar to the first.
It does appear there's 'consistency' for a given year, but not over time (i.e., 2023 files are parsed consistently, but differently from 2021 files). Any guidance on what could be causing this or what could be improved?
Using the below Python code for reference. Output for the two files referenced below as well.
System Details: