Your documents have been protected against copying: pypdf reports the same results as other viewers. There is nothing to be done except to e-print or scan them and run OCR.
This has been reported many times.
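For readers who want to detect this situation before falling back to OCR, here is a minimal sketch of a garbled-text check. It assumes the markers in the output look like `(cid:123)`, which is a guess because the thread does not show them. The helper names `looks_garbled` and `pages_needing_ocr` and the `0.3` threshold are hypothetical. Only `PdfReader`, `reader.pages`, and `page.extract_text()` are pypdf calls. The check will not catch protection schemes that remap glyphs to other ordinary letters.

```python
import re

# Assumed marker shape "(cid:123)"; the exact form in the reported output is not shown.
CID_MARKER = re.compile(r"\(cid:\d+\)")

# Punctuation treated as normal in readable text.
ORDINARY_PUNCTUATION = ".,;:!?'\"()-"


def looks_garbled(text, threshold=0.3):
    """Heuristic: True when a large share of the non-whitespace text is unreadable."""
    stripped = "".join(text.split())
    if not stripped:
        return False
    # Characters consumed by cid markers.
    cid_chars = sum(len(marker) for marker in CID_MARKER.findall(stripped))
    # Characters outside the markers that are neither alphanumeric nor ordinary punctuation.
    remainder = CID_MARKER.sub("", stripped)
    odd_chars = sum(
        1 for ch in remainder
        if not (ch.isalnum() or ch in ORDINARY_PUNCTUATION)
    )
    return (cid_chars + odd_chars) / len(stripped) >= threshold


def pages_needing_ocr(path):
    """Return zero-based indexes of pages whose extracted text looks garbled."""
    from pypdf import PdfReader  # imported here so the heuristic runs without pypdf

    reader = PdfReader(path)
    return [
        index
        for index, page in enumerate(reader.pages)
        if looks_garbled(page.extract_text() or "")
    ]
```

Pages flagged this way would then go through the e-print/scan and OCR route described above. The threshold is a starting point and likely needs tuning per document.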
Issue
When extracting text from the attached PDF, the output contains garbled characters. Additionally, when I tried copying & pasting the content from other PDF viewers, similar issues occurred.
I'm unsure if this is related to encoding settings or if there's a way to correct the extraction process. Any guidance or fixes would be appreciated.
The PDFs are attached:
king arthur.pdf
The Phantom of the Opera.pdf
pypdf version
pypdf==4.3.1
Code Sample:
Output
The extracted content includes cid values such as: