py-pdf / pypdf

A pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files
https://pypdf.readthedocs.io/en/latest/

`DocumentInformation.title` sometimes returns `bytes` instead of `str` #2929

Open reformy opened 1 week ago

reformy commented 1 week ago

I am reading a PDF file from: https://www.ms-ad-hd.com/en/ir/ir_event/event/presentation/main/01111119/teaserItems1/00/linkList/00/link/20220810Tranverse%20QA%20Summary.pdf The "title" for this doc returns bytes instead of str, although the method should always return str.

Environment

Python 3.11

$ python -m platform
macOS-14.4.1-arm64-arm-64bit

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==4.0.1, crypt_provider=('cryptography', '41.0.7'), PIL=9.5.0

Code + PDF

import io
import pypdf
import requests

response = requests.get('https://www.ms-ad-hd.com/en/ir/ir_event/event/presentation/main/01111119/teaserItems1/00/linkList/00/link/20220810Tranverse%20QA%20Summary.pdf')
pdf_reader = pypdf.PdfReader(io.BytesIO(response.content))
print(pdf_reader.metadata.title)

reformy commented 1 week ago

Happens in pypdf 5.1.0 too.

stefan6419846 commented 1 week ago

Thanks for the report. I just had a look at it. Debugging the data shows:

>>> r.metadata.get('/Title')
b'Microsoft Word - \xe3\x83\x88\xe3\x83\xa9\xe3\x83\xb3\xe3\x82\xb9\xe3\x83\x90\xe3\x83\xbc\xe3\x82\xb9\xe7\xa4\xbe\xe8\xb2\xb7\xe5\x8f\x8e\xe9\x9b\xbb\xe8\xa9\xb1\xe4\xbc\x9a\xe8\xad\xb0\xe8\x8b\xb1\xe8\xaa\x9eFinal.docx'
>>> type(r.metadata.get('/Title'))
<class 'pypdf.generic._base.ByteStringObject'>
>>> r.metadata.get('/Title').decode()
'Microsoft Word - トランスバース社買収電話会議英語Final.docx'
>>> 

We should probably extend https://github.com/py-pdf/pypdf/blob/98aa9742e757ec428e1953ba7f47c6d7c44b331a/pypdf/_doc_common.py#L120-L124 to support ByteStringObjects as well and try to decode them with the most common encodings (I used UTF-8 above). If this fails, we should raise a pypdf-specific error.
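As a rough illustration of that idea, and not pypdf's actual implementation, here is a minimal sketch of a fallback decoder. The names `decode_title` and `TitleDecodeError` are hypothetical, and the encoding list is an assumption; only UTF-8 is known to work for the title shown above.

```python
class TitleDecodeError(ValueError):
    """Hypothetical stand-in for a pypdf-specific error."""


def decode_title(value, encodings=("utf-8",)):
    """Return value as str, trying each encoding in order for bytes input."""
    # Values that are already text pass through unchanged.
    if isinstance(value, str):
        return str(value)
    raw = bytes(value)
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue  # try the next candidate encoding
    raise TitleDecodeError(f"could not decode title with any of {encodings!r}")


# The raw bytes reported for '/Title' in this issue.
raw_title = (
    b"Microsoft Word - \xe3\x83\x88\xe3\x83\xa9\xe3\x83\xb3\xe3\x82\xb9"
    b"\xe3\x83\x90\xe3\x83\xbc\xe3\x82\xb9\xe7\xa4\xbe\xe8\xb2\xb7\xe5\x8f\x8e"
    b"\xe9\x9b\xbb\xe8\xa9\xb1\xe4\xbc\x9a\xe8\xad\xb0\xe8\x8b\xb1\xe8\xaa\x9e"
    b"Final.docx"
)
title = decode_title(raw_title)
print(type(title).__name__, title)
```

Raising instead of returning the raw bytes keeps the return type of the property consistent, which is the behaviour the report expects.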

Do you want to submit a corresponding PR?

reformy commented 6 days ago

I did a different fix. It seems that create_string_object (after failing to find a BOM) tries to decode the bytes as UTF-16 and others, but not UTF-8, so I've added an attempt to do that.

pubpub-zz commented 6 days ago

> I did a different fix. It seems that create_string_object (after failing to find a BOM) tries to decode the bytes as UTF-16 and others, but not UTF-8, so I've added an attempt to do that.

Strings are normally expected to be encoded in PDFDocEncoding or UTF-16BE (see section 3.8.1 of the PDF Reference 1.7).
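For context, a UTF-16BE text string is conventionally recognised by a leading byte order mark (bytes FE FF). A minimal sketch of that check, independent of pypdf (the helper name `decode_if_utf16be` is hypothetical):

```python
import codecs


def decode_if_utf16be(raw: bytes):
    """Decode raw as UTF-16BE if it starts with the FE FF byte order mark.

    Returns None when no BOM is present, leaving the choice of another
    encoding to the caller.
    """
    bom = codecs.BOM_UTF16_BE  # b"\xfe\xff"
    if raw.startswith(bom):
        return raw[len(bom):].decode("utf-16-be")
    return None


with_bom = codecs.BOM_UTF16_BE + "Title".encode("utf-16-be")
without_bom = "トランスバース".encode("utf-8")
print(decode_if_utf16be(with_bom))
print(decode_if_utf16be(without_bom))
```

The second input stands for the kind of string in this report: no BOM, so the UTF-16BE path does not apply.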

reformy commented 5 days ago

The file I've found doesn't work with either. I can move the part that tries UTF-8 to be AFTER pdfdocencoding if that makes more sense.
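To make the ordering question concrete, here is a small sketch of an ordered fallback chain. It is not the code from my fix: `first_successful_decode` is a hypothetical helper, and strict ASCII is used only as a stand-in for the PDFDocEncoding step, since the standard library has no PDFDocEncoding codec.

```python
from typing import Callable, Optional, Sequence, Tuple

Decoder = Tuple[str, Callable[[bytes], str]]


def first_successful_decode(
    raw: bytes, decoders: Sequence[Decoder]
) -> Optional[Tuple[str, str]]:
    """Try each decoder in order; return (decoder name, text) for the first success."""
    for name, decode in decoders:
        try:
            return name, decode(raw)
        except UnicodeDecodeError:
            continue  # fall through to the next decoder
    return None  # nothing worked; the caller decides how to report it


def strict_ascii(raw: bytes) -> str:
    # Stand-in for the PDFDocEncoding attempt (assumption, not the real codec).
    return raw.decode("ascii")


def strict_utf8(raw: bytes) -> str:
    return raw.decode("utf-8")


# UTF-8 is placed AFTER the spec-mandated encoding, as proposed above.
decoders = [("single-byte stand-in", strict_ascii), ("utf-8", strict_utf8)]
plain = first_successful_decode(b"Plain title", decoders)
japanese = first_successful_decode("トランスバース".encode("utf-8"), decoders)
print(plain)
print(japanese)
```

With this order, strings that the first decoder accepts are unaffected, and UTF-8 is only consulted for input the earlier step rejects.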