Open reformy opened 1 week ago
Happens in pypdf 5.1.0 too.
Thanks for the report. I just had a look at it. Debugging the data shows:
>>> r.metadata.get('/Title')
b'Microsoft Word - \xe3\x83\x88\xe3\x83\xa9\xe3\x83\xb3\xe3\x82\xb9\xe3\x83\x90\xe3\x83\xbc\xe3\x82\xb9\xe7\xa4\xbe\xe8\xb2\xb7\xe5\x8f\x8e\xe9\x9b\xbb\xe8\xa9\xb1\xe4\xbc\x9a\xe8\xad\xb0\xe8\x8b\xb1\xe8\xaa\x9eFinal.docx'
>>> type(r.metadata.get('/Title'))
<class 'pypdf.generic._base.ByteStringObject'>
>>> r.metadata.get('/Title').decode()
'Microsoft Word - トランスバース社買収電話会議英語Final.docx'
>>>
We should probably extend https://github.com/py-pdf/pypdf/blob/98aa9742e757ec428e1953ba7f47c6d7c44b331a/pypdf/_doc_common.py#L120-L124 to support ByteStringObjects as well and try to decode them with the most common encodings (I use UTF-8 above). If this fails, we should raise a pypdf-specific error.
Do you want to submit a corresponding PR?
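For illustration, the suggestion above could look roughly like the sketch below. It is not pypdf code: metadata_value_to_str and MetadataDecodeError are hypothetical names, the default list of encodings is an assumption, and plain bytes stand in for ByteStringObject (which, as the session above shows, supports .decode()).

```python
from typing import Sequence, Union


class MetadataDecodeError(ValueError):
    """Hypothetical stand-in for a pypdf-specific error (not a real pypdf class)."""


def metadata_value_to_str(
    value: Union[str, bytes],
    encodings: Sequence[str] = ("utf-8", "utf-16"),  # assumed candidate list
) -> str:
    """Return value as str, trying each candidate encoding in order."""
    if isinstance(value, str):
        # Already text, nothing to decode.
        return value
    for encoding in encodings:
        try:
            return value.decode(encoding)
        except UnicodeDecodeError:
            # This encoding does not fit, try the next candidate.
            continue
    raise MetadataDecodeError(
        f"Could not decode metadata value with any of {list(encodings)}"
    )
```

Called with the title bytes from the debugging session, this would return the decoded str instead of bytes. Note that a permissive codec such as UTF-16 can "succeed" on arbitrary even-length input, so the order of candidates matters.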
I did a different fix. It seems that create_string_object (after failing to find a BOM) tries to decode the bytes as UTF-16 and others, but not UTF-8. So I've added an attempt to do that.
Strings are normally expected to be encoded in PDFDocEncoding or UTF-16BE (see section 3.8.1 of the PDF Reference 1.7).
The file I've found doesn't work with either. I can move the part that tries UTF-8 to be AFTER pdfdocencoding if that makes more sense.
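To make the ordering question concrete, here is a minimal sketch of the "UTF-8 after PDFDocEncoding" variant. It is not the actual create_string_object: decode_text_string is a hypothetical name, the PDFDocEncoding decoder is passed in as a parameter because Python has no built-in codec for it, and ascii_stand_in is only a strict placeholder used to exercise the fallback.

```python
import codecs
from typing import Callable, Union


def decode_text_string(
    raw: bytes,
    decode_pdfdoc: Callable[[bytes], str],  # assumed to raise UnicodeDecodeError on failure
) -> Union[str, bytes]:
    """Hypothetical decode order: BOM-marked UTF-16, PDFDocEncoding, then UTF-8."""
    # 1. A byte order mark means the string is UTF-16.
    if raw.startswith((codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)):
        return raw.decode("utf-16")
    # 2. The encoding the PDF reference expects when there is no BOM.
    try:
        return decode_pdfdoc(raw)
    except UnicodeDecodeError:
        pass
    # 3. Non-standard but seen in the wild: raw UTF-8 without a BOM.
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Nothing matched, hand back the undecoded bytes.
        return raw


def ascii_stand_in(raw: bytes) -> str:
    """Placeholder only: strict ASCII, NOT a real PDFDocEncoding decoder."""
    return raw.decode("ascii")
```

With this order, any input that is valid in both encodings is read as PDFDocEncoding, so spec-conforming files keep their current behaviour and UTF-8 is only tried for strings that would otherwise come back as bytes.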
I am reading a PDF file from: https://www.ms-ad-hd.com/en/ir/ir_event/event/presentation/main/01111119/teaserItems1/00/linkList/00/link/20220810Tranverse%20QA%20Summary.pdf The "title" for this doc returns bytes instead of str, although the method should always return str.
Environment
Python 3.11
Code + PDF