tungph closed this issue 2 years ago
@jsvine @samkit-jain
To work around the problem, I fall back to decoding with UTF-16 if UTF-8 fails:
# page.py@125
for k, v in extras.items():
    if v is not None:
        try:
            extras[k] = v.decode('utf-8')
        except UnicodeDecodeError:
            extras[k] = v.decode('utf-16')
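For anyone who wants to try the fallback outside of pdfplumber, here is a minimal standalone sketch of the same idea. The helper name `decode_extras` and the sample dictionary are made up for illustration and are not part of pdfplumber. The error in the original report (byte 0xfe at position 0) is consistent with a value that starts with the UTF-16 big-endian byte order mark `FE FF`, which Python's "utf-16" codec detects and strips.

```python
def decode_extras(extras: dict) -> dict:
    """Hypothetical helper: decode bytes values, trying UTF-8 then UTF-16."""
    decoded = {}
    for key, value in extras.items():
        if isinstance(value, bytes):
            try:
                value = value.decode("utf-8")
            except UnicodeDecodeError:
                # b"\xfe\xff" is the UTF-16 big-endian byte order mark; the
                # "utf-16" codec consumes it and selects the byte order.
                value = value.decode("utf-16")
        decoded[key] = value
    return decoded


# Made-up sample data: one ASCII value, one UTF-16BE value with a BOM,
# and one missing value that is passed through untouched.
sample = {
    "title": b"Note",
    "contents": b"\xfe\xff" + "日本語".encode("utf-16-be"),
    "uri": None,
}
print(decode_extras(sample))
```

This version builds a new dictionary instead of reassigning values while iterating, which is only a style choice; the decoding logic is the same as in the snippet above.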
Thank you for flagging this @tungph. I will look into this.
Got the same problem trying to extract hyperlinks.
The following code:
def extract_urls(self):
    with pdfplumber.open(self._file) as pdf:
        return {uri_obj['uri'] for uri_obj in pdf.hyperlinks}
raises a UnicodeDecodeError with this traceback:
  File "...", line 184, in extract_urls
    return {uri_obj['uri'] for uri_obj in pdf.hyperlinks}
  File ".../pdf.py", line 98, in hyperlinks
    return list(itertools.chain(*gen))
  File ".../pdf.py", line 97, in <genexpr>
    gen = (p.hyperlinks for p in self.pages)
  File ".../page.py", line 155, in hyperlinks
    return [a for a in self.annots if a["uri"] is not None]
  File ".../page.py", line 151, in annots
    return list(map(parse, raw))
  File ".../page.py", line 127, in parse
    extras[k] = v.decode("utf-8")
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf6 in position 8: invalid start byte
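The same kind of error can be reproduced without a PDF. The sketch below uses a made-up byte string (the real bytes in the affected file are unknown) with a lone 0xf6 at offset 8. It also shows that a plain UTF-16 decode may not help for a value like this, because the length is odd and there is no byte order mark. Decoding as Latin-1 is shown only as one possible last resort, and it assumes the byte was meant as a single-byte character.

```python
# Made-up value: 0xf6 sits at byte offset 8, after the 8 ASCII bytes "https://".
raw = b"https://\xf6"

# UTF-8 rejects 0xf6 because it can never start a valid sequence.
try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    error_offset = exc.start
    print(exc)

# The value is 9 bytes long, so it cannot be split into 2-byte UTF-16 units.
try:
    raw.decode("utf-16")
    utf16_ok = True
except UnicodeDecodeError:
    utf16_ok = False

# Latin-1 maps every byte to a character, so this never raises, but the
# result is only correct if the data really was a single-byte encoding.
latin1_text = raw.decode("latin-1")
print(error_offset, utf16_ok, latin1_text)
```

So a UTF-8 then UTF-16 fallback covers values that carry a UTF-16 byte order mark, but a third fallback may still be needed for single-byte data like the sample here.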
Would highly appreciate a fix in upcoming versions!
Thanks again @tungph for raising this issue and the suggested fix, and @devWhyqueue for seconding. The commit above should fix this once merged and should be available in the next release.
This fix is now part of the latest release, v0.6.0.
Describe the bug
While trying to get contents from an annotation, I got this error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfe in position 0: invalid start byte
The content for the first annotation is in Japanese.
Code to reproduce the problem
PDF file
test.pdf
Expected behavior
Environment
Additional context
The content for the annotations is in both English and Japanese