Non-UTF-8 hyperlinks in PDFs raise a UnicodeDecodeError, and PdfPlumber then fails to return any links for the affected document.
Code to reproduce the problem
import pdfplumber

def extract_urls(self):
    with pdfplumber.open(self._file) as pdf:
        return {uri_obj['uri'] for uri_obj in pdf.hyperlinks}
Expected behavior
Problematic links should be ignored and the remaining links should be returned.
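For illustration, the behavior I am after can be roughly approximated on the caller side. This is only a sketch: `collect_hyperlink_uris` is a hypothetical helper of mine, not part of PdfPlumber. It relies on `pdf.pages` and `page.hyperlinks`, both of which appear in the traceback. It works per page, so it drops every link on a page that contains one undecodable URI rather than just the problematic link.

```python
def collect_hyperlink_uris(pages):
    """Collect hyperlink URIs page by page.

    Pages whose annotations cannot be decoded are skipped, and their
    indexes are reported so the caller knows the result is incomplete.
    """
    uris = set()
    skipped = []
    for index, page in enumerate(pages):
        try:
            # Accessing .hyperlinks is what triggers the decoding.
            links = page.hyperlinks
        except UnicodeDecodeError:
            skipped.append(index)
            continue
        uris.update(link["uri"] for link in links)
    return uris, skipped
```

Inside `extract_urls` this would be called as `uris, skipped = collect_hyperlink_uris(pdf.pages)` in place of reading `pdf.hyperlinks` directly.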
Actual behavior
PdfPlumber raises a UnicodeDecodeError with the following traceback:
  File "...", line 184, in extract_urls
    return {uri_obj['uri'] for uri_obj in pdf.hyperlinks}
  File ".../pdf.py", line 98, in hyperlinks
    return list(itertools.chain(*gen))
  File ".../pdf.py", line 97, in <genexpr>
    gen = (p.hyperlinks for p in self.pages)
  File ".../page.py", line 155, in hyperlinks
    return [a for a in self.annots if a["uri"] is not None]
  File ".../page.py", line 151, in annots
    return list(map(parse, raw))
  File ".../page.py", line 127, in parse
    extras[k] = v.decode("utf-8")
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf6 in position 8: invalid start byte
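The failure comes from the strict `v.decode("utf-8")` call in `parse`. A more tolerant decode might look like the sketch below. It is a suggestion, not the library's actual code. `decode_lenient` is a hypothetical name, and the sample URL is made up so that byte 0xf6 sits at position 8, as in the traceback. The Latin-1 fallback is my assumption about the real encoding (0xf6 is "ö" there), and it may be wrong for other documents.

```python
def decode_lenient(value: bytes) -> str:
    """Decode bytes as UTF-8, falling back to Latin-1 on failure.

    Latin-1 maps every byte to a code point, so the fallback never
    raises, although it can produce the wrong characters if the data
    is actually in some other encoding.
    """
    try:
        return value.decode("utf-8")
    except UnicodeDecodeError:
        return value.decode("latin-1")


# Made-up URI bytes with a Latin-1 encoded "ö" (0xf6) at index 8.
raw_uri = b"http://b\xf6rse.example"
decoded_uri = decode_lenient(raw_uri)
```

Returning `None` for the URI instead of falling back would also match the expected behavior, since `hyperlinks` already filters out annotations whose `uri` is `None`.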
Environment