jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License

Extracting hyperlinks raises UnicodeDecodeError #506

Closed by devWhyqueue 3 years ago

devWhyqueue commented 3 years ago

Describe the bug

Hyperlinks in a PDF that are not valid UTF-8 raise a UnicodeDecodeError, and pdfplumber then fails to return any of the document's links.

Code to reproduce the problem

import pdfplumber

def extract_urls(self):
    # Collect the unique URIs of all hyperlink annotations in the document.
    with pdfplumber.open(self._file) as pdf:
        return {uri_obj['uri'] for uri_obj in pdf.hyperlinks}

Expected behavior

Problematic links should be ignored and the remaining links should be returned.
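
To make the expected behavior concrete, here is a minimal sketch that is independent of pdfplumber. It assumes the raw URI values are available as bytes; `decodable_uris` and the sample data are hypothetical and only illustrate "skip what cannot be decoded, return the rest".

```python
from typing import Iterable, Set


def decodable_uris(raw_uris: Iterable[bytes]) -> Set[str]:
    """Return the URIs that decode as UTF-8, skipping the rest.

    Hypothetical helper for illustration; not part of pdfplumber.
    """
    urls = set()
    for raw in raw_uris:
        try:
            urls.add(raw.decode("utf-8"))
        except UnicodeDecodeError:
            # Ignore the problematic link and keep the others.
            continue
    return urls


# Made-up sample data: the second value contains the byte 0xf6.
sample = [
    b"https://example.com/a",
    b"https://\xf6.example",
    b"https://example.com/b",
]
result = decodable_uris(sample)
```

With this input, only the two well-formed URIs end up in `result`; the undecodable one is dropped instead of aborting the whole extraction.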

Actual behavior

pdfplumber raises a UnicodeDecodeError with the following traceback:

  File "...", line 184, in extract_urls
    return {uri_obj['uri'] for uri_obj in pdf.hyperlinks}
  File ".../pdf.py", line 98, in hyperlinks
    return list(itertools.chain(*gen))
  File ".../pdf.py", line 97, in <genexpr>
    gen = (p.hyperlinks for p in self.pages)
  File ".../page.py", line 155, in hyperlinks
    return [a for a in self.annots if a["uri"] is not None]
  File ".../page.py", line 151, in annots
    return list(map(parse, raw))
  File ".../page.py", line 127, in parse
    extras[k] = v.decode("utf-8")
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf6 in position 8: invalid start byte
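
The last frame shows a strict `v.decode("utf-8")` on an annotation value. The byte 0xf6 is never valid in UTF-8, but it is `ö` in Latin-1, so the URI may have been written in a single-byte encoding (that is a guess; the traceback does not say). The sketch below reproduces the same kind of failure on made-up bytes and shows one lenient alternative. `decode_lenient` is a hypothetical helper, not pdfplumber's behavior, and the Latin-1 fallback is an assumption that can produce wrong characters for other encodings.

```python
def decode_lenient(value: bytes) -> str:
    """Decode as UTF-8, falling back to Latin-1 (which accepts any byte).

    Hypothetical helper for illustration; not part of pdfplumber.
    """
    try:
        return value.decode("utf-8")
    except UnicodeDecodeError:
        return value.decode("latin-1")


# Made-up bytes with 0xf6 at index 8, mirroring the position in the traceback.
raw = b"https://\xf6.example"

try:
    raw.decode("utf-8")
    strict_error = None
except UnicodeDecodeError as exc:
    # Strict decoding fails, as in page.py's parse step above.
    strict_error = exc

lenient = decode_lenient(raw)
```

Whether to fall back to another encoding or to skip the value entirely is a design choice: a fallback keeps every link but may mis-render it, while skipping matches the expected behavior described in this report.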

Environment