jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License

Cannot decode contents of annotations #463

Status: Closed (tungph closed this issue 2 years ago)

tungph commented 3 years ago

Describe the bug

While trying to get contents from an annotation, I got this error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfe in position 0: invalid start byte

The content for the first annotation is in Japanese.

Code to reproduce the problem

    import pdfplumber

    with pdfplumber.open('test.pdf') as pdf:
        print(pdf.annots)

PDF file

test.pdf

Expected behavior

{'uri': None, 'title': 'tung.phanhuy', 'contents': '日本語'}
{'uri': None, 'title': None, 'contents': None}
{'uri': None, 'title': 'tung.phanhuy', 'contents': '"well"'}
{'uri': None, 'title': None, 'contents': None}
{'uri': None, 'title': 'tung.phanhuy', 'contents': 'table'}
{'uri': None, 'title': None, 'contents': None}

Environment

Additional context

The content for the annotations is in both English and Japanese
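
For context, byte 0xfe at position 0 is consistent with the UTF-16BE byte order mark (0xFE 0xFF) that PDF text strings carry when they hold non-Latin text. The following is a minimal sketch that reproduces the same error message without a PDF. It assumes the raw annotation bytes look like `raw` below; the actual bytes in test.pdf were not inspected here.

```python
import codecs

# Assumption: the annotation's raw bytes resemble a PDF text string
# encoded as UTF-16BE with a leading byte order mark (0xFE 0xFF).
raw = codecs.BOM_UTF16_BE + "日本語".encode("utf-16-be")

try:
    raw.decode("utf-8")
    error_message = None
except UnicodeDecodeError as exc:
    # 0xFE can never start a UTF-8 sequence, so decoding fails at byte 0.
    error_message = str(exc)

print(error_message)

# The BOM tells Python's utf-16 codec the byte order, so this round-trips.
decoded = raw.decode("utf-16")
print(decoded)
```

If that assumption holds, it would explain why only the Japanese annotation fails while the ASCII ones ('"well"', 'table') decode fine as UTF-8.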

tungph commented 3 years ago

@jsvine @samkit-jain To work around the problem, I fall back to decoding with UTF-16 if UTF-8 fails:

    # page.py@125

    for k, v in extras.items():
        if v is not None:
            try:
                extras[k] = v.decode('utf-8')
            except UnicodeDecodeError:
                extras[k] = v.decode('utf-16')

jsvine commented 3 years ago

Thank you for flagging this @tungph. I will look into this.

devWhyqueue commented 3 years ago

Got the same problem trying to extract hyperlinks.

The following code:

    def extract_urls(self):
        with pdfplumber.open(self._file) as pdf:
            return {uri_obj['uri'] for uri_obj in pdf.hyperlinks}

raises a UnicodeDecodeError with this traceback:

  File "...", line 184, in extract_urls
    return {uri_obj['uri'] for uri_obj in pdf.hyperlinks}
  File ".../pdf.py", line 98, in hyperlinks
    return list(itertools.chain(*gen))
  File ".../pdf.py", line 97, in <genexpr>
    gen = (p.hyperlinks for p in self.pages)
  File ".../page.py", line 155, in hyperlinks
    return [a for a in self.annots if a["uri"] is not None]
  File ".../page.py", line 151, in annots
    return list(map(parse, raw))
  File ".../page.py", line 127, in parse
    extras[k] = v.decode("utf-8")
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf6 in position 8: invalid start byte

Would highly appreciate a fix in upcoming versions!
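
Note that byte 0xf6 at position 8 is not a byte order mark, so a plain UTF-16 fallback may not help in this case; 0xf6 is "ö" in Latin-1, which suggests a single-byte encoding. Below is a minimal sketch of a more defensive decoder. `decode_pdf_text` is a hypothetical helper, not part of pdfplumber, and it uses Latin-1 as a stand-in for PDFDocEncoding, which is only an approximation. It is not necessarily what the eventual fix does.

```python
import codecs


def decode_pdf_text(raw: bytes) -> str:
    """Hypothetical helper: decode the raw bytes of a PDF text string."""
    # A leading 0xFE 0xFF marks UTF-16BE text; strip the BOM and decode.
    if raw.startswith(codecs.BOM_UTF16_BE):
        return raw[len(codecs.BOM_UTF16_BE):].decode("utf-16-be")
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Assumption: Latin-1 stands in for PDFDocEncoding here. The two
        # differ for some byte values, so this is only an approximation.
        # Latin-1 maps every byte to a character, so this never raises.
        return raw.decode("latin-1")


# Made-up sample values, chosen to mirror the two errors in this thread.
japanese = decode_pdf_text(b"\xfe\xff" + "日本語".encode("utf-16-be"))
ascii_text = decode_pdf_text(b"table")
umlaut_uri = decode_pdf_text(b"https://\xf6.example")

print(japanese, ascii_text, umlaut_uri)
```

Checking for the BOM first avoids relying on a failed UTF-8 decode to detect UTF-16, and the final fallback means the annotation list can always be built, at the cost of possibly mis-decoding a few characters.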

jsvine commented 3 years ago

Thanks again @tungph for raising this issue and suggesting a fix, and @devWhyqueue for seconding. The commit above should fix this once merged, and the fix should be available in the next release.

jsvine commented 2 years ago

This fix is now part of the latest release, v0.6.0.