metachris / pdfx

Extract text, metadata and references (pdf, url, doi, arxiv) from PDF. Optionally download all referenced PDFs.
http://www.metachris.com/pdfx
Apache License 2.0
1.05k stars 115 forks

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 62: invalid continuation byte #48

Open Helias opened 3 years ago

Helias commented 3 years ago

When I run pdfx file.pdf -v > output.txt, I get this error:

  File "/home/helias/.local/bin/pdfx", line 8, in <module>
    sys.exit(main())
  File "/home/helias/.local/lib/python3.8/site-packages/pdfx/cli.py", line 158, in main
    pdf = pdfx.PDFx(args.pdf)
  File "/home/helias/.local/lib/python3.8/site-packages/pdfx/__init__.py", line 128, in __init__
    self.reader = PDFMinerBackend(self.stream)
  File "/home/helias/.local/lib/python3.8/site-packages/pdfx/backends.py", line 236, in __init__
    refs = self.resolve_PDFObjRef(page.annots)
  File "/home/helias/.local/lib/python3.8/site-packages/pdfx/backends.py", line 273, in resolve_PDFObjRef
    return [self.resolve_PDFObjRef(item) for item in obj_ref]
  File "/home/helias/.local/lib/python3.8/site-packages/pdfx/backends.py", line 273, in <listcomp>
    return [self.resolve_PDFObjRef(item) for item in obj_ref]
  File "/home/helias/.local/lib/python3.8/site-packages/pdfx/backends.py", line 305, in resolve_PDFObjRef
    return Reference(obj_resolved["A"]["URI"].decode("utf-8"), self.curpage)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 62: invalid continuation byte

I guess it is related to decoding a URI from the PDF as UTF-8. Is there a way to solve it?

It seems to come from this line: https://github.com/metachris/pdfx/blob/master/pdfx/backends.py#L305
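
A minimal reproduction of the same error, as a sketch. It assumes the URI in the PDF contains a Latin-1 encoded "é" (the single byte 0xe9); the sample URL is hypothetical and not taken from the actual file.

```python
# Hypothetical URI bytes: "é" encoded as ISO-8859-1 is the single byte 0xe9.
raw_uri = "http://example.com/café.pdf".encode("iso-8859-1")

error = None
try:
    raw_uri.decode("utf-8")
except UnicodeDecodeError as exc:
    # In UTF-8, 0xe9 announces a three-byte sequence, so the "." that
    # follows it is rejected as an invalid continuation byte.
    error = exc

print(error.reason, "at position", error.start)
```

If the real PDF stores its link targets in a single-byte encoding like this, any non-ASCII character in a URI would trigger the failure in resolve_PDFObjRef.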

Helias commented 3 years ago

I solved it by replacing decode('utf-8') with decode('ISO-8859-1') in the code. I don't know whether it's good to replace it outright, or whether we should use a try/except and put the decode('ISO-8859-1') in the except branch.
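
The try/except idea could look roughly like this. It is only a sketch: decode_uri is a hypothetical helper name, not something that exists in pdfx, and ISO-8859-1 is assumed as the fallback simply because it worked for my file.

```python
def decode_uri(raw: bytes) -> str:
    """Decode URI bytes as UTF-8, falling back to ISO-8859-1 (hypothetical helper)."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # ISO-8859-1 maps every byte to a code point, so this call cannot
        # fail, but the result may be wrong if the real encoding differs.
        return raw.decode("iso-8859-1")


print(decode_uri("café".encode("utf-8")))
print(decode_uri("café".encode("iso-8859-1")))
```

Trying UTF-8 first keeps the current behaviour for files that already work, and the fallback only kicks in for the bytes that used to crash.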

Helias commented 3 years ago

I made a pull request for this; I hope you find it useful.

To me the try/except is a bit dirty, but it works locally, so maybe it's a good temporary solution.