metachris / pdfx

Extract text, metadata and references (pdf, url, doi, arxiv) from PDF. Optionally download all referenced PDFs.
http://www.metachris.com/pdfx
Apache License 2.0
1.03k stars, 113 forks

Changed desc.getchildren() to desc.iter() #45

Closed. vmanke closed this pull request 3 years ago.

vmanke commented 3 years ago

With this change, pdfx also runs on newer Python versions such as 3.9, where Element.getchildren() is no longer available.

almereyda commented 3 years ago

Since there is no associated issue, documenting the error here:

$ pip install --user pdfx

$ python --version
Python 3.9.2

$ pdfx -v 10-years-Report-EN.pdf
Traceback (most recent call last):
  File "/home/yala/.local/bin/pdfx", line 8, in <module>
    sys.exit(main())
  File "/home/yala/.local/lib/python3.9/site-packages/pdfx/cli.py", line 149, in main
    pdf = pdfx.PDFx(args.pdf)
  File "/home/yala/.local/lib/python3.9/site-packages/pdfx/__init__.py", line 127, in __init__
    self.reader = PDFMinerBackend(self.stream)
  File "/home/yala/.local/lib/python3.9/site-packages/pdfx/backends.py", line 182, in __init__
    self.metadata.update(xmp_to_dict(metadata))
  File "/home/yala/.local/lib/python3.9/site-packages/pdfx/libs/xmp.py", line 89, in xmp_to_dict
    return XmpParser(xmp).meta
  File "/home/yala/.local/lib/python3.9/site-packages/pdfx/libs/xmp.py", line 51, in meta
    for el in desc.getchildren():
AttributeError: 'xml.etree.ElementTree.Element' object has no attribute 'getchildren'
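The AttributeError above comes from `Element.getchildren()`, which was removed from `xml.etree.ElementTree` in Python 3.9. As a rough sketch of what the replacement changes: iterating over an element directly visits only its immediate children (what `getchildren()` used to return), while `iter()` also visits the element itself and all nested descendants. The XML below is a made-up stand-in for an XMP description, not data from pdfx, and whether the extra nodes matter for `xmp_to_dict` is not something this example establishes.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for an XMP rdf:Description element; not taken from pdfx.
desc = ET.fromstring(
    "<Description>"
    "<title><Alt><li>Report</li></Alt></title>"
    "<creator>someone</creator>"
    "</Description>"
)

# Direct iteration yields only the immediate children,
# which matches what the removed getchildren() returned.
direct_children = [el.tag for el in desc]

# iter() walks the element itself and every descendant in document order.
all_nodes = [el.tag for el in desc.iter()]

print(direct_children)  # ['title', 'creator']
print(all_nodes)        # ['Description', 'title', 'Alt', 'li', 'creator']
```

If only the direct children are wanted, `for el in desc:` or `list(desc)` is the closest equivalent to the old call; `desc.iter()` is a superset that may or may not be acceptable depending on how the caller handles nested elements.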

I have tested this patch, and it resolves the error for me.

$ wget https://github.com/metachris/pdfx/pull/45/commits/f154c7c24e81a66e4a73f3a9dde71e3a123f30f6.patch

$ patch ~/.local/lib/python3.9/site-packages/pdfx/libs/xmp.py f154c7c24e81a66e4a73f3a9dde71e3a123f30f6.patch

$ pdfx -v 10-years-Report-EN.pdf

Document infos:
...

metachris commented 3 years ago

Thanks!