paperetl is great and has been useful for my work! It has been working well for most of the PDF papers I feed it, but I am having issues with certain PDFs. I am new to Python, so it's very likely I am doing something wrong, but I thought I'd reach out.
When I run this for a specific PDF:
python3.10 -m paperetl.file /home/bill/brokenone /home/bill/brokenone /home/bill/brokenone
I get this error:
Processing: /home/bill/brokenone/20 Immune Cells Enhance Selectivity of Nanosecond-Pulsed DBD Plasma Against Tumor Cells.pdf
/usr/local/lib/python3.10/dist-packages/bs4/builder/__init__.py:545: XMLParsedAsHTMLWarning: It looks like you're parsing an XML document using an HTML parser. If this really is an HTML document (maybe it's XHTML?), you can ignore or filter this warning. If it's XML, you should know that using an XML parser will be more reliable. To parse this document as XML, make sure you have the lxml package installed, and pass the keyword argument features="xml" into the BeautifulSoup constructor.
  warnings.warn(
Process Process-1:
Total articles inserted: 0
Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.10/dist-packages/paperetl/file/execute.py", line 94, in process
    for result in Execute.parse(params):
  File "/usr/local/lib/python3.10/dist-packages/paperetl/file/execute.py", line 67, in parse
    yield PDF.parse(stream, source)
  File "/usr/local/lib/python3.10/dist-packages/paperetl/file/pdf.py", line 34, in parse
    return TEI.parse(xml, source) if xml else None
  File "/usr/local/lib/python3.10/dist-packages/paperetl/file/tei.py", line 55, in parse
    sections = TEI.text(soup, title)
  File "/usr/local/lib/python3.10/dist-packages/paperetl/file/tei.py", line 247, in text
    name = figure.get("xml:id").upper()
AttributeError: 'NoneType' object has no attribute 'upper'
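
From the last frame of the traceback, it looks like `figure.get("xml:id")` returned `None`, so I assume the XML generated for this PDF contains a `figure` element with no `xml:id` attribute. Below is a minimal sketch of that situation as I understand it. It is not paperetl's code: it uses the standard library's `xml.etree.ElementTree` instead of BeautifulSoup, and the `SAMPLE` fragment and the `figure_names` helper are made up for illustration.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:id attribute under the reserved XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical TEI-like fragment: the second <figure> has no xml:id attribute.
SAMPLE = """
<text>
  <figure xml:id="fig_0"><head>Figure 1</head></figure>
  <figure><head>Figure 2</head></figure>
</text>
"""


def figure_names(xml):
    """Return upper-cased figure ids, with None where xml:id is missing."""
    names = []
    for figure in ET.fromstring(xml).iter("figure"):
        identifier = figure.get(XML_ID)  # None when the attribute is absent
        # Calling identifier.upper() unconditionally would raise
        # AttributeError: 'NoneType' object has no attribute 'upper'
        names.append(identifier.upper() if identifier else None)
    return names


print(figure_names(SAMPLE))  # ['FIG_0', None]
```

If that assumption is right, a check like this before calling `.upper()` would avoid the crash, but I don't know what name paperetl should use for a figure that has no id.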