neuml / paperetl

📄 ⚙️ ETL processes for medical and scientific papers
Apache License 2.0

AttributeError: 'NoneType' object has no attribute 'upper' #46

Closed: wmurphy126 closed this issue 11 months ago

wmurphy126 commented 1 year ago

paperetl is great and has been useful for my work! It has worked well for most of the PDF papers I feed it, but I am having issues with certain PDFs. I am new to Python, so it's very likely I am doing something wrong, but I thought I'd reach out.

When I run this for a specific PDF:

python3.10 -m paperetl.file /home/bill/brokenone /home/bill/brokenone /home/bill/brokenone

I get this error:

Processing: /home/bill/brokenone/20 Immune Cells Enhance Selectivity of Nanosecond-Pulsed DBD Plasma Against Tumor Cells.pdf
/usr/local/lib/python3.10/dist-packages/bs4/builder/__init__.py:545: XMLParsedAsHTMLWarning: It looks like you're parsing an XML document using an HTML parser. If this really is an HTML document (maybe it's XHTML?), you can ignore or filter this warning. If it's XML, you should know that using an XML parser will be more reliable. To parse this document as XML, make sure you have the lxml package installed, and pass the keyword argument features="xml" into the BeautifulSoup constructor.
  warnings.warn(
Process Process-1:
Total articles inserted: 0
Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.10/dist-packages/paperetl/file/execute.py", line 94, in process
    for result in Execute.parse(params):
  File "/usr/local/lib/python3.10/dist-packages/paperetl/file/execute.py", line 67, in parse
    yield PDF.parse(stream, source)
  File "/usr/local/lib/python3.10/dist-packages/paperetl/file/pdf.py", line 34, in parse
    return TEI.parse(xml, source) if xml else None
  File "/usr/local/lib/python3.10/dist-packages/paperetl/file/tei.py", line 55, in parse
    sections = TEI.text(soup, title)
  File "/usr/local/lib/python3.10/dist-packages/paperetl/file/tei.py", line 247, in text
    name = figure.get("xml:id").upper()
AttributeError: 'NoneType' object has no attribute 'upper'
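For anyone reading the traceback: the last frame shows that figure.get("xml:id") returned None, which suggests the TEI XML generated for this PDF contains a figure element with no xml:id attribute. Below is a minimal sketch of the kind of guard that would avoid the crash. It is not the fix that was applied in paperetl: figure_name and the "FIGURE" fallback are hypothetical names chosen for illustration, and plain dicts stand in for parsed figure elements (their .get, like a BeautifulSoup tag's, returns None for a missing attribute).

```python
def figure_name(figure, fallback="FIGURE"):
    """Return the upper-cased xml:id of a figure-like element, or a fallback.

    Hypothetical helper for illustration; not part of paperetl.
    """
    # `figure` is anything with a dict-style .get(); a missing attribute yields None
    identifier = figure.get("xml:id")

    # Only call .upper() when an id is actually present
    return identifier.upper() if identifier else fallback


# Plain dicts stand in for parsed <figure> elements (an assumption for this sketch)
with_id = {"xml:id": "fig_0"}
without_id = {}

print(figure_name(with_id))
print(figure_name(without_id))
```

The same pattern applies to any attribute lookup that may come back empty: check the value before calling a string method on it, and decide explicitly what the fallback should be.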

DrDeception commented 1 year ago

Did you manage to resolve this error?

avani17101 commented 1 year ago

I am facing the same error.

alimsvn commented 1 year ago

I got exactly the same error.