neuml / paperetl

📄 ⚙️ ETL processes for medical and scientific papers
Apache License 2.0

Can't insert all my data into sqlite database. #56

Open kellytsorb opened 4 months ago

kellytsorb commented 4 months ago

Hello, I used GROBID to convert 2657 PDF files to XML, and then ran this command to insert the XML files into the database it creates:

#!python -m paperetl.file /Users/kellytsorb/paperetl/file/XML_files /Users/kellytsorb/paperetl/SQLite

Only 549 of the files were inserted, and I don't know why. Some of the papers that are missing now were inserted fine earlier, when I tried a smaller batch. Is there a limit on the number of articles that I can insert into the database?

Process Process-2:
Traceback (most recent call last):
  File "/Users/kellytsorb/anaconda3/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/Users/kellytsorb/anaconda3/lib/python3.11/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/kellytsorb/anaconda3/lib/python3.11/site-packages/paperetl/file/execute.py", line 94, in process
    for result in Execute.parse(params):
  File "/Users/kellytsorb/anaconda3/lib/python3.11/site-packages/paperetl/file/execute.py", line 74, in parse
    yield TEI.parse(stream, source)
          ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/kellytsorb/anaconda3/lib/python3.11/site-packages/paperetl/file/tei.py", line 37, in parse
    title = soup.title.text
            ^^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'text'
Total articles inserted: 549

Thank you in advance!
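
The traceback points to a parsing failure, not a database limit: `soup.title` was `None` for some file, which suggests that at least one of the XML files has no `title` element, and the worker process appears to have stopped at that point. A minimal sketch for listing such files before running paperetl is below. The helper names `has_title` and `files_without_title` are hypothetical and not part of paperetl. The check uses only the standard library, so it approximates paperetl's own parsing and does not reproduce it.

```python
import sys
import xml.etree.ElementTree as ET
from pathlib import Path


def has_title(path):
    """Return True if the XML file contains at least one <title> element.

    Hypothetical helper: files that are empty or not well-formed XML
    are also reported as having no title.
    """
    try:
        root = ET.parse(path).getroot()
    except ET.ParseError:
        return False

    for element in root.iter():
        # Tags look like "{namespace}title"; compare the local name only.
        if isinstance(element.tag, str) and element.tag.rsplit("}", 1)[-1] == "title":
            return True
    return False


def files_without_title(directory):
    """Return a sorted list of *.xml files in directory that lack a <title>."""
    return sorted(p for p in Path(directory).glob("*.xml") if not has_title(p))


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python check_titles.py /path/to/XML_files
    for xml_path in files_without_title(sys.argv[1]):
        print(xml_path)
```

Any files this prints are worth opening to see whether GROBID produced an empty or incomplete TEI document for the corresponding PDF. Moving them out of the input directory and rerunning the import would show whether they account for the missing articles.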