neuml / paperetl

📄 ⚙️ ETL processes for medical and scientific papers
Apache License 2.0

KeyError: 'pdf_json_files' #3

Closed (SeekPoint closed this issue 4 years ago)

SeekPoint commented 4 years ago

ghSrc/paperetl % python -m paperetl.cord19 2020-03-27
Building articles database from 2020-03-27
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/usr/local/Cellar/python@3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/usr/local/Cellar/python@3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/pool.py", line 48, in mapstar
    return list(map(*args))
  File "/usr/local/lib/python3.8/site-packages/paperetl/cord19/execute.py", line 184, in process
    sections, citations = Section.parse(row, indir)
  File "/usr/local/lib/python3.8/site-packages/paperetl/cord19/section.py", line 49, in parse
    for path in Section.files(row):
  File "/usr/local/lib/python3.8/site-packages/paperetl/cord19/section.py", line 100, in files
    if row[column]:
KeyError: 'pdf_json_files'
"""

davidmezzetti commented 4 years ago

Based on the date of 2020-03-27, are you trying to run on CORD-19 data from that date? The format changed on 2020-05-12, and this method only supports data dumps from that date onward:

https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/historical_releases.html

Unless you have a specific reason, I would use the latest dump from the link above.
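A quick way to tell whether a given dump is too old is to look at the header of metadata.csv before starting the build. The snippet below is only a minimal sketch: the function name is made up for illustration, and the only column it checks is pdf_json_files, which is the one named in the KeyError above. It does not claim to list everything paperetl reads.

```python
import csv


def has_expected_columns(metadata_path, required=("pdf_json_files",)):
    """Return True if every column in `required` appears in the CSV header.

    `required` defaults to the single column named in the KeyError above;
    this is an assumption for illustration, not the full set paperetl needs.
    """
    with open(metadata_path, newline="", encoding="utf-8") as handle:
        # Only the first record (the header) is read; an empty file gives [].
        header = next(csv.reader(handle), [])
    return all(column in header for column in required)
```

If has_expected_columns("metadata.csv") returns False, the dump most likely predates the format change and a newer release should be used instead.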

SeekPoint commented 4 years ago

OK, I just want to try it on a small file.

davidmezzetti commented 4 years ago

The best thing to do would be to download the latest dump, extract it, and filter a few rows from metadata.csv.

For example:

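# Extract the dump and keep a backup of the full metadata file
tar -xvzf cord-19_2020-08-12.tar.gz
cd 2020-08-12
mv metadata.csv metadata.csv.bkup
# Keep the header line plus the next 499 lines
head -n 500 metadata.csv.bkup > metadata.csv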
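
One caveat with head is that it counts physical lines, so if a quoted field in metadata.csv happens to contain a line break, the last record could be cut in half. A rough Python equivalent that counts CSV records instead might look like the sketch below. The function name and the default of 499 rows (header plus 499 records, to mirror the 500 lines above) are just illustrative choices.

```python
import csv


def sample_metadata(source_path, target_path, rows=499):
    """Copy the header plus the first `rows` records; return records written."""
    written = 0
    with open(source_path, newline="", encoding="utf-8") as source, \
            open(target_path, "w", newline="", encoding="utf-8") as target:
        reader = csv.reader(source)
        writer = csv.writer(target)
        header = next(reader, None)
        if header is None:
            # Empty input: nothing to copy.
            return 0
        writer.writerow(header)
        for record in reader:
            if written >= rows:
                break
            # csv.reader yields whole records, even if a field spans lines.
            writer.writerow(record)
            written += 1
    return written
```

After the mv step above, it could be called as sample_metadata("metadata.csv.bkup", "metadata.csv").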

SeekPoint commented 4 years ago

It works.