Closed SeekPoint closed 4 years ago
Based on the date 2020-03-27, are you trying to run on CORD-19 data from that date? The format changed on 2020-05-12, and this method only supports data dumps from that date onward:
https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/historical_releases.html
Unless you have a specific reason, I would use the latest dump from the link above.
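A quick way to tell whether a dump uses the newer format is to look for the column named in the KeyError. The snippet below is only a sketch, not part of paperetl: the helper name `has_column` is hypothetical, and it assumes metadata.csv is UTF-8 with a header row.

```python
import csv


def has_column(metadata_path, column="pdf_json_files"):
    """Return True if the CSV header at metadata_path contains column.

    Hypothetical helper for illustration; not part of paperetl.
    """
    with open(metadata_path, newline="", encoding="utf-8") as handle:
        # Read only the first record (the header); empty files yield [].
        header = next(csv.reader(handle), [])
    return column in header


if __name__ == "__main__":
    import sys

    # Usage: python check_dump.py path/to/metadata.csv
    if len(sys.argv) > 1:
        print(has_column(sys.argv[1]))
```

If this prints False for a dump, the run would be expected to hit the same KeyError: 'pdf_json_files'.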
OK, I just want to try it on a small file.
The best thing to do would be to download the latest dump, extract it, and keep only the first few rows of metadata.csv.
For example:
tar -xvzf cord-19_2020-08-12.tar.gz
cd 2020-08-12
mv metadata.csv metadata.csv.bkup
head -500 metadata.csv.bkup > metadata.csv
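One caveat with head is that it counts lines, not CSV records, so a quoted field containing a newline could be cut in half. If that turns out to be a problem, a record-aware version of the same step might look like the sketch below. The function name `keep_first_rows` is hypothetical, and it assumes the file is UTF-8; note that it keeps the header plus `rows` data records, which differs slightly from head -500.

```python
import csv


def keep_first_rows(src, dst, rows=500):
    """Copy the header and the first `rows` data records from src to dst.

    Hypothetical helper for illustration; not part of paperetl.
    """
    with open(src, newline="", encoding="utf-8") as fin, \
            open(dst, "w", newline="", encoding="utf-8") as fout:
        reader = csv.reader(fin)
        writer = csv.writer(fout)
        for index, record in enumerate(reader):
            # Index 0 is the header; stop once `rows` data records are written.
            if index > rows:
                break
            writer.writerow(record)


if __name__ == "__main__":
    import os

    # Mirrors the shell steps above: read the backup, write a smaller metadata.csv.
    if os.path.exists("metadata.csv.bkup"):
        keep_first_rows("metadata.csv.bkup", "metadata.csv", rows=500)
```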
It works.
ghSrc/paperetl % python -m paperetl.cord19 2020-03-27
Building articles database from 2020-03-27
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/usr/local/Cellar/python@3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/usr/local/Cellar/python@3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/pool.py", line 48, in mapstar
    return list(map(*args))
  File "/usr/local/lib/python3.8/site-packages/paperetl/cord19/execute.py", line 184, in process
    sections, citations = Section.parse(row, indir)
  File "/usr/local/lib/python3.8/site-packages/paperetl/cord19/section.py", line 49, in parse
    for path in Section.files(row):
  File "/usr/local/lib/python3.8/site-packages/paperetl/cord19/section.py", line 100, in files
    if row[column]:
KeyError: 'pdf_json_files'
"""