neuml / paperetl

📄 ⚙️ ETL processes for medical and scientific papers
Apache License 2.0

Error either with or without pre-trained attribute file #5

Closed DavidRivasPhD closed 4 years ago

DavidRivasPhD commented 4 years ago

I tried running paperetl on AWS (an Ubuntu 20.04 LTS t2.small instance with 50 GiB of storage) with the following procedure:

The cord-19_2020-09-01.tar.gz (release) dataset was downloaded and extracted in the following download path: ~/cordata. This extraction created a directory ~/cordata/2020-09-01 containing the following files:

~/cordata/2020-09-01/document_parses.tar.gz
~/cordata/2020-09-01/metadata.csv

document_parses.tar.gz was further extracted as a directory named document_parses, which contained the following two subdirectories:

~/cordata/2020-09-01/document_parses/pdf_json
~/cordata/2020-09-01/document_parses/pmc_json
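The extraction steps were roughly equivalent to the following (a sketch, assuming the tarball was downloaded to ~/cordata):

cd ~/cordata
tar -xzf cord-19_2020-09-01.tar.gz    # creates 2020-09-01/
cd 2020-09-01
tar -xzf document_parses.tar.gz       # creates document_parses/{pdf_json,pmc_json}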

entry-dates.csv, generated in Kaggle (https://www.kaggle.com/davidmezzetti/cord-19-article-entry-dates?scriptVersionId=41813239), was also placed in this directory. The command

python3 -m paperetl.cord19 .

was then executed from the ~/cordata/2020-09-01 directory, which contained the following:

~/cordata/2020-09-01/document_parses
~/cordata/2020-09-01/entry-dates.csv
~/cordata/2020-09-01/metadata.csv

The above procedure gave the following error:

FileNotFoundError: [Errno 2] No such file or directory: '/home/ubuntu/.cord19/models/attribute'

This error occurred despite the fact that I had the pre-trained attribute and design files in place:

~/.cord19/models/attribute (from https://www.kaggle.com/davidmezzetti/cord19-study-design/#attribute)
~/.cord19/models/design (from https://www.kaggle.com/davidmezzetti/cord19-study-design/#design)

In another attempt without these two pre-trained files (i.e. starting with an empty ~/.cord19/models directory), I still got the exact same error message.

See error details in the following screenshot:

Screenshot from 2020-09-07 21-17-06

Any help would be appreciated.

davidmezzetti commented 4 years ago

Hi David,

Can you run the following?

ls -l ~/.cord19/models

What does that show? Maybe there are permission problems? If SELinux is enabled, it can also cause tricky permission issues.
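For example (a minimal sketch; these checks only tell you something if SELinux tooling is installed, which is not the case by default on Ubuntu):

getenforce                 # prints Enforcing / Permissive / Disabled when SELinux tools are present
ls -lZ ~/.cord19/models    # lists SELinux security contexts alongside the usual permissions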

You can also pass the paths directly on the command line to paperetl to debug further:

python -m paperetl.cord19 .

# Parameters
# arg1: input directory
# arg2: database url
# arg3: model directory
# arg4: path to entry dates file
# arg5: full database load if True, only loads tagged articles if False
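For example, an explicit invocation might look like this (a sketch only: the paths mirror the directory layout described above, and the exact database url format depends on the paperetl version in use):

# arg2 here assumes a directory path for the SQLite output; adjust to your setup
python -m paperetl.cord19 ~/cordata/2020-09-01 ~/cordata ~/.cord19/models ~/cordata/2020-09-01/entry-dates.csv True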

On another note, running the full CORD-19 dataset on a t2.small with 1 CPU is going to take a long time, and 2 GB of memory may not be enough for that, even with a single process. I would recommend an instance with at least 4 vCPUs and 8 GB+ of RAM to process it in a reasonable amount of time.

DavidRivasPhD commented 4 years ago

Thank you David for your advice. I'm now running paperetl on an AWS Ubuntu 20.04 LTS instance of type t3a.xlarge. My download path is now just ~ (the home directory). I'm still getting the exact same error, and it does not seem to be due to permission issues (see the screenshot with ls -l). The error occurs when the created articles.sqlite database reaches the first 4000 articles, and at that point all the Study Design fields, Tags, and Labels are still NULL (see screenshots). Unless you have new suggestions for debugging, we'll try debugging it as you suggested above.

We also have these important questions for you:

1) Should paperetl only need the CORD-19 tarball (for a specific release date) and entry-dates.csv as inputs, so that it can create the attribute.csv and design.csv files (and, of course, articles.sqlite) by itself?

2) Or do we always need the pre-trained attribute.csv and design.csv files (https://www.kaggle.com/davidmezzetti/cord19-study-design/#attribute and https://www.kaggle.com/davidmezzetti/cord19-study-design/#design) to run paperetl? If so, don't these pre-trained files become outdated for newer CORD-19 releases?

Screenshot from 2020-09-08 22-42-12

Screenshot from 2020-09-08 22-50-00

Screenshot from 2020-09-08 22-50-18

davidmezzetti commented 4 years ago

Hi David,

Please download the files attribute and design (no csv extensions) and put them in ~/.cord19/models - these are the pre-trained model files, and they don't need to be updated after the initial download. attribute.csv and design.csv are training files, and you can safely remove them. At some point, I should revisit the study design classification models to add additional training data, but that is not necessary for each run.
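A minimal setup sketch (the ~/Downloads location is illustrative):

mkdir -p ~/.cord19/models
# move the extension-less model files downloaded from Kaggle into place
mv ~/Downloads/attribute ~/Downloads/design ~/.cord19/models/
# remove the training CSVs if present; they are not used at run time
rm -f ~/.cord19/models/attribute.csv ~/.cord19/models/design.csv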

For each run, you need the CORD-19 data archive file (decompressed to a directory) and the corresponding entry-dates.csv in the same directory; this step looks good in the screenshots above.

If you download the attribute and design files and put them in ~/.cord19/models, you should be able to complete a full run.

The compute instance you have has 4 CPUs; I would estimate it would take somewhere between 3 and 5 hours to run on that instance. An instance with 8 CPUs would take about half that time, as the process scales linearly with the number of CPUs.

DavidRivasPhD commented 4 years ago

Thank you David for your valuable answers. Yes, that was the problem: by mistake I was using attribute.csv and design.csv instead of the binary model files. We'll continue working with your models.