neuml / paperetl

📄 ⚙️ ETL processes for medical and scientific papers
Apache License 2.0

Additional installation steps and bug for CORD-19 #6

Closed DavidRivasPhD closed 4 years ago

DavidRivasPhD commented 4 years ago

Hi David, in addition to following your paperetl installation instructions, I had to take the steps below to get rid of the following error and warning:

Screenshot from 2020-09-06 16-55-15

~$ python3
>>> import nltk
>>> nltk.download('punkt')
>>> exit()

This created the directory ~/nltk_data/tokenizers/punkt and fixed the error above.

Also, the UserWarning below was eliminated as follows:

$ pip3 uninstall scikit-learn==0.23.2
$ pip3 install scikit-learn==0.23.1

Screenshot from 2020-09-09 10-36-26

Then, unfortunately, after a full run, the resulting articles.sqlite database came out with the Study Design, Tags, and Labels fields all NULL (see screenshots below).

Screenshot from 2020-09-11 07-27-52

Screenshot from 2020-09-11 07-29-24

Any ideas on how to solve this NULL issue would be appreciated.

davidmezzetti commented 4 years ago

Hi David,

Unfortunately, NLTK has different models that are installed with different installations. Thank you for sharing that you needed to install an additional model.

Regarding study design, not all documents will have a detected study design (the value will be 0 in that case). Likewise, only certain documents will be tagged as COVID-19.

Please take a look at: https://www.kaggle.com/davidmezzetti/cord-19-etl and https://www.kaggle.com/davidmezzetti/cord-19-analysis-with-sentence-embeddings

These notebooks on Kaggle have all the background on how this process works.

DavidRivasPhD commented 4 years ago

Thank you, David. After examining the results more carefully, I found plenty of non-NULL values (screenshots below), unlike what the earlier screenshots suggested; so the output is correct and there is no such bug.
Yes, I have read your writings on GitHub, Kaggle, and tds, as well as Ankur Mohan’s blogs about your work. Your models are very nice; I have learned a lot from them.
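
For anyone who wants to run the same check without scrolling through a database viewer, here is a minimal sketch that counts NULL rows per column with Python's built-in sqlite3 module. The table name `articles` and the column names `Design` and `Tags` are assumptions used for illustration, not confirmed paperetl schema, and the helper `null_counts` is hypothetical. The example builds a toy in-memory table so it runs on its own.

```python
import sqlite3


def null_counts(conn, table, columns):
    """Return {column: (null_rows, total_rows)} for the given table."""
    counts = {}
    for column in columns:
        # Identifiers cannot be bound as parameters, so they are quoted here.
        # Only pass trusted table/column names.
        query = f'SELECT SUM("{column}" IS NULL), COUNT(*) FROM "{table}"'
        nulls, total = conn.execute(query).fetchone()
        # SUM() yields NULL on an empty table, so fall back to 0.
        counts[column] = (nulls or 0, total)
    return counts


# Toy in-memory stand-in for articles.sqlite (schema is an assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (Id TEXT, Design INTEGER, Tags TEXT)")
conn.executemany(
    "INSERT INTO articles VALUES (?, ?, ?)",
    [("a1", 0, None), ("a2", 3, "COVID-19"), ("a3", 0, None)],
)

result = null_counts(conn, "articles", ["Design", "Tags"])
print(result)
```

To try it on a real run, replace the in-memory connection with `sqlite3.connect(...)` pointing at the generated articles.sqlite file and adjust the table and column names to whatever the database actually contains. A column that is NULL in only some rows, as was the case here, shows a NULL count smaller than the total.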

Screenshot from 2020-09-11 14-21-52 Screenshot from 2020-09-11 14-38-44 Screenshot from 2020-09-11 14-47-02