som-shahlab / trove

Weakly supervised medical named entity classification
Apache License 2.0

UMLS.init_from_nlm_zip can't decode charmap #8

Open DavidLikesLearning opened 1 year ago

DavidLikesLearning commented 1 year ago

Describe the bug

I can't install the UMLS as directed by the tutorial notebooks. The UMLS object can't be initialized.

Steps to reproduce the bug

I downloaded the relevant zip file from the provided link (https://download.nlm.nih.gov/umls/kss/2020AB/umls-2020AB-metathesaurus.zip) and placed it in the same directory as the 1_Installing_the_UMLS.ipynb notebook in the tutorials folder. Then I ran the notebook exactly as it appears in the GitHub repository.

Sample code to reproduce the bug

Expected results

A clear and concise description of the expected results.

Actual results

Specify the actual results or traceback.

The libraries and Python version are all listed in the attached PDF.

Environment info

troveDecodeError

jason-fries commented 1 year ago

Hi @elsirdavid, thanks for the detailed debugging information! Let's test a few things first (using the dev branch):

1. Can you confirm that the UMLS zip file isn't corrupted?

Test this from the command line with md5 umls-2020AB-metathesaurus.zip, which should print 69d2929e0902e7e42af0b2cb74d5005a, or by setting the use_checksum flag: UMLS.init_from_nlm_zip(NLM_ZIPFILE_PATH, use_checksum=True)

2. Try creating a new conda env using the environment.yml file

You can init from scratch using conda env create -f environment.yml

If neither of these fixes the UMLS issue, we can dive deeper into debugging.

DavidLikesLearning commented 1 year ago

Hi @jason-fries (and Happy New Year!!)

Thank you for your help.

I couldn't use the md5 command from the command line. I did use the suggested checksum flag, and I used other code to get an MD5 hash of the file.

The checksum was added inline, the hash is below the list of python libraries in the environment. The UMLS code seems to have a problem with the declaration of the 'release' variable.

1_Installing_the_UMLS_md5_checksum.pdf

For the creation of a new environment, I used the 'requirements.txt' file as directed by the README. This manages to install some libraries but crashes when collecting scipy (error in preparing metadata regarding pyproject.toml).

troveDistUtilsFail

I installed msgpack, pandas by hand. The results were the same and are below:

1_Installing_the_UMLS-Copy-trove_env_md5_checksum.pdf

jason-fries commented 1 year ago

Hi @elsirdavid

Two issues: (1) For your MD5 hash check, your provided code

import hashlib
md5 = hashlib.md5(b'umls-2020AB-metathesaurus.zip')
print(md5, '\n',md5.digest()) 

generates a hash of the string literal (the file name itself), not the contents of the UMLS zip file. You'll want to use

hashlib.md5(open("umls-2020AB-metathesaurus.zip", "rb").read()).hexdigest()

to generate a hash of the contents of the zip file. The above code snippet should return 69d2929e0902e7e42af0b2cb74d5005a for the 2020AB release. If you get a different value, your file is corrupted and should be redownloaded from the NLM.
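Since the Metathesaurus zip is large, reading it in chunks avoids loading the whole file into memory at once. The following is a minimal sketch using only the standard library; the helper names file_md5 and is_intact are made up for illustration and are not part of trove.

```python
import hashlib

# MD5 quoted in this thread for the 2020AB Metathesaurus zip.
EXPECTED_MD5 = "69d2929e0902e7e42af0b2cb74d5005a"


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file's contents, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fp:
        # iter() with a sentinel keeps calling fp.read until it returns b"".
        for chunk in iter(lambda: fp.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_intact(path, expected=EXPECTED_MD5):
    """Hypothetical helper: True if the file's MD5 matches the expected value."""
    return file_md5(path) == expected


# Example use (assumes the zip sits in the current directory):
#   print(is_intact("umls-2020AB-metathesaurus.zip"))
```

Note that the digest is computed over the bytes of the file, which is the difference from hashing the file name as a bytes literal.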

(2) Trove is only tested with Python 3.7.x. From your PDF it looks like your environment is 3.9.7. If you create a fresh env using conda env create -f environment.yml, it should install the correct Python version.
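To confirm which interpreter a notebook kernel is actually running, a quick check like the one below can be pasted into a cell. This is only a sketch: the function name is_tested_python is hypothetical, and the (3, 7) target simply reflects the version mentioned above.

```python
import sys


def is_tested_python(version_info=sys.version_info, tested=(3, 7)):
    """Return True if the major.minor version matches the tested one."""
    return tuple(version_info[:2]) == tested


if not is_tested_python():
    # Only a warning; other versions may or may not work.
    print("Warning: running Python %d.%d, but 3.7.x is the tested version."
          % tuple(sys.version_info[:2]))
```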

On my machine installing from the latest trove dev branch commit using a fresh conda env works, so let's see if any of the above are the source of your issues.

Also make certain to wipe your temp directory (~/.trove/umls2022AB in your code) if the installation of the UMLS bombs out.
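A small sketch of wiping that directory from Python is below. The helper name wipe_cache is made up for illustration, and the path in the usage comment is the one from your code, so adjust it if yours differs.

```python
import shutil
from pathlib import Path


def wipe_cache(cache_dir):
    """Delete cache_dir and everything under it; return True if it existed."""
    path = Path(cache_dir).expanduser()  # resolves a leading "~"
    if path.is_dir():
        shutil.rmtree(path)
        return True
    return False


# Example use (path taken from the notebook in this thread):
#   wipe_cache("~/.trove/umls2022AB")
```

Running this before retrying the install means a partially written cache from the failed run will not be picked up again.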

DavidLikesLearning commented 1 year ago

Could you point me to that environment.yml file? I can't find it in the GitHub repo or in any of the folders I've searched. The README from trove suggests using requirements.txt but, as I mentioned earlier, that fails too. I'm not certain how to create this environment, then.

DavidLikesLearning commented 1 year ago

Also, thanks for fixing my hash code. It is indeed not corrupted, I do get the right hash thankfully.

DavidLikesLearning commented 1 year ago

Thank you for the idea of changing branches. I have now tried to use the relevant yml file. The creation fails with the output in the included txt file. I am going to try to install the relevant libraries and Python version by hand. create_env.txt

DavidLikesLearning commented 1 year ago

I ended up installing Python 3.7, msgpack and pandas as the yml file directed, and the resulting notebook is here: 1_Installing_the_UMLS_013123.pdf