Open DavidLikesLearning opened 1 year ago
Hi @elsirdavid
Thanks for the detailed debugging information! Let's test a few things first (using the dev branch).
Test the zip file's checksum via the command line: md5 umls-2020AB-metathesaurus.zip
--> 69d2929e0902e7e42af0b2cb74d5005a
or by using the use_checksum flag in UMLS.init_from_nlm_zip(NLM_ZIPFILE_PATH, use_checksum=True).
You can init from scratch using the environment.yml file: conda env create -f environment.yml
If neither of these fixes the UMLS issue, we can dive deeper into debugging.
Hi @jason-fries (and Happy New Year!!)
Thank you for your help.
I couldn't use the md5 command from the command line. I did use the suggested checksum flag and used other code to get an MD5 hash of the file.
The checksum was added inline; the hash is below the list of Python libraries in the environment. The UMLS code seems to have a problem with the declaration of the 'release' variable.
1_Installing_the_UMLS_md5_checksum.pdf
For the creation of a new environment, I used the 'requirements.txt' file as directed by the README. This manages to install some libraries but crashes when collecting scipy (error in preparing metadata regarding pyproject.toml).
I installed msgpack and pandas by hand. The results were the same and are below:
Hi @elsirdavid
Two issues: (1) For your MD5 hash check, your provided code
import hashlib
md5 = hashlib.md5(b'umls-2020AB-metathesaurus.zip')
print(md5, '\n',md5.digest())
generates a hash of the string literal, not the contents of the UMLS zip file. You'll want to use
hashlib.md5(open("umls-2020AB-metathesaurus.zip", "rb").read()).hexdigest()
to generate a hash of the contents of the zip file. The above code snippet should return 69d2929e0902e7e42af0b2cb74d5005a for the 2020AB release. If you get a different hash, your file is corrupted and should be redownloaded from the NLM.
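Since the one-liner above reads the whole zip into memory, a chunked variant may be gentler on machines with limited RAM. The following is only a minimal sketch: the helper names md5_of_file and zip_matches_release are hypothetical and not part of Trove, and the expected hash is the 2020AB value quoted above.

```python
import hashlib

# Expected MD5 for the 2020AB release, as quoted in this thread.
EXPECTED_MD5 = "69d2929e0902e7e42af0b2cb74d5005a"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks (1 MiB by default)."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # iter() with a sentinel keeps calling fh.read() until it returns b"" (end of file).
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def zip_matches_release(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 equals the expected checksum (case-insensitive)."""
    return md5_of_file(path) == expected.lower()
```

Calling zip_matches_release("umls-2020AB-metathesaurus.zip") from the directory that holds the download should return True for an intact 2020AB file.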
(2) Trove is only tested with Python 3.7.x. From your PDF, it looks like your environment is 3.9.7. If you create a fresh env using conda env create -f environment.yml, it should install the correct Python version.
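To confirm which interpreter a notebook is actually running, a quick check along these lines could be placed in the first cell. This is a rough sketch, not something Trove provides: is_tested_python is a hypothetical helper, and the only assumption it encodes is the "3.7.x" statement above.

```python
import sys


def is_tested_python(version_info=None):
    """Return True if the given (or current) interpreter version is Python 3.7.x."""
    if version_info is None:
        version_info = sys.version_info
    # Compare only major and minor; any 3.7 patch release counts.
    return tuple(version_info[:2]) == (3, 7)


if not is_tested_python():
    print("Warning: running Python %d.%d.%d; Trove is only tested with 3.7.x"
          % tuple(sys.version_info[:3]))
```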
On my machine, installing from the latest commit on the trove dev branch using a fresh conda env works, so let's see if any of the above are the source of your issues.
Also make certain to wipe your temp directory (~/.trove/umls2022AB in your code) if the installation of the UMLS bombs out.
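Wiping that directory can be done from Python as well as from a shell. The sketch below assumes the cache lives at the path mentioned above; wipe_cache_dir is a hypothetical helper, not a Trove function, and it permanently deletes the directory it is given.

```python
import shutil
from pathlib import Path


def wipe_cache_dir(cache_dir):
    """Delete cache_dir and everything in it; return True if a directory was removed."""
    path = Path(cache_dir).expanduser()  # expands a leading "~" to the home directory
    if path.is_dir():
        shutil.rmtree(path)
        return True
    return False
```

For the setup in this thread, the call would be wipe_cache_dir("~/.trove/umls2022AB"), run before retrying the UMLS installation.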
Could you point me to that environment.yml file? I can't find it on GitHub or in any of the folders I've searched. The README from trove suggests using requirements.txt, but as I mentioned earlier, that fails too. I'm not certain how to make this environment, then.
Also, thanks for fixing my hash code. The file is indeed not corrupted; I do get the right hash, thankfully.
Thank you for the idea of changing branches. I have now tried to use the relevant yml file. The creation fails with the output in the included txt file. I am going to try to install the relevant libraries and Python version by hand. create_env.txt
I ended up installing Python 3.7, msgpack, and pandas as the yml file directed, and the resulting notebook is here: 1_Installing_the_UMLS_013123.pdf
Describe the bug
I can't install the UMLS as directed by the tutorial notebooks. The UMLS object can't be initialized.
Steps to reproduce the bug
I downloaded the relevant zip file from the provided link (https://download.nlm.nih.gov/umls/kss/2020AB/umls-2020AB-metathesaurus.zip) and placed the file in the same directory as the 1_Installing_the_UMLS.ipynb notebook in the tutorials folder. Then I ran the notebook as given on GitHub.
Sample code to reproduce the bug
Expected results
A clear and concise description of the expected results.
Actual results
Specify the actual results or traceback.
The libraries and python version are all on the pdf attached
Environment info
The libraries and Python version are all in the attached PDF.