piskvorky / gensim-data

Data repository for pretrained NLP models and NLP corpora.
https://rare-technologies.com/new-api-for-pretrained-nlp-models-and-datasets-in-gensim/
GNU Lesser General Public License v2.1

Add cui2vec embeddings #25

Open souravsingh opened 6 years ago

souravsingh commented 6 years ago

The embeddings for over 100k medical concepts, built from data on 60 million patients, 1.7 million journal articles and 20 million notes, are up and available here: https://figshare.com/s/00d69861786cd0156d81

An explorer is available here: http://ec2-52-14-191-192.us-east-2.compute.amazonaws.com:1234/

piskvorky commented 6 years ago

Nice find!

menshikh-iv commented 6 years ago

Additional information:

beamandrew commented 6 years ago

Hey this is my paper, how cool! I'd be happy to contribute these, let me know if they need any clean up first.

menshikh-iv commented 6 years ago

Oh, hi @beamandrew, glad to see you here! Please follow the instruction https://github.com/RaRe-Technologies/gensim-data#want-to-add-a-new-corpus-or-model

beamandrew commented 6 years ago

Will do! It might be a couple weeks until I can get it together. I'm teaching a deep learning class right now that won't end until May which keeps me pretty busy.

I'm actually having them use the embeddings from this repo in class to build an RNN (which is how I ended up finding this issue).

You can check it out here if you're interested: https://colab.research.google.com/drive/1JsdhsiJQP5JPEEGWWFtOMpQajBj4w1KA

menshikh-iv commented 6 years ago

@beamandrew can you please give read access to ivan@radimrehurek.com? (I can't open your link, lack of permissions.)

beamandrew commented 6 years ago

Oops, try this link which should let you view: https://drive.google.com/file/d/1WuoHWf1KyFsNiilbVa7qnKkSDALfch01/view?usp=sharing

matanox commented 6 years ago

Last I checked, the actual concept names aren't included in this dataset and/or aren't under the same license, but they are available from a different source which looks legitimately released. I have, in fact, a task to correlate them. Without this correlation, the embeddings discussed here contain arbitrary codes instead of the original (concept) words that you see in the online demo.

hscells commented 6 years ago

I currently have some data, from the author of this publication (Section 2), that allows for the mapping @matanster describes.

If anyone is interested I can upload a link to this as I sit next to the author and he has given his permission @jimmyoentung.

piskvorky commented 6 years ago

Thanks guys.

What we want is for users who download this dataset to be able to use it easily.

If the dataset requires users to jump through hoops, it's not a good fit for gensim-data. The experience of applying / using a dataset has to be streamlined and intuitive, including access and code (not just data). That is why we created this repo, and it's a mandatory part of each new contribution.

@hscells and @matanster what does this extra step mean for users? Can we somehow integrate it directly, so it's transparent to people who want to use cui2vec? Is it necessary?

hscells commented 6 years ago

The CUI in cui2vec stands for Concept Unique Identifier. A CUI identifies a single medical concept and groups together all of the synonymous strings for that concept.

The dataset which I described in my comment is a mapping from each CUI to the most commonly used string for it in the UMLS Metathesaurus. One may simply replace the CUIs in the pre-trained vector file with terms from this mapping file (although I believe not all CUIs are mapped, because the semantic types of the strings were filtered in this particular dataset).
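The replacement described above could look roughly like the sketch below. It is only an illustration: `relabel_with_terms` is a hypothetical helper, the CUIs and terms in the usage note are made up, and the real mapping file's format is not shown in this thread. CUIs without a mapping are kept as they are, since not all CUIs are covered.

```python
def relabel_with_terms(vectors, cui_to_term):
    """Return a new dict of vectors keyed by term where a mapping exists.

    vectors:     dict of CUI -> vector (any sequence of floats)
    cui_to_term: dict of CUI -> most commonly used string (assumed format)
    """
    relabeled = {}
    for cui, vector in vectors.items():
        # Fall back to the CUI itself when the mapping has no entry for it.
        term = cui_to_term.get(cui, cui)
        # word2vec text format splits on whitespace, so join multi-word terms.
        key = term.replace(" ", "_")
        if key in relabeled:
            # Two CUIs mapped to the same string: keep both entries distinct.
            key = f"{key}|{cui}"
        relabeled[key] = vector
    return relabeled
```

For example, `relabel_with_terms({"C0000001": [0.1], "C0000002": [0.2]}, {"C0000001": "example term"})` keeps `C0000002` under its CUI and stores the first vector under `example_term`.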

One may use QuickUMLS or MetaMap to map a term to a CUI and then, using the method described above, map the CUI to the most commonly used term in UMLS or MetaMap.

I'm not exactly sure how the demo in the OP maps CUIs to strings, but I believe this is most likely how it would be done. As for how it could be integrated, @piskvorky: the original data could be modified, or the mapping could be performed in a separate step. However, as I said, because the relationship between a CUI and the strings associated with that concept is one-to-many, this mapping would preferably be performed as two separate steps.

piskvorky commented 6 years ago

No problem, as long as the process is clearly described to users and the dataset is ready to use out of the box.

juancq commented 5 years ago

Just curious, any progress on this issue?

andresrosso commented 5 years ago

Hi, does anybody know if the 'cui2vec' dataset is available? @souravsingh shared the vectors as CSV, but I don't know how to load them in gensim and start using them. Can anyone help me or tell me when the dataset will be ready?

andresrosso commented 5 years ago

@souravsingh can I load the CSV in gensim?

Can you tell me how to do that?

beamandrew commented 5 years ago

Hi everyone,

I am lead author on this paper. Apologies for the radio silence on this request. We are currently working on a revision to the paper/approach that we hope to release this month. I will check back in and try to make it gensim compatible at that time.

menshikh-iv commented 5 years ago

@juancq @andresrosso sorry for the wait, I can't say when this will be added. BTW, you can always load it manually (without api.load, just read the file from disk or S3).

menshikh-iv commented 5 years ago

@beamandrew great, thanks!

prabhatM commented 5 years ago

Is there any model using SNOMED CT data?

Dhanachandra commented 5 years ago

Please share the source code for the evaluation metrics used in this work. I would like to evaluate my own embeddings trained on EHRs. Thanks in advance.

kaushikacharya commented 4 years ago

@andresrosso Here are the steps for loading cui2vec in gensim:

  1. Download the pre-trained embeddings from the download url mentioned in http://cui2vec.dbmi.hms.harvard.edu/

  2. Dump the embeddings into a text file in word2vec format in these two steps:

  3. Load the word vectors using gensim.models.keyedvectors.KeyedVectors:
from gensim.models.keyedvectors import KeyedVectors

word_vectors = KeyedVectors.load_word2vec_format('cui2vec_pretrained.txt', binary=False)

# An example
word_vectors.most_similar('C0034079')

Source: https://stackoverflow.com/questions/46297740/how-to-turn-embeddings-loaded-in-a-pandas-dataframe-into-a-gensim-model (Ken Syme's answer)
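The step of dumping the embeddings into word2vec text format can be sketched as follows. This is a minimal sketch, not the code from the linked answer: `csv_to_word2vec_text` is a hypothetical helper, and it assumes the downloaded CSV has one header row, the CUI in the first column and the vector components in the remaining columns. It uses only the standard library and reads the file in two passes, so the whole table never has to sit in memory.

```python
import csv


def csv_to_word2vec_text(csv_path, txt_path, has_header=True):
    """Convert a CSV of embeddings into word2vec text format.

    Assumed CSV layout (not confirmed in this thread): one row per concept,
    first column the CUI, remaining columns the vector components.
    Returns (number_of_vectors, vector_size).
    """

    def rows():
        with open(csv_path, newline="", encoding="utf-8") as src:
            reader = csv.reader(src)
            if has_header:
                next(reader, None)  # skip the column-name row
            for row in reader:
                if row:  # ignore blank lines
                    yield row

    # First pass: count vectors and check that every row has the same width.
    count, dim = 0, None
    for row in rows():
        if dim is None:
            dim = len(row) - 1
        elif len(row) - 1 != dim:
            raise ValueError(f"inconsistent vector length for {row[0]}")
        count += 1
    if count == 0:
        raise ValueError(f"no embedding rows found in {csv_path}")

    # Second pass: write the "<count> <dim>" header, then one vector per line.
    with open(txt_path, "w", encoding="utf-8") as dst:
        dst.write(f"{count} {dim}\n")
        for row in rows():
            dst.write(" ".join(row) + "\n")
    return count, dim
```

Calling `csv_to_word2vec_text('cui2vec_pretrained.csv', 'cui2vec_pretrained.txt')` (the CSV file name is an assumption) would produce the text file that `KeyedVectors.load_word2vec_format(..., binary=False)` reads in the loading step.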

andresrosso commented 4 years ago

Great work, thanks a lot.