Training data?

memray commented 5 years ago

Hi,

Thank you for sharing this great work. I wonder how we can obtain the datasets used to train cui2vec:

- a nationwide US health insurance plan with 60 million members over the period of 2008-2015
- a dataset of concept co-occurrences from 20 million notes at Stanford
- an open access collection of 1.7 million full text journal articles obtained from PubMed Central (I know this is accessible)

Thank you, Rui Meng

hscells commented 5 years ago

Hi Rui,

This code is not affiliated with the authors of the publication. I recommend asking the actual authors.

> a nationwide US health insurance plan with 60 million members over the period of 2008-2015

I do not see references in the paper for this dataset, so it is likely not available for public use (see section 3.1).

> a dataset of concept co-occurrences from 20 million notes at Stanford

The authors reference this paper: https://www.nature.com/articles/sdata201432 (Building the graph of medicine from millions of clinical narratives)

> an open access collection of 1.7 million full text journal articles obtained from PubMed Central (I know this is accessible)

This appears to be a subset of PMC, which is indeed freely available: https://www.ncbi.nlm.nih.gov/pmc/ (see the section called Developers). However, it is unclear how the authors filtered the articles.

If you do decide to contact the authors, it would be great if you could post their answers in this issue, as I think they would benefit anyone else wondering the same thing.

Cheers, Harry

memray commented 5 years ago

Hi Harry,

Thank you for your kind reply! I will contact the authors and come back once I have the answer.

Best, Rui

hscells / cui2vec

Training data? #3