piskvorky / gensim-data

Data repository for pretrained NLP models and NLP corpora.
https://rare-technologies.com/new-api-for-pretrained-nlp-models-and-datasets-in-gensim/
GNU Lesser General Public License v2.1

Add medical corpora + pretrained models #5

Open piskvorky opened 6 years ago

piskvorky commented 6 years ago

The National Library of Medicine (NLM) released a corpus of more than 27 million records with medical article metadata: ftp://ftp.ncbi.nlm.nih.gov/pubmed/baseline/.

Each record contains the article's abstract (a short paragraph summarizing the article, typically ~1k characters), its authors, title, affiliation, a list of article topics including keywords and chemical formulas, year of publication, etc.

Add this PubMed corpus to gensim-data, including pre-trained semantic models on this data.

License instructions are here: ftp://ftp.ncbi.nlm.nih.gov/pubmed/updatefiles/README.txt (read carefully), along with the full metadata schema (DTD).
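For anyone picking this up, here is a minimal sketch of pulling the title and abstract out of PubMed records with the standard library. The element names (`PubmedArticle`, `MedlineCitation`, `ArticleTitle`, `AbstractText`) are my reading of the DTD and should be checked against it; the `SAMPLE` document and the `iter_records` helper are made up for illustration and are not real PubMed data or part of gensim.

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in for a PubMed file (NOT real data); the real
# baseline files are much larger and would need to be streamed.
SAMPLE = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>1</PMID>
      <Article>
        <ArticleTitle>Example title</ArticleTitle>
        <Abstract>
          <AbstractText>First part.</AbstractText>
          <AbstractText>Second part.</AbstractText>
        </Abstract>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>2</PMID>
      <Article>
        <ArticleTitle>No abstract here</ArticleTitle>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""


def iter_records(xml_text):
    """Yield one dict per article; the abstract is '' when the record has none."""
    root = ET.fromstring(xml_text)
    for article in root.iter("PubmedArticle"):
        citation = article.find("MedlineCitation")
        parts = citation.findall("Article/Abstract/AbstractText")
        yield {
            "pmid": citation.findtext("PMID"),
            "title": citation.findtext("Article/ArticleTitle", default=""),
            # An abstract may be split into several AbstractText sections;
            # itertext() also picks up text nested in inline markup.
            "abstract": " ".join("".join(p.itertext()).strip() for p in parts),
        }


records = list(iter_records(SAMPLE))
```

The resulting abstracts could then be tokenized and fed to a gensim model; for the full corpus, `ET.iterparse` would be the safer choice to keep memory bounded.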

piskvorky commented 6 years ago

Another related resource, the PubMed Central dataset: https://www.ncbi.nlm.nih.gov/pmc/tools/openftlist/ (found via http://deepdive.stanford.edu/opendata/)

Unlike the metadata above, this (smaller) dataset also contains the article full texts.

Around 360,000 medical articles with full text in total.

piskvorky commented 6 years ago

Another related free (non-commercial use) biomedical corpus, including full text: https://old.biomedcentral.com/about/datamining

philgooch commented 6 years ago

There's a bunch of word2vec models trained on PubMed data here, and these work well in gensim:

These are all unigram models though iirc
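A rough sketch of how models in the word2vec format can be loaded in gensim, using `KeyedVectors.load_word2vec_format`. The file name in the usage comment and the two helper functions are hypothetical, and guessing the binary flag from a `.bin` extension is only an assumption about how a given model file is packaged.

```python
from pathlib import Path


def is_binary_format(path):
    """Guess the serialization from the extension (assumption: .bin means binary)."""
    return Path(path).suffix.lower() == ".bin"


def load_pubmed_vectors(path):
    """Load pretrained vectors stored in the word2vec format."""
    # Imported here so the helper above can be used without gensim installed.
    from gensim.models import KeyedVectors

    return KeyedVectors.load_word2vec_format(path, binary=is_binary_format(path))


# Usage (hypothetical file name, not one of the published models):
# vectors = load_pubmed_vectors("pubmed-w2v.bin")
# print(vectors.most_similar("aspirin", topn=5))
```

Since the models are unigram, lookups only work for single tokens; multi-word terms would have to be split before querying.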

menshikh-iv commented 6 years ago

@philgooch thanks for the links! Do you have any license information for them (can we add them to gensim-data and "re-distribute")?

Imshepherd commented 6 years ago

training in R. https://github.com/Imshepherd/wordVectors-R-PubMed-Resourse

philgooch commented 6 years ago

@menshikh-iv The first set of models at http://evexdb.org/pmresources/vec-space-models/ are CC-BY (see http://bio.nlplab.org/#license)

I'm waiting to hear back from the authors about the license for the other ones; I'll let you know as soon as I hear.

menshikh-iv commented 6 years ago

@philgooch great, we'll wait too :+1:

philgooch commented 6 years ago

@menshikh-iv I just heard back from Billy Chiu, who developed the models at

https://github.com/cambridgeltl/BioNLP-2016

He's just updated the ReadMe there to confirm that the models at https://drive.google.com/open?id=0BzMCqpcgEJgiUWs0ZnU0NlFTam8 are also made available under CC BY