piskvorky opened this issue 6 years ago
Another related resource, the PubMed Central dataset: https://www.ncbi.nlm.nih.gov/pmc/tools/openftlist/ (found via http://deepdive.stanford.edu/opendata/)
Unlike the PubMed baseline metadata, this (smaller) dataset also contains the article full texts.
Around 360,000 medical articles with full text in total.
Another related free (non-commercial use) biomedical corpus, also including full text: https://old.biomedcentral.com/about/datamining
There's a bunch of word2vec models trained on PubMed data at http://evexdb.org/pmresources/vec-space-models/ and https://github.com/cambridgeltl/BioNLP-2016, and these work well in gensim.
These are all unigram models though, IIRC.
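For anyone who wants to try them, here is a minimal sketch of loading one of those pre-trained binaries into gensim and querying it. The filename is just an example of what you might download from the links above; adjust it to whichever model file you grab.

```python
# Minimal sketch: load a pre-trained PubMed word2vec binary into gensim.
# "PubMed-w2v.bin" is an example filename (assumed), not a bundled resource.
from gensim.models import KeyedVectors

wv = KeyedVectors.load_word2vec_format("PubMed-w2v.bin", binary=True)

# sanity check: nearest neighbours of a domain term
print(wv.most_similar("protein", topn=5))
```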
@philgooch thanks for the links! Do you have any license information about them (i.e. can we add them to gensim-data and "re-distribute" them)?
There are also pre-trained PubMed word vectors, plus code for training in R: https://github.com/Imshepherd/wordVectors-R-PubMed-Resourse
@menshikh-iv The first set of models at http://evexdb.org/pmresources/vec-space-models/ are CC-BY (see http://bio.nlplab.org/#license).
I'm waiting to hear back from the authors about the license for the other ones; I'll let you know as soon as I hear.
@philgooch great, we'll wait too :+1:
@menshikh-iv I just heard back from Billy Chiu, who developed the models at https://github.com/cambridgeltl/BioNLP-2016.
He's just updated the README there to confirm that the models at https://drive.google.com/open?id=0BzMCqpcgEJgiUWs0ZnU0NlFTam8 are also made available under CC BY.
The National Library of Medicine (NLM) released a corpus of more than 27 million records with medical article metadata: ftp://ftp.ncbi.nlm.nih.gov/pubmed/baseline/.
Each record contains the article's abstract (a short summary paragraph, typically ~1k characters), its authors, title, affiliation, a list of article topics (including keywords and chemical formulas), year of publication, etc.
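For context on what those baseline files contain, here is a minimal sketch of streaming titles and abstracts out of a single gzipped baseline file. The filename is only an example; the element names assume the standard MEDLINE/PubMed DTD.

```python
# Minimal sketch: pull PMIDs, titles and abstracts out of one PubMed baseline file.
# The filename is an example; pick any file from ftp://ftp.ncbi.nlm.nih.gov/pubmed/baseline/
import gzip
import xml.etree.ElementTree as ET

def iter_abstracts(path):
    """Yield (pmid, title, abstract) tuples from a pubmed*.xml.gz baseline file."""
    with gzip.open(path, "rb") as fh:
        for _, elem in ET.iterparse(fh, events=("end",)):
            if elem.tag == "PubmedArticle":
                pmid = elem.findtext(".//PMID")
                title = elem.findtext(".//ArticleTitle")
                # abstracts may be split into several labelled AbstractText sections
                abstract = " ".join(t.text for t in elem.iter("AbstractText") if t.text)
                if abstract:
                    yield pmid, title, abstract
                elem.clear()  # keep memory bounded; the files are large

for pmid, title, abstract in iter_abstracts("pubmed18n0001.xml.gz"):
    print(pmid, title, len(abstract))
    break
```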
Add this PubMed corpus to gensim-data, including pre-trained semantic models on this data. License instructions are here: ftp://ftp.ncbi.nlm.nih.gov/pubmed/updatefiles/README.txt (read carefully), along with the full metadata schema (DTD).
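Once the corpus is published via gensim-data, consumption would presumably go through the usual gensim downloader API. A hypothetical sketch follows; the dataset name "pubmed" is a placeholder, not an existing entry.

```python
# Hypothetical usage once the corpus is added to gensim-data.
import gensim.downloader as api

# list the corpora gensim-data currently ships
print(sorted(api.info()["corpora"]))

# "pubmed" is a placeholder name for this proposed dataset
corpus = api.load("pubmed")
for record in corpus:
    print(record)
    break
```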