materialsintelligence / mat2vec

Supplementary Materials for Tshitoyan et al. "Unsupervised word embeddings capture latent knowledge from materials science literature", Nature (2019).
MIT License

Script to fetch cleaned abstracts #22

Closed by leotam 3 years ago

leotam commented 4 years ago

I noticed you've kindly provided the DOIs, but simply fetching each one returns the article as raw HTML. Could you recommend a way to grab the cleaned abstracts the way you did? You also mentioned several dataset splits that were quite influential on the final result.

jdagdelen commented 3 years ago

Hi,

Sorry this issue went unanswered for so long. You can use the Scopus API together with the DOIs to retrieve most of the content. Unfortunately, we can't share dumps of our corpus and metadata due to copyright restrictions.
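
For anyone landing here later, the Scopus route could look roughly like the sketch below. It is not the authors' pipeline. It assumes you have your own Elsevier API key, that abstracts are requested by DOI from Elsevier's Abstract Retrieval endpoint, and that the abstract text sits under `abstracts-retrieval-response` / `coredata` / `dc:description` in the JSON response. The function names are hypothetical, and the whitespace collapsing is only a stand-in for whatever cleaning the authors actually applied.

```python
import json
import urllib.parse
import urllib.request

# Assumption: Elsevier's Abstract Retrieval endpoint, looked up by DOI.
ABSTRACT_URL = "https://api.elsevier.com/content/abstract/doi/"


def build_request(doi, api_key):
    """Build (but do not send) the HTTP request for a single DOI."""
    url = ABSTRACT_URL + urllib.parse.quote(doi, safe="/")
    headers = {"X-ELS-APIKey": api_key, "Accept": "application/json"}
    return urllib.request.Request(url, headers=headers)


def extract_abstract(payload):
    """Return the abstract from a decoded JSON response, or None if absent.

    Assumption: the text is stored under coredata -> dc:description.
    Collapsing whitespace is a placeholder, not the authors' cleaning step.
    """
    core = payload.get("abstracts-retrieval-response", {}).get("coredata", {})
    text = core.get("dc:description")
    return " ".join(text.split()) if text else None


def fetch_abstract(doi, api_key):
    """Send the request and return the abstract text (requires network access)."""
    with urllib.request.urlopen(build_request(doi, api_key), timeout=30) as resp:
        return extract_abstract(json.load(resp))
```

Looping `fetch_abstract` over the DOI list published in the repository would give one abstract per record where Scopus has it. Expect some DOIs to return nothing, and check the rate limits and terms attached to your API key.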