materialsintelligence / mat2vec

Supplementary Materials for Tshitoyan et al. "Unsupervised word embeddings capture latent knowledge from materials science literature", Nature (2019).
MIT License

Request for a step by step document on how to run the code #17

Closed. webservicereco closed this issue 3 years ago.

webservicereco commented 4 years ago

Would you be kind enough to share a step-by-step document on how to run the code from https://github.com/materialsintelligence/mat2vec using a Jupyter Notebook in an Anaconda3 environment, on a laptop with CPU only (i.e., without a GPU)? I am running Python 3.7.3 on Jupyter Notebook 6.0.2 in Anaconda3, with TensorFlow 1.15.0 and Keras 2.2.4.

(i) I have installed all the packages listed in requirements.txt, including ChemDataExtractor, but I am running into issues installing "molsets". Any guidance on that?

(ii) I'd like to know which .py file(s) exactly to run, or what sequence of .py files to run, and any other tips. For example, there are setup.py, process.py, test_process.py, phrase2vec.py, etc. Assuming I simply want to run the model and get the output from a Jupyter Notebook in an Anaconda3 environment, what exactly do I have to run, and in what order?

Once I am able to run it on my laptop, I will attempt to run it in Colab. Thanks.

jdagdelen commented 4 years ago

Hi,

We have step-by-step installation instructions in the README for this repo, which can be found at https://github.com/materialsintelligence/mat2vec. Note that we do not support Python 3.7 yet, only 3.6, which may be the source of your problem.

Could you clarify what you'd like to do? Are you trying to train your own word embeddings on your own corpus or would you just like to load them and use them for another ML application?

Our word embeddings are based on Gensim Word2Vec, and most of the functionality for using the pretrained word embeddings is contained in that library, not in this repo. You can find a tutorial on Gensim word2vec here: https://radimrehurek.com/gensim/models/word2vec.html
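In practice, loading and querying the pretrained embeddings only takes a few lines of gensim. Here is a minimal sketch, assuming you have cloned the repo and the pretrained model sits at mat2vec/training/models/pretrained_embeddings (check the README for the exact path in your checkout):

```python
from gensim.models import Word2Vec

# Load the pretrained mat2vec embeddings shipped with the repo.
# The path below is assumed from the repo layout; adjust if yours differs.
w2v_model = Word2Vec.load("mat2vec/training/models/pretrained_embeddings")

# From here on, everything is the standard gensim Word2Vec API.
print(w2v_model.wv.most_similar("thermoelectric", topn=5))

# Similarity queries work for any two words in the vocabulary.
print(w2v_model.wv.similarity("LiCoO2", "battery"))
```

Everything after the load call is plain gensim, so the documentation linked above covers the rest.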

giotre commented 4 years ago

Hi, I would like to use your method for processing abstracts from the scientific literature, exactly as you did. I was wondering if you could upload a step-by-step tutorial covering the "initial" part (downloading the abstracts, processing them, and classifying them). Thank you in advance and best regards.

Tylersuard commented 4 years ago

@giotre Are you doing this for COVID-related work? If so, there is a dataset on Kaggle that includes thousands of COVID-related papers, and one of the columns is "abstracts". I was able to take that column, do the NLP processing on my own (removing stop words, stemming, and removing non-letter characters), and then feed it into Mat2vec.
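For anyone who wants to reproduce that kind of cleanup, here is a minimal sketch using only gensim's parsing utilities (gensim is already a mat2vec dependency); the example sentence is made up:

```python
import re

from gensim.parsing.porter import PorterStemmer
from gensim.parsing.preprocessing import remove_stopwords

stemmer = PorterStemmer()

def clean_abstract(text):
    """Lowercase, strip non-letter characters, drop stop words, stem."""
    text = re.sub(r"[^a-zA-Z\s]", " ", text).lower()
    text = remove_stopwords(text)
    return [stemmer.stem(token) for token in text.split()]

print(clean_abstract("Novel coronavirus inhibitors were screened in 2020."))
```

Note that for materials-science text specifically, the mat2vec repo ships its own MaterialsTextProcessor (see the README), which handles chemical formulae far more carefully than generic stemming does.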

giotre commented 4 years ago

@Tylersuard Actually, I'm doing it for materials, exactly as they did in the related paper. So I would need to figure out, for instance, how to download abstracts (elsapy?), how to store them (do I have to store them somewhere?), how to select relevant abstracts, and the other steps that come before the NLP and Mat2Vec stages.

Tylersuard commented 4 years ago

The elsapy repo looks good. I would store the downloaded abstracts as a .csv or .json file, and if you have enough hard-drive space you can just keep them on your computer. Alternatively, you can run everything in a Kaggle notebook; Kaggle offers fairly generous storage for datasets.
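A rough sketch of that workflow, assuming you have an Elsevier API key; the query string and index name below are illustrative, so check the elsapy README and the Scopus documentation for the details:

```python
import json

from elsapy.elsclient import ElsClient
from elsapy.elssearch import ElsSearch

# Authenticate against the Elsevier APIs with your own key.
client = ElsClient("YOUR_API_KEY")

# Search Scopus; the query below is just an example.
doc_search = ElsSearch("TITLE-ABS-KEY(thermoelectric)", "scopus")
doc_search.execute(client, get_all=False)

# Persist the raw results locally so you only hit the API once.
with open("abstracts.json", "w") as f:
    json.dump(doc_search.results, f, indent=2)

print("Saved {} records.".format(len(doc_search.results)))
```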

If you want to download from arXiv.org, you can use the arXiv API: https://arxiv.org/help/api/
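A minimal sketch of querying that API with just the Python standard library (the search term is only an example; the API returns an Atom feed):

```python
import urllib.request
import xml.etree.ElementTree as ET

# Query the public arXiv API for a handful of matching papers.
url = ("http://export.arxiv.org/api/query"
       "?search_query=all:thermoelectric&start=0&max_results=5")
with urllib.request.urlopen(url) as response:
    feed = response.read()

# The response is an Atom XML feed; pull out titles and abstracts.
ns = {"atom": "http://www.w3.org/2005/Atom"}
root = ET.fromstring(feed)
for entry in root.findall("atom:entry", ns):
    title = entry.find("atom:title", ns).text.strip()
    abstract = entry.find("atom:summary", ns).text.strip()
    print(title, "--", abstract[:80])
```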

To parse full PDF papers into abstracts and other sections: https://github.com/allenai/science-parse

giotre commented 4 years ago

@Tylersuard Thank you so much for your answers. Since I'm new to this topic, do you know whether there are really "for dummies" step-by-step tutorials out there? (I mean practically from the API key onwards).

Tylersuard commented 4 years ago

@giotre You might ask the creators of this repo if they can share their dataset with you. Otherwise, here is another repo you might check out: https://github.com/ronentk/sci-paper-miner

You might also be able to hire someone on Upwork or Fiverr to construct a similar dataset for you.