materialsintelligence / mat2vec

Supplementary Materials for Tshitoyan et al. "Unsupervised word embeddings capture latent knowledge from materials science literature", Nature (2019).
MIT License

Prediction of (new) thermoelectric materials #15

Closed anita-clmnt closed 4 years ago

anita-clmnt commented 4 years ago

First of all thank you so much for sharing all this! I found the paper and the associated results very exciting!

I tried to reproduce Fig. 2a using your pre-trained model. I first printed the tokens most similar to "thermoelectric" (highest cosine similarity). Then I used one of your processing functions (in the process script) to keep only "simple chemical formulae". And finally, as mentioned in the paper, I removed the formulae appearing fewer than 3 times in the corpus.

However, I ended up with a lot of noise in my list compared to yours. I got the same first 2 predictions, but then formulae like Bi20Se3Te27 or SSe4Sn5 also appeared in my top 10. To give you an idea of the amount of noise: PbTe, which is 3rd in your list, is 92nd in mine.

So what am I missing?

Thank you in advance! Anita

vtshitoyan commented 4 years ago

Hi Anita, sorry for the late response. Your method sounds right, except that we used output embeddings for materials. One way you could achieve this is to get the word embedding vector for "thermoelectric" and find the words most similar to this vector among the output embeddings. This link might be useful: https://stackoverflow.com/questions/42554289/how-can-i-access-output-embeddingoutput-vector-in-gensim-word2vec Also make sure you are using the normalized output embeddings. Hope this helps!
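[Editor's note] To illustrate the ranking step Vahe describes (this is not the mat2vec code itself): with gensim's negative-sampling models the output embeddings live in a separate matrix (`model.syn1neg` in recent gensim; the attribute name varies across versions), and the ranking is just cosine similarity between the input vector for "thermoelectric" and the row-normalized output matrix. A minimal numpy sketch with toy data:

```python
import numpy as np

# Toy stand-in data: rows of `output_emb` play the role of the model's
# output embeddings (e.g. gensim's model.syn1neg under negative sampling).
rng = np.random.default_rng(0)
output_emb = rng.normal(size=(5, 8))

# Query vector standing in for the word vector of "thermoelectric";
# deliberately close to row 0 so the expected ranking is clear.
query = output_emb[0] + 0.01 * rng.normal(size=8)

# Normalize the output embeddings row-wise, then rank by cosine similarity.
norm_out = output_emb / np.linalg.norm(output_emb, axis=1, keepdims=True)
sims = norm_out @ (query / np.linalg.norm(query))
ranking = np.argsort(-sims)  # indices of most-similar rows, best first
```

With a real model, `query` would be `model.wv["thermoelectric"]` and `output_emb` the model's output-embedding matrix; the top-ranked indices map back to words via the model's vocabulary.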

anita-clmnt commented 4 years ago

Hi Vahe, thank you for your response! I finally figured out how to get the same list; the link was very useful! My first mistake was to use "thermoelectric" itself instead of its output embedding in the most_similar function. After that, I was also keeping formulae with an occurrence of exactly 3, but it seems I should have kept only those with an occurrence greater than 3. Thank you again for your help! Anita
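[Editor's note] The occurrence filter Anita describes is just a threshold on the model's vocabulary counts (in gensim 4.x these are available per word via `model.wv.get_vecattr(word, "count")`; older versions expose them through `model.wv.vocab`). A sketch with made-up similarity results and counts:

```python
# Hypothetical most_similar results and vocabulary counts; in practice
# the counts come from the trained model's vocabulary.
results = [("PbTe", 0.75), ("Bi20Se3Te27", 0.74), ("SnSe", 0.72), ("SSe4Sn5", 0.70)]
counts = {"PbTe": 1523, "Bi20Se3Te27": 2, "SnSe": 847, "SSe4Sn5": 3}

# Keep only formulae appearing more than 3 times in the corpus.
filtered = [(w, s) for w, s in results if counts.get(w, 0) > 3]
```

Note the strict `> 3` comparison, which matches the fix Anita describes: formulae with exactly 3 occurrences are dropped.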

iitklokesh commented 4 years ago

Hello! I am doing a similar study for my master's project. I used output embeddings, but there is still some noise. I have not applied any occurrence threshold. I am very new to this; could you kindly help me reproduce the exact list? Also, I wanted to learn how two keywords ("key1" + "key2") can be used, where key1 = thermoelectric and key2 = "any specific structure"?

jdagdelen commented 4 years ago

Hi Lokesh,

I believe you can just not supply a negative word, like so:

```python
w2v_model.wv.most_similar(
    positive=["thermoelectric", "perovskite"],
    topn=1)
```
iitklokesh commented 4 years ago

Hi John, thank you so much for the response. I have done the same, but there is still noise. Also, I wanted to know how the list was filtered by number of occurrences; Vahe mentioned removing formulae appearing fewer than three times.

jdagdelen commented 4 years ago

Can you clarify what you mean by noisy data? You may also want to refer to the Gensim documentation on the different methods for finding similar sets of words, as there might be better functions for your needs.

https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.Word2VecKeyedVectors

iitklokesh commented 4 years ago

Hi John. 1st issue: I am not able to reproduce the exact list from the paper; many other chemical formulas appear in between the formulas mentioned in the paper.

2nd issue (two keywords): Using this,

```python
from gensim.models import Word2Vec

w2v_model = Word2Vec.load("mat2vec/training/models/pretrained_embeddings")
mylist = w2v_model.wv.most_similar(positive=['thermoelectric', 'perovskite'], topn=10)
mylist
```

I am getting this output:

```
[('thermoelectric_properties', 0.7222813367843628),
 ('perovskites', 0.7098876237869263),
 ('Ca100Co131O312', 0.6918851137161255),
 ('Mo150Ni3Sb270Te80', 0.6901201009750366),
 ('thermoelectrics', 0.6884835362434387),
 ('MoO6Sr2Ti', 0.6838506460189819),
 ('CoMoO12Sr4Ti2', 0.682638943195343),
 ('La4Mn5O15Tb', 0.6811503171920776),
 ('Ba4InO12YbZr2', 0.6792004108428955),
 ('Mg2(Si,Sn)', 0.6788942813873291)]
```

I want to remove the non-material results (thermoelectric_properties, perovskites, thermoelectrics) and keep only the chemical formulas, like in the paper.

jdagdelen commented 4 years ago

1st issue: Are you using the provided pretrained word embeddings or are you training on your own corpus? Can you provide code examples of how you are doing the search so we can help you debug?

2nd issue: To filter out non-material embeddings we compare the embeddings to our list of materials built using Named Entity Recognition and a rule-based parsing tool. However, it would probably not be too hard to build and train a classifier that filters out non-material embeddings using the word embeddings as input. I'm sorry we haven't made the entire pipeline of tools available to the public yet. Olga Kononova will be publishing a paper soon on the rule-based parser, and at that point we can make it public. (Sorry, I was confused. This is how we're doing it now, but for the study that this repo supports we just used a simple parser based on the pymatgen Composition object.)

2nd issue: You can use process.is_simple_formula

iitklokesh commented 4 years ago

Hi John,

1st issue: pretrained_embeddings downloaded from the README.md link.

```python
from gensim.models import Word2Vec

w2v_model = Word2Vec.load("mat2vec/training/models/pretrained_embeddings")
mylist = w2v_model.wv.most_similar(positive=['thermoelectric'], topn=10)
mylist
```

Output:

```
[('thermoelectrics', 0.8435688018798828),
 ('thermoelectric_properties', 0.8339031934738159),
 ('thermoelectric_power_generation', 0.7931368947029114),
 ('thermoelectric_figure_of_merit', 0.7916494607925415),
 ('seebeck_coefficient', 0.7753844857215881),
 ('thermoelectric_generators', 0.7641353011131287),
 ('figure_of_merit_ZT', 0.7587920427322388),
 ('thermoelectricity', 0.7515754699707031),
 ('Bi2Te3', 0.7480159997940063),
 ('thermoelectric_modules', 0.7434878945350647)]
```

The same problem of getting non-materials; I am trying to remove these. Thank you for the help!

vtshitoyan commented 4 years ago

@iitklokesh

  1. The vocabulary of a gensim model stores word counts, so you can write a simple piece of code to filter the results based on a count threshold. E.g. see this stackoverflow post: https://stackoverflow.com/questions/37190989/how-to-get-vocabulary-word-count-from-gensim-word2vec
  2. The paper does not use NER to filter out non-materials; it uses a simple function available in process.py: https://github.com/materialsintelligence/mat2vec/blob/a4ae89d519e5478a777e81de1ffb4b7c5606b9ea/mat2vec/processing/process.py#L265

Hope this helps,
Vahe
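[Editor's note] For readers without mat2vec installed, the effect of a simple-formula filter can be approximated with a crude regex. This is NOT the actual logic of the function in process.py (which relies on pymatgen's Composition and applies additional checks); it is only an illustration of filtering most_similar results down to formula-like tokens:

```python
import re

# Crude stand-in for a simple-formula check: a token built entirely from
# element-like symbols (capital letter + optional lowercase letter) with
# optional digits, containing at least two such symbols. Unlike the real
# mat2vec check, this does not validate symbols against the periodic table.
SIMPLE_FORMULA = re.compile(r"(?:[A-Z][a-z]?\d*){2,}")

def looks_like_formula(token):
    return SIMPLE_FORMULA.fullmatch(token) is not None

# Filter hypothetical most_similar tokens down to formula-like ones.
candidates = ["Bi2Te3", "thermoelectrics", "PbTe", "seebeck_coefficient"]
formulas = [t for t in candidates if looks_like_formula(t)]
```

In practice you would run this kind of check over the words returned by `most_similar` and keep only those that pass, then apply the occurrence-count threshold on top.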
iitklokesh commented 4 years ago

Thank You so much Vahe Sir!

jdagdelen commented 4 years ago

@iitklokesh Sorry, I was confused. This is how we're doing it now but for the study that this repo supports we just used a simple parser based on the pymatgen Composition object. Note that the simple parsing method won't work for words/phrases like "lithium chloride" or "LMNCO".

iitklokesh commented 4 years ago

Thank you, John! I am using normalization for those cases.