materialsintelligence / mat2vec

Supplementary Materials for Tshitoyan et al. "Unsupervised word embeddings capture latent knowledge from materials science literature", Nature (2019).
MIT License

About the final word embeddings. #27

Open hasan-sayeed opened 2 years ago

hasan-sayeed commented 2 years ago

I just trained a model on my own corpus. The corpus contains space group numbers, which I replaced with 'Xx1, Xx2, ..., Xx229, Xx230' to avoid collisions with element names. But when I try to get the final embeddings from the model, it says some space group numbers (Xx105, Xx139, etc.) are not in the vocabulary, regardless of their frequency! Why is this happening? I've looked through the code and couldn't figure it out.

jdagdelen commented 2 years ago

Are you sure those tokens occur enough times in the corpus to get their own spots in the vocabulary? This repo uses Gensim's Word2Vec implementation, which builds the vocabulary by applying a min_count cutoff: any token that appears fewer than min_count times in the corpus is dropped. It could be that some of your special tokens don't occur frequently enough in your corpus to make the cut.
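As a quick sanity check (a minimal sketch; the file paths and token list are placeholders for your own), you can compare the raw token counts in your corpus against what actually made it into the trained model's vocabulary:

```python
from collections import Counter

from gensim.models import Word2Vec

# Count raw occurrences of each whitespace-separated token in the corpus.
with open("corpus.txt", encoding="utf-8") as f:
    counts = Counter(token for line in f for token in line.split())

model = Word2Vec.load("model")  # placeholder path to your trained model

for token in ["Xx105", "Xx139"]:
    # In Gensim 4.x the vocabulary lives in model.wv.key_to_index;
    # in Gensim 3.x (which this repo likely uses) check `token in model.wv.vocab`.
    in_vocab = token in model.wv.key_to_index
    print(f"{token}: {counts[token]} occurrences, in vocab: {in_vocab}")
```

If a token shows up well above min_count but still isn't in the vocab, the two sides are probably tokenizing it differently (e.g. attached punctuation or case differences).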

Docs for Gensim Word2Vec: https://radimrehurek.com/gensim/models/word2vec.html. Check out the min_count, max_vocab_size, and max_final_vocab parameters. You can also use trim_rule to ensure your special tokens are included in the final vocab.
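For example, a trim_rule that unconditionally keeps the space group tokens could look like this (a minimal sketch; the toy corpus is illustrative, and the Xx naming follows your substitution scheme):

```python
from gensim.models import Word2Vec
from gensim.utils import RULE_DEFAULT, RULE_KEEP

def keep_space_groups(word, count, min_count):
    # Always keep the special space group tokens (Xx1 ... Xx230);
    # fall back to Gensim's normal min_count behavior for everything else.
    if word.startswith("Xx") and word[2:].isdigit():
        return RULE_KEEP
    return RULE_DEFAULT

# Toy corpus: every token appears once, so with min_count=5 only the
# Xx token survives, courtesy of the trim_rule.
sentences = [["LiFePO4", "crystallizes", "in", "Xx62"]]
model = Word2Vec(sentences, min_count=5, trim_rule=keep_space_groups)
print(list(model.wv.key_to_index))  # Gensim 3.x: list(model.wv.vocab)
```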

hasan-sayeed commented 2 years ago

Yeah, those tokens occur more often than my --min_count. I tried using trim_rule as well, but the same thing happens. I also tried with different corpus files (basically deleting half of the data from the original file each time), and it seems to miss different words every time. I'm guessing something is wrong with my tokenization. That reminds me: when I run the preprocessing to get my corpus ready and then try to train the model on that file, it throws an error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position 165: invalid start byte. So I replaced the contents of the corpus_example file with my data, and then it ran fine. Could this be the issue? What can I do to solve this UnicodeDecodeError?

jdagdelen commented 2 years ago

Your issues seem to be specific to how Gensim is interacting with your corpus and how it builds the vocabulary, not necessarily tokenization. I think you may want to bring this question to the Gensim mailing list/support group: https://groups.google.com/g/gensim

jdagdelen commented 2 years ago

FWIW, I don't think the utf-8 decoding error is related, but to be sure, can you please confirm which version of Python you are using, and check that your corpus is encoded properly and doesn't contain any illegal characters?
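A side note for anyone hitting the same error: byte 0x96 is an en dash in the Windows-1252 (cp1252) encoding, which suggests the corpus file was saved as cp1252 rather than UTF-8. A minimal sketch for locating the offending lines and re-encoding the file (filenames are placeholders):

```python
# Report every line that is not valid UTF-8.
with open("corpus.txt", "rb") as f:
    for lineno, raw in enumerate(f, start=1):
        try:
            raw.decode("utf-8")
        except UnicodeDecodeError as err:
            print(f"line {lineno}: {err}")

# If the file really is cp1252, re-encode it to UTF-8 in one pass.
with open("corpus.txt", encoding="cp1252") as src, \
        open("corpus_utf8.txt", "w", encoding="utf-8") as dst:
    dst.write(src.read())
```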