materialsintelligence / mat2vec

Supplementary Materials for Tshitoyan et al. "Unsupervised word embeddings capture latent knowledge from materials science literature", Nature (2019).
MIT License

Formatting Abstracts #4

Closed: emielke12 closed this issue 5 years ago

emielke12 commented 5 years ago

Is there any special text formatting that needs to be done to abstracts before training? I noticed the corpus example has % and <nUm> in various places. Just wondering if formatting matters at all, or if you can dump the plain text from abstracts into a corpus file.

vtshitoyan commented 5 years ago

In principle, you don't have to do any special formatting. However, in the original paper, we used some pre-processing to reduce the size of the vocabulary and improve tokenization. This is beneficial if the text you are dealing with has to do with materials science/chemistry. You can use the process method of MaterialsTextProcessor (in mat2vec.processing) on each abstract, then join the tokens back together and write the result to the text corpus file. Let me know if this answers your question and I will close the issue.
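The workflow described above (tokenize and normalize each abstract, join the tokens back into a string, write the strings to a corpus file) can be sketched roughly as follows. This is a simplified stand-in that uses only the standard library. It is not mat2vec's own processing. The helper names preprocess_abstract and build_corpus, the regular expressions, and the one-abstract-per-line layout are assumptions made for illustration. Only the <nUm> placeholder comes from the thread.

```python
import re

# Hypothetical, simplified tokenizer: decimals first, then word-like runs
# (so "LiCoO2" and "2D" stay whole), then single punctuation characters.
TOKEN = re.compile(r"\d+\.\d+|\w+|[^\w\s]")
# A token counts as a number only if it is made of digits, optionally with a decimal part.
NUMBER = re.compile(r"\d+(?:\.\d+)?")


def preprocess_abstract(text: str) -> str:
    """Tokenize one abstract, replace numbers with <nUm>, and join the tokens back."""
    tokens = TOKEN.findall(text)
    tokens = ["<nUm>" if NUMBER.fullmatch(t) else t for t in tokens]
    return " ".join(tokens)


def build_corpus(abstracts) -> str:
    """Assumed layout: one processed abstract per line."""
    return "\n".join(preprocess_abstract(a) for a in abstracts)


if __name__ == "__main__":
    sample = [
        "The efficiency was 12.5 % at 300 K.",
        "LiCoO2 is a 2D material",
    ]
    print(build_corpus(sample))
```

With mat2vec installed, the body of preprocess_abstract would presumably be replaced by a call to the library's process method, which also handles materials-specific normalization that this sketch does not attempt.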

emielke12 commented 5 years ago

Yes, this answers my question. Thanks!