materialsintelligence / mat2vec

Supplementary Materials for Tshitoyan et al. "Unsupervised word embeddings capture latent knowledge from materials science literature", Nature (2019).
MIT License

My corpus is too big to be put in one large file #23

Open shikharsingla opened 4 years ago

shikharsingla commented 4 years ago

My corpus is too big to fit in one large file; my computer runs out of memory when I try.

Is it possible to run this code on multiple files, or to feed it the corpus through an iterator?

jdagdelen commented 3 years ago

Yes, this should be possible in principle. See https://stackoverflow.com/questions/63459657/how-to-load-large-dataset-to-gensim-word2vec-model for how to load a large dataset into a gensim Word2Vec model.

However, most corpora should normally not be more than a few GB. Are you sure your corpus is too large to be loaded into memory? How much memory does your computer have?
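
For reference, a minimal sketch of the iterator approach asked about above. It assumes the corpus is split into plain-text shard files with one whitespace-tokenized sentence per line; the class name `MultiFileCorpus` and the shard file names are hypothetical and not part of mat2vec. The point is that the object is restartable: each `for` loop over it re-reads the files from disk, so only one line is held in memory at a time.

```python
import os
import tempfile


class MultiFileCorpus:
    """Hypothetical restartable iterable: yields one token list per line, file by file."""

    def __init__(self, paths):
        self.paths = list(paths)

    def __iter__(self):
        # Each call starts a fresh pass over the files, so the corpus can be
        # scanned several times (e.g. vocabulary pass, then training epochs).
        for path in self.paths:
            with open(path, encoding="utf-8") as handle:
                for line in handle:
                    tokens = line.split()
                    if tokens:  # skip blank lines
                        yield tokens


# Tiny demonstration with two temporary shard files (made-up content).
tmp_dir = tempfile.mkdtemp()
shard_texts = [
    "LiFePO4 is a cathode\n\nthermoelectric materials\n",
    "band gap of GaN\n",
]
shard_paths = []
for index, text in enumerate(shard_texts):
    path = os.path.join(tmp_dir, f"shard_{index}.txt")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(text)
    shard_paths.append(path)

corpus = MultiFileCorpus(shard_paths)
first_pass = list(corpus)   # materialized here only for the demo
second_pass = list(corpus)  # a second pass yields the same sentences
```

With gensim installed, an object like this can be passed as the `sentences` argument of `gensim.models.Word2Vec`, which needs an iterable it can traverse more than once (a plain generator would be exhausted after the first pass). gensim also provides `PathLineSentences`, which streams every file in a directory in the same line-per-sentence format. Whether mat2vec's own training script accepts either without modification is not confirmed here.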