piskvorky / gensim

Topic Modelling for Humans
https://radimrehurek.com/gensim
GNU Lesser General Public License v2.1

Train Word2Vec in multiple batches #1030

Closed iliaschalkidis closed 7 years ago

iliaschalkidis commented 7 years ago

Hi, I am looking to train a word2vec model on a vast corpus of documents (approximately 250GB). The machine that will host the experiments has 128GB of RAM, so it is impossible to train the model on the whole corpus at once.

I had the same issue in the past with less data and found this blog post (http://rutumulkar.com/blog/2015/word2vec), which suggested a solution, but that functionality was not part of the main distribution in version 0.12.4.

I noticed in the gensim code that in the current version, 0.13.3, the build_vocab() function supports the update parameter.

So I wrote a piece of code like this:

from gensim.models import Word2Vec

# LOAD FIRST BATCH/FOLDER
loader = DataLoader(folder=folder)
sentences = loader.load_corpus()
# TRAIN INITIAL MODEL
model = Word2Vec(min_count=20, workers=16, size=emb_size, sg=1, negative=5, window=window)
model.build_vocab(sentences)
model.train(sentences)
print('vocabulary size:', len(model.index2word))
sentences = None
# LOAD SECOND BATCH/FOLDER
loader = DataLoader(parent_folder=folder2)
sentences = loader.load_corpus()
# UPDATE VOCABULARY AND CONTINUE TRAINING
model.build_vocab(sentences, update=True)
model.train(sentences)
print('vocabulary size:', len(model.index2word))
# SAVE VECTORS IN WORD2VEC BINARY FORMAT
model.save_word2vec_format(filename, binary=True)

Running with 2 folders (16 documents, 9 documents), I got the following output:


vocabulary size: 473
vocabulary size: 482

I have some questions based on the above, beginning with the second round of training, where build_vocab() is called with update=True; its signature is:

def build_vocab(self, sentences, keep_raw_vocab=False, trim_rule=None, progress_per=10000, update=False):

tmylk commented 7 years ago

Hi @KiddoThe2B,

Such an open question is better suited for the mailing list.

tmylk commented 7 years ago

Also see https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/online_w2v_tutorial.ipynb

gojomo commented 7 years ago

There's no need for the vocabulary-expansion (kinda-but-not-really 'online') 'update' feature here. Gensim doesn't require all documents to be in RAM - just a corpus that is iterable and can thus present all its examples each time Word2Vec needs them (once for build_vocab(), then again iter times for training). You should be able to change or replace your DataLoader class so that it only streams examples from disk; that will be the most memory-efficient approach and will also give the best vectors (by not confining some examples/words to training only early on and then being 'diluted' by the later 'update').
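
For illustration, a streamed corpus can simply be an object whose __iter__() re-reads the files on every pass. The sketch below is an assumption about what such a class might look like (the folder layout, whitespace tokenization and placeholder hyperparameter values are not from this thread); the gensim calls follow the 0.13.x API used above:

import os
from gensim.models import Word2Vec

class FolderSentences(object):
    """Stream tokenized sentences from disk, one at a time, so nothing is held in RAM."""
    def __init__(self, folder):
        self.folder = folder

    def __iter__(self):
        # Word2Vec iterates this once for build_vocab() and once per training pass
        for fname in sorted(os.listdir(self.folder)):
            with open(os.path.join(self.folder, fname)) as fin:
                for line in fin:
                    yield line.split()  # naive whitespace tokenization

sentences = FolderSentences('/path/to/corpus')  # hypothetical path covering all the data
model = Word2Vec(sentences, min_count=20, workers=16, size=200, sg=1, negative=5, window=5)
model.save_word2vec_format('vectors.bin', binary=True)

Because __iter__() starts from the beginning each time it is called, the same object can also be passed to build_vocab() and train() separately, as in the code in the issue description.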

iliaschalkidis commented 7 years ago

So @gojomo, you suggest building a dictionary object from the whole corpus on my own, then calling build_vocab() once, and finally calling train() multiple times with different data from disk?

piskvorky commented 7 years ago

@KiddoThe2B see https://rare-technologies.com/data-streaming-in-python-generators-iterators-iterables/ and https://rare-technologies.com/word2vec-tutorial/.

Like Lev said, not a bug and discussion more suited for the mailing list.

gojomo commented 7 years ago

@KiddoThe2B - No, make your corpus stream item-by-item from disk as an 'Iterable' object. Ask on list if you need more clarification.

iliaschalkidis commented 7 years ago

I read all the suggested links, then simply renamed load_corpus() to __iter__() and replaced sentences.append(sentence) with yield sentence in my DataLoader class, and it works like a charm!
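
To make that concrete, the change amounted to roughly the sketch below; the file-reading details are assumptions for illustration, not the actual DataLoader code:

import os

class DataLoader(object):
    def __init__(self, folder):
        self.folder = folder

    def __iter__(self):  # formerly load_corpus(), which built and returned a list
        for fname in os.listdir(self.folder):
            with open(os.path.join(self.folder, fname)) as fin:
                for line in fin:
                    yield line.split()  # formerly: sentences.append(sentence)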

So the problem is solved and everything seems fine. I should have asked for support on the mailing list instead; I'll keep that in mind for the future.

Last question: does lazy loading from iterable objects affect Word2Vec's efficiency? I checked the vocabulary and it is the same, but what about training? I suppose not, if I understand it correctly...

Thank you all! I found the solution, and along the way I got a better understanding of generators and iterators in Python :)

piskvorky commented 7 years ago

No problem. Re. your question: use the mailing list.