piskvorky / gensim

Topic Modelling for Humans
https://radimrehurek.com/gensim
GNU Lesser General Public License v2.1

Train Word2Vec in multiple batches #1030

Closed iliaschalkidis closed 7 years ago

iliaschalkidis commented 7 years ago

Hi, I am looking to train a word2vec model on a vast corpus of documents (approximately 250GB). The machine that will host the experiments has 128GB of RAM, so it is impossible to train the model on the whole corpus at once.

I had the same issue in the past with less data and found this blog post (http://rutumulkar.com/blog/2015/word2vec), which suggested a solution, but that functionality was not part of the main distribution in version 0.12.4.

I noticed in the gensim code that in the current version, 0.13.3, the build_vocab() function supports the update parameter.

So I wrote a piece of code like this:

from gensim.models import Word2Vec

# LOAD FIRST BATCH/FOLDER
loader = DataLoader(folder=folder)
sentences = loader.load_corpus()
# TRAIN INITIAL MODEL
model = Word2Vec(min_count=20, workers=16, size=emb_size, sg=1, negative=5, window=window)
model.build_vocab(sentences)
model.train(sentences)
print('vocabulary size:', len(model.index2word))
sentences = None
# LOAD SECOND BATCH/FOLDER
loader = DataLoader(parent_folder=folder2)
sentences = loader.load_corpus()
# UPDATE VOCABULARY AND CONTINUE TRAINING
model.build_vocab(sentences, update=True)
model.train(sentences)
print('vocabulary size:', len(model.index2word))
# SAVE VECTORS IN WORD2VEC BINARY FORMAT
model.save_word2vec_format(filename, binary=True)

Running with 2 folders (16 documents, 9 documents), I got the following output:


vocabulary size: 473
vocabulary size: 482

I have some questions based on the above, beginning with the second round of training, where build_vocab() is called with update=True; its signature is:

def build_vocab(self, sentences, keep_raw_vocab=False, trim_rule=None, progress_per=10000, update=False):

tmylk commented 7 years ago

Hi @KiddoThe2B,

Such an open question is better suited for the mailing list.

tmylk commented 7 years ago

Also see https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/online_w2v_tutorial.ipynb

gojomo commented 7 years ago

There's no need for the vocabulary-expansion (kinda-but-not-really 'online') 'update' feature here. Gensim doesn't require all documents to be in RAM - just a corpus that is iterable and can thus present all its examples each time Word2Vec needs them (once for build_vocab(), then again iter times for training). You should be able to change or replace your DataLoader class so that it only streams examples from disk; that will be the most memory-efficient approach and will also give the best vectors (by not confining some examples/words to training only early on and then being 'diluted' by the later 'update').
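
For illustration, a streamed corpus can simply be an object whose __iter__() re-reads the files on every pass. The sketch below is an assumption about what such a class might look like (the folder layout, whitespace tokenization and placeholder hyperparameter values are not from this thread); the gensim calls follow the 0.13.x API used above:

import os
from gensim.models import Word2Vec

class FolderSentences(object):
    """Stream tokenized sentences from disk, one at a time, so nothing is held in RAM."""
    def __init__(self, folder):
        self.folder = folder

    def __iter__(self):
        # Word2Vec iterates this once for build_vocab() and once per training pass
        for fname in sorted(os.listdir(self.folder)):
            with open(os.path.join(self.folder, fname)) as fin:
                for line in fin:
                    yield line.split()  # naive whitespace tokenization

sentences = FolderSentences('/path/to/corpus')  # hypothetical path covering all the data
model = Word2Vec(sentences, min_count=20, workers=16, size=200, sg=1, negative=5, window=5)
model.save_word2vec_format('vectors.bin', binary=True)

Because __iter__() starts from the beginning each time it is called, the same object can also be passed to build_vocab() and train() separately, as in the code in the issue description.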

iliaschalkidis commented 7 years ago

So @gojomo, you suggest building a dictionary object from the whole corpus on my own, then calling build_vocab() once, and finally calling train() multiple times with different data from disk?

piskvorky commented 7 years ago

@KiddoThe2B see https://rare-technologies.com/data-streaming-in-python-generators-iterators-iterables/ and https://rare-technologies.com/word2vec-tutorial/.

Like Lev said, not a bug and discussion more suited for the mailing list.

gojomo commented 7 years ago

@KiddoThe2B - No, make your corpus stream item-by-item from disk as an 'Iterable' object. Ask on list if you need more clarification.

iliaschalkidis commented 7 years ago

I read all the suggested links, then simply renamed load_corpus() to __iter__() and replaced sentences.append(sentence) with yield sentence in my DataLoader class, and it works like a charm!
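
To make that concrete, the change amounted to roughly the sketch below; the file-reading details are assumptions for illustration, not the actual DataLoader code:

import os

class DataLoader(object):
    def __init__(self, folder):
        self.folder = folder

    def __iter__(self):  # formerly load_corpus(), which built and returned a list
        for fname in os.listdir(self.folder):
            with open(os.path.join(self.folder, fname)) as fin:
                for line in fin:
                    yield line.split()  # formerly: sentences.append(sentence)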

So the problem is solved and everything seems fine. I should have asked for support on the mailing list instead; I'll keep that in mind for the future.

Last question: does lazy loading from iterable objects affect Word2Vec's efficiency? I checked the vocabulary and it is the same, but what about training? I suppose not, if I understand it correctly...

Thank you all! I found the solution, and along the way I got a better understanding of generators and iterators in Python :)

piskvorky commented 7 years ago

No problem. Re. your question: use the mailing list.