Hi @KiddoThe2B,
such an open question is better suited for the mailing list.
There's no need to use the vocabulary-expansion (kinda-but-not-really 'online') 'update' feature here. Gensim doesn't require all documents to be in RAM - just a corpus that is 'Iterable' and can thus present all its examples each time Word2Vec needs them (once for build_vocab(), then again iter times for training). You should be able to change or replace your DataLoader class to only stream examples from disk; that will be the most memory-efficient approach and will also give the best vectors (by not confining some examples/words to training only early on, then being 'diluted' by the later 'update').
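For example, a minimal sketch of such a restartable streaming corpus (the class name and file layout here are assumptions: one pre-tokenized sentence per line, one file per document):

```python
import os

class SentenceStream(object):
    """Streams sentences from a directory of text files, one at a time.

    Being a class with __iter__ (rather than a one-shot generator) makes it
    restartable, so Word2Vec can scan it once for build_vocab() and then
    again `iter` times for training.
    """
    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        for fname in os.listdir(self.dirname):
            with open(os.path.join(self.dirname, fname)) as fin:
                for line in fin:
                    yield line.split()  # assumes space-separated, pre-tokenized text
```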
So @gojomo, you suggest building a dictionary object from the whole corpus on my own, then calling build_vocab() once, and finally calling train() multiple times with different data from disk?
@KiddoThe2B see https://rare-technologies.com/data-streaming-in-python-generators-iterators-iterables/ and https://rare-technologies.com/word2vec-tutorial/.
Like Lev said, not a bug and discussion more suited for the mailing list.
@KiddoThe2B - No, make your corpus stream item-by-item from disk as an 'Iterable' object. Ask on list if you need more clarification.
I read all the suggested links, then I just renamed load_corpus() to __iter__() and replaced sentences.append(sentence) with yield sentence in my DataLoader class, and it works like a charm!
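For reference, the model can now be trained by passing the iterable directly, since the class is restartable rather than a one-shot generator (a sketch; the hyperparameters are illustrative):

```python
from gensim.models import Word2Vec

sentences = DataLoader('/path/to/corpus')  # streams from disk via __iter__ / yield
model = Word2Vec(sentences, size=100, window=5, min_count=5, workers=4)
```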
So I figured out the problem; it seems everything is fine. I should have asked for support on the mailing list instead, but I'll keep that in mind for the future.
Last question: does lazy loading from iterable objects affect Word2Vec's efficiency? I checked the vocabulary and it is the same, but what about training? I suppose it doesn't, if I understand correctly...
Thank you all! I found the solution, and along the way I got a better understanding of generators and iterators in Python :)
No problem. Re. your question: use the mailing list.
Hi, I want to train a word2vec model on a vast corpus of documents (approximately 250 GB). The machine that will host the experiments has 128 GB of RAM, so it is impossible to train the model on everything at once.
I had the same issue in the past with less data and I found this blog post (http://rutumulkar.com/blog/2015/word2vec), which suggested a solution, but it was not part of the main distribution in version 0.12.4.
I observed in the gensim code that in the current version, 0.13.3, the build_vocab() function supports an update parameter.
So I wrote a piece of code along these lines (a sketch; DataLoader is my own loader class, and the hyperparameters are illustrative):
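```python
from gensim.models import Word2Vec

folders = ['folder_1', 'folder_2']  # each holds part of the corpus

model = Word2Vec(size=100, window=5, min_count=5, workers=4)
for i, folder in enumerate(folders):
    sentences = DataLoader(folder)  # my loader class, reads one folder from disk
    # expand the existing vocabulary on every round after the first
    model.build_vocab(sentences, update=(i > 0))
    model.train(sentences)  # newer gensim versions expect total_examples/epochs explicitly
```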
Running with 2 folders (16 documents, 9 documents), I had the following output:
I have the following questions based on the above. Beginning the second round of training:
- Are the first 473 embeddings initialized to the values they had at the end of the first round of training?
- Will they improve further using the sentences of the second folder, or are they "frozen" during the second round?
I tried to save the model and load it again between the two rounds, but I got the following error:
AttributeError: 'Word2Vec' object has no attribute 'syn1neg'
Is it possible to do so, i.e. save, load, and then extend/improve my model? Also, can I have an explanation of the other available parameters of the build_vocab() function:
def build_vocab(self, sentences, keep_raw_vocab=False, trim_rule=None, progress_per=10000, update=False):