seomoz / word2gauss

Gaussian word embeddings
MIT License

scalability? #24

Closed acc4012 closed 6 years ago

acc4012 commented 6 years ago

I tried to train on a large corpus on an Ubuntu machine with 64GB of RAM, but there is not enough memory for training. Is this a known issue?

Corpus size = 100GB (raw unicode text)
Vocab size = nearly 300,000
Parameters are as in the example: embed = GaussianEmbedding(len(vocab), 100, covariance_type='spherical', energy_type='KL')

The last log is as follows.
2018-03-10:18:21:59 INFO [threading.py:868] Processed 853870 batches, elapsed time: 10443.919032812119

The process is killed by OOM killer after this line.

matt-peters commented 6 years ago

This shouldn't run out of memory on a 64GB machine -- I have easily trained a model this large. The memory usage scales with num_token_vocab * embedding_dim. All of the memory is allocated at the beginning of training. After training for a few batches, how much memory is the process using?
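For reference, a rough back-of-the-envelope estimate of the parameter memory (a sketch, assuming float32 storage, one mean vector plus one spherical variance per token, and a matching gradient accumulator for each parameter; the actual word2gauss memory layout may differ):

```python
# Rough memory estimate for Gaussian embeddings with spherical covariance.
# Assumptions (not taken from word2gauss internals): float32 parameters,
# one mean vector and one variance scalar per token, plus one gradient
# accumulator of the same shape for each.

def estimate_memory_gb(vocab_size, embed_dim, bytes_per_float=4):
    means = vocab_size * embed_dim          # mu: one vector per token
    variances = vocab_size                  # sigma: one scalar per token (spherical)
    params = means + variances
    grad_accumulators = params              # assume one accumulator per parameter
    total_bytes = (params + grad_accumulators) * bytes_per_float
    return total_bytes / 1024 ** 3

# ~300k vocab, 100-dim embeddings: well under 1 GB of parameter memory,
# so a 64 GB machine should not run out of memory from the model itself.
print(round(estimate_memory_gb(300_000, 100), 3))
```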

A few other likely causes for this issue:

acc4012 commented 6 years ago

Thanks, matt-peters, for the reply.

There is no other memory-intensive process running and no memory-restriction policy in place. I attach the OOM killer log here:
Killed process 30311 (python) total-vm:87803176kB, anon-rss:64515860kB, file-rss:80kB

I ran training twice, and the same error seems to happen at the same batch (Processed 853870 batches) before the process gets killed.

Is it possible for a very long sentence to cause a problem like this? Could you tell me how to inspect the 853870th batch and the ones around it?

acc4012 commented 6 years ago

I split long sentences into chunks of 1024 tokens and training completed without a problem.
Note that the whole sentence is loaded into RAM before chunking, and the sentence generator module works fine when simply iterating through the corpus without feeding it to word2gauss.
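For anyone hitting the same problem, here is a minimal sketch of the workaround (the generator name and the 1024-token chunk size are just illustrative; adapt it to however your corpus iterator yields sentences):

```python
# Split each (possibly very long) sentence into chunks of at most
# max_tokens tokens before handing it to the trainer, so a single huge
# sentence never has to be batched as one unit.

def chunk_sentences(sentences, max_tokens=1024):
    """Yield token lists no longer than max_tokens each.

    `sentences` is any iterable of token lists, e.g. a corpus generator.
    """
    for tokens in sentences:
        for start in range(0, len(tokens), max_tokens):
            yield tokens[start:start + max_tokens]

# Example: a 5000-token "sentence" becomes five chunks of <= 1024 tokens.
long_sentence = ["tok"] * 5000
print([len(c) for c in chunk_sentences([long_sentence])])  # [1024, 1024, 1024, 1024, 904]
```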

matt-peters commented 6 years ago

Ah yes, that makes sense. The batcher loads entire sentences at once, so a very long sentence would require a lot of RAM. In any case, we only consider windows of a fixed length around each word, so there is no drawback to splitting the sentences into reasonably small chunks.
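To illustrate why chunking is harmless (a sketch of the general fixed-window idea, not the exact word2gauss batching code), only the few pairs that would have spanned a chunk boundary are lost:

```python
# Generate (center, context) token pairs within a fixed window, the way
# window-based embedding training typically walks over a token sequence.

def window_pairs(tokens, window=5):
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield center, tokens[j]

# With a window of 5, splitting a long sentence into 1024-token chunks only
# drops the handful of pairs whose center and context fall in different chunks.
```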

b4hand commented 6 years ago

I'm going to close this out since it seems like the original issue was worked around, but maybe a future improvement would be to automatically chunk the sentences on input.