idio / wiki2vec

Generating Vectors for DBpedia Entities via Word2Vec and Wikipedia Dumps. Questions? https://gitter.im/idio-opensource/Lobby

Memory problem in building wiki2vec model via gensim #7

Open nooralahzadeh opened 9 years ago

nooralahzadeh commented 9 years ago

Hi, did you have memory problems loading the trained wiki2vec model in gensim? I trained with size=500, window=10, min_count=10 on the latest English Wikipedia dump, which produced a 13 GB wiki2vec model. When loading it in gensim I get a MemoryError. Do you have any idea how much memory I need?
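(For a rough sense of the requirement: a gensim Word2Vec model's footprint is dominated by two vocabulary × dimensions matrices of float32, the word vectors plus the hidden/output weights kept for training, on top of the vocabulary dictionary. The numbers below are illustrative placeholders, not measured from this model.)

# Back-of-envelope RAM estimate for a full gensim Word2Vec model.
vocab_size = 3000000          # hypothetical vocabulary size for an enwiki model
dims = 500                    # vector size used above
bytes_per_float = 4           # float32
matrix_bytes = vocab_size * dims * bytes_per_float   # word vectors (syn0)
total_bytes = 2 * matrix_bytes                       # plus syn1/syn1neg kept for training
print(total_bytes / 1024**3, "GB, before vocabulary/dict overhead")   # ~11 GB here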

dav009 commented 9 years ago

Yeah, this is due to the vocabulary size. I think there has been some work around this in gensim's word2vec implementation since I last looked.

If you are only interested in getting the entities' vectors, then @phdowling has a gensim branch for that, which applies the min_count filter to anything that is not an entity vector.

Otherwise, you could try reducing your vocabulary (for example, by training with a higher min_count).
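(Illustrative only, not part of the original thread: a minimal sketch of pruning a trained model down to just the entity vectors after the fact, assuming a recent gensim KeyedVectors API and that entity tokens carry the DBPEDIA_ID/ prefix wiki2vec uses; the branch mentioned above filters during training instead, and this still needs one machine that can load the full model once.)

from gensim.models import KeyedVectors

kv = KeyedVectors.load("en_500/en.model.kv")   # hypothetical path to the saved vectors
entity_keys = [k for k in kv.index_to_key if k.startswith("DBPEDIA_ID/")]

# Copy only the entity vectors into a much smaller KeyedVectors object and save it.
entities_only = KeyedVectors(vector_size=kv.vector_size)
entities_only.add_vectors(entity_keys, [kv[k] for k in entity_keys])
entities_only.save("en_500/entities_only.kv")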

nooralahzadeh commented 9 years ago

Exactly, I want to have just the entity vectors. What do I have to do? Thanks

dav009 commented 9 years ago

So I think the best you can do at the moment is to use this gensim fork (the develop branch): https://github.com/piskvorky/gensim/ . That fork contains some changes which will help you deal with the vocab size.

One thing: depending on your current setup (Linux or OS X), you might want to pay attention to how gensim is compiled with Cython, so that when gensim runs it makes use of all your cores.

Give it a go and let us know if it goes alright.
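(A quick sanity check, sketched with current gensim and placeholder parameters: whether the compiled Cython routines are active, and how the workers parameter spreads training across cores.)

from gensim.models import Word2Vec
from gensim.models.word2vec import FAST_VERSION

# FAST_VERSION is -1 when the optimized C/Cython extension is missing; older
# gensim then falls back to slow single-core pure-Python training.
print("FAST_VERSION:", FAST_VERSION)

# With the extension compiled, `workers` uses multiple cores during training.
# `corpus` is a placeholder iterable of tokenized sentences.
# model = Word2Vec(corpus, vector_size=500, window=10, min_count=10, workers=8)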

jesuisnicolasdavid commented 8 years ago

Hi everyone, I have the same issue with the memory error. I am trying to increase min_count to get rid of the error, but nothing is working. Any thoughts? Is there a way to reduce the dimensionality from 1000 to maybe 300?

from gensim.models import Word2Vec
word2 = Word2Vec(min_count=100)
model = word2.load("/home/dev/work_devbox1/en_1000_no_stem/en.model")

phdowling commented 8 years ago

@jesuisnicolasdavid if that is literally the code you are running, then changing min_count will probably not help you. You're calling the load method - this doesn't train a new model, it simply loads an existing one. My guess is the existing model simply doesn't fit into RAM.

The min_count parameter applies if you're training a new model, more specifically it filters out words that don't occur frequently enough.

How big is the file you're trying to load and how much RAM does your machine have?
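(To make the distinction concrete, a minimal sketch with placeholder paths; parameter names follow current gensim, where older releases call vector_size simply size.)

from gensim.models import Word2Vec

# Loading an existing model: min_count has no effect here, the saved
# vocabulary and vectors are read back exactly as they were trained.
model = Word2Vec.load("/path/to/en.model")

# min_count only matters when training a new model: words seen fewer than
# min_count times are dropped from the vocabulary before training starts.
# new_model = Word2Vec(corpus, vector_size=300, window=10, min_count=100, workers=4)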

jesuisnicolasdavid commented 8 years ago

So the file is 9GB. I tried to run the model on a first computer with a TitanX and 16GB of RAM: the model allocates all the RAM and falls into a memory error before even getting to the GPU. Then I tried the same code on a second computer with two GTX 980s and 64GB of RAM: the wiki2vec model alone takes 20GB. Then I run into a GPU memory error with Theano through Keras, which says:

('Error allocating 4604368000 bytes of device memory (out of memory).', "you might consider using 'theano.shared(..., borrow=True)'")

But I think I will move this question to a Theano issue :)
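(For completeness, the workaround the error message points at looks roughly like this; embedding_matrix is a placeholder for the large numpy array being handed to Theano.)

import numpy as np
import theano

# borrow=True lets the shared variable reuse the existing numpy buffer instead
# of copying it, saving one host-side allocation; it does not shrink what
# ultimately has to fit in GPU memory.
embedding_matrix = np.zeros((1000, 1000), dtype="float32")   # placeholder array
W = theano.shared(embedding_matrix, borrow=True)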

dav009 commented 8 years ago

Is this the model provided in the torrent? I've loaded it successfully on a 16GB machine. If you are running into trouble, you can try loading the model in a simple Python script and then exporting the vectors to a plain file; that might be more flexible to work with, without loading the whole thing.
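(A minimal sketch of that export, using the path quoted earlier in the thread and a made-up output filename; on recent gensim the exporter lives on model.wv, on older releases it is a method of the model itself.)

from gensim.models import Word2Vec

# Load once on a machine with enough RAM, then dump the vectors to a plain
# word2vec-format text file that other tools can stream line by line.
model = Word2Vec.load("/home/dev/work_devbox1/en_1000_no_stem/en.model")
model.wv.save_word2vec_format("en_1000_vectors.txt", binary=False)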

jesuisnicolasdavid commented 8 years ago

Is there a way to turn the 1000-dimensional pre-trained vectors into 300-dimensional ones?

dav009 commented 8 years ago

Not that I'm aware of, but you can always generate 300-dimensional vectors yourself; it should only take some hours. For languages other than English there are models with 200/300 dimensions.
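(A minimal sketch of retraining at a lower dimensionality with current gensim; wiki_corpus.txt is a placeholder for a preprocessed, tokenized Wikipedia dump with one sentence per line, and the other parameters mirror the ones mentioned earlier in the thread.)

from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# Stream the corpus from disk rather than holding it all in memory.
sentences = LineSentence("wiki_corpus.txt")
model = Word2Vec(sentences, vector_size=300, window=10, min_count=10, workers=8)
model.save("en_300/en.model")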

phdowling commented 8 years ago

Yeah, I don't think there's an easy way to soundly change the dimensionality of the vectors. You might be able to lower the RAM requirements by actually throwing away part of the vocabulary, i.e. loading fewer vectors, but this might also be quite hard if you're dealing with a raw numpy file and have no machine that can actually load it.
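(If the vectors are available in word2vec text or binary format, for example the hypothetical export sketched above, gensim's loader can cap how many rows it reads; the limit value here is arbitrary.)

from gensim.models import KeyedVectors

# `limit` reads only the first N vectors of the file (usually ordered by
# descending frequency), which bounds RAM at load time.
kv = KeyedVectors.load_word2vec_format("en_1000_vectors.txt", binary=False, limit=500000)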

jesuisnicolasdavid commented 8 years ago

Thanks guys, I will try to generate a 300-dimensional model on my own. I'm still wondering in what cases 1000 dimensions can be useful?

vondiplo commented 8 years ago

@jesuisnicolasdavid have you been successful in creating a 300-dimensional model?

dav009 commented 8 years ago

This is probably solved in the newest gensim version. I'm going to check that out and bump the version that gets installed.

@vondiplo worth giving that a try ^