lisa-groundhog / GroundHog

Library for implementing RNNs with Theano
BSD 3-Clause "New" or "Revised" License

RAM issue for training large models #27

Closed DmitryKey closed 9 years ago

DmitryKey commented 9 years ago

Hello!

Training an NMT model is quite affordable for a relatively small parallel corpus, on the order of 300k sentence pairs.

When I tried to train on a 1m+ sentence-pair corpus, I got memory issues on Linux and very slow iterations (up to an hour each) on Mac. Is there some easy win to mitigate this problem on servers with up to 16 GB RAM? Could you share some technical details of the models you have trained, including the hardware used?

rizar commented 9 years ago

Sort of surprised to hear that, because we simply iterate sequentially through the dataset. We rely on pytables to do all the caching and unloading magic. Does memory usage grow during program execution, or is it big from the beginning?

DmitryKey commented 9 years ago

On Linux with 16 GB RAM, the python process consumes a growing amount of RAM until it runs out (it looks like a memory leak).

On Mac with 8 GB, the process takes all available memory. The first attempt to run the training was auto-killed by the OS; the second and later attempts succeeded but, as said, take about 1h per iteration (is that expected?)
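One way to tell a steady leak from a one-time large allocation is to log the process's peak resident set size from inside the training loop. This is a minimal sketch using only the stdlib `resource` module (Linux/Mac); the loop and the logging interval are hypothetical, not GroundHog code:

```python
import resource
import sys

def peak_rss_mb():
    """Return this process's peak resident set size in MB.

    ru_maxrss is reported in kilobytes on Linux but in bytes on macOS,
    hence the platform-dependent divisor.
    """
    rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return rss / float(divisor)

# Hypothetical usage: print peak RSS every few updates. A value that
# keeps climbing across updates suggests a leak; a value that is
# already large at the first update suggests eager loading of the
# whole corpus.
for step in range(3):
    print("step %d, peak RSS %.1f MB" % (step, peak_rss_mb()))
```

On Linux, comparing this against the *current* RSS from `/proc/self/status` would additionally show whether memory is ever released.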

rizar commented 9 years ago

Which iterator class do you use? The one from here?

DmitryKey commented 9 years ago

Here is how I call https://github.com/lisa-groundhog/GroundHog/blob/master/experiments/nmt/train.py:

python train.py --proto=prototype_encdec_state "prefix='encdec-50_',seqlen=50,sort_k_batches=20" --state ru-data.py

In ru-data.py there are no iterator-specific settings. One extra param in it is reload=True
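For reference, the `sort_k_batches=20` flag in the command above tells the iterator to buffer roughly 20 batches of sentence pairs, sort them by length (so each batch pads to similar lengths), and then emit them. A larger k reduces padding waste but increases the iterator's resident memory. A minimal sketch of that pattern (this is an illustration, not the actual GroundHog iterator):

```python
def sort_k_batches(pairs, batch_size, k):
    """Yield batches after length-sorting a buffer of k batches.

    Buffers k * batch_size sentence pairs at a time, sorts them by
    source-sentence length so batches pad to similar lengths, then
    emits them. Memory use is proportional to k * batch_size, not to
    the corpus size.
    """
    buf = []
    for pair in pairs:
        buf.append(pair)
        if len(buf) == k * batch_size:
            buf.sort(key=lambda p: len(p[0]))
            for i in range(0, len(buf), batch_size):
                yield buf[i:i + batch_size]
            buf = []
    # Flush whatever is left at the end of the corpus.
    buf.sort(key=lambda p: len(p[0]))
    for i in range(0, len(buf), batch_size):
        yield buf[i:i + batch_size]

# Toy corpus: (source_tokens, target_tokens) pairs of varying length.
pairs = [(["w"] * n, ["w"] * n) for n in (5, 1, 3, 2, 4)]
batches = list(sort_k_batches(pairs, batch_size=2, k=2))
```

With this toy input, the first buffer of four pairs is sorted to lengths 1, 2, 3, 5 before batching, and the leftover length-4 pair is flushed as a final short batch.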

rizar commented 9 years ago

Then I guess it is an issue with the pytables module we use for loading data here. I had a feeling that it was not very stable this summer. I am afraid I cannot help more. In fact, this is not the most important component, and I am pretty sure you can replace it with simple reading from a text file.
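That suggestion can be sketched as a plain generator over two parallel text files; the file names, demo corpus, and whitespace tokenization below are placeholders, not GroundHog's actual loader. Because the generator holds only one line pair at a time, memory stays flat regardless of corpus size:

```python
import io
import os
import tempfile

def iter_sentence_pairs(src_path, trg_path, encoding="utf-8"):
    """Lazily yield (source_tokens, target_tokens) pairs.

    Reads the two files line by line, so memory use does not grow
    with corpus size -- unlike a loader that materialises the whole
    dataset up front.
    """
    with io.open(src_path, encoding=encoding) as src, \
         io.open(trg_path, encoding=encoding) as trg:
        for src_line, trg_line in zip(src, trg):
            yield src_line.split(), trg_line.split()

# Demo with a tiny hypothetical parallel corpus in temporary files.
tmp = tempfile.mkdtemp()
src_path = os.path.join(tmp, "corpus.ru")
trg_path = os.path.join(tmp, "corpus.en")
with io.open(src_path, "w", encoding="utf-8") as f:
    f.write(u"\u043f\u0440\u0438\u0432\u0435\u0442 \u043c\u0438\u0440\n")
with io.open(trg_path, "w", encoding="utf-8") as f:
    f.write(u"hello world\n")

pairs = list(iter_sentence_pairs(src_path, trg_path))
```

A real replacement would still need to plug into GroundHog's batching and padding, but the streaming part itself is this simple.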

DmitryKey commented 9 years ago

Here is the screenshot of running under Mac:

[Screenshot: process memory usage on Mac, 2015-03-08 14:27]

DmitryKey commented 9 years ago

thanks @rizar, I should take a look, though I have yet to familiarise myself with the code of this project, so I might bug you guys a bit more later.

nouiz commented 9 years ago

Update pytables. Old versions had a memory bug, if my memory is right.


DmitryKey commented 9 years ago

The currently used pytables version is the latest release: 3.1.1, released Mar 26, 2014.

Do you suggest updating to the current master of pytables?


nouiz commented 9 years ago

I recall that old releases had problems that got fixed in newer releases. I have not tried the development version. If it is easy for you to test, it could be useful to try.

Just a note, Groundhog will be replaced by blocks:

https://github.com/bartvm/blocks

But currently it does not have all of Groundhog's functionality. I was told that some people are working on that. I do not know whether your use case is implemented in blocks right now.

@bartvm can you comment on that?


rizar commented 9 years ago

Right, @DmitryKey, I would recommend switching to Blocks. The only advantage of Groundhog is that it has do-it-all scripts for machine translation with proven hyperparameter and algorithm choices. There is a demo script that trains a model capable of reversing words in a text; it should be a good starting point for a machine translation script.

DmitryKey commented 9 years ago

@nouiz I might update pytables to the master, that should be easy enough. Will update this thread with my findings.

@rizar sounds like a way to go, thanks for the demo script.

DmitryKey commented 9 years ago

Interesting: switching from the RNN Encoder-Decoder to RNNSearch has fixed the RAM issue. Currently the model is training at 5.5 GB RAM with 1-2 min per step. Closing this issue for now. Thanks everybody!

rizar commented 9 years ago

Hold on, did you train your model on CPU?

DmitryKey commented 9 years ago

@rizar yes. The model is still training.

Actually, regarding the RAM issue: I got an out-of-memory again this morning after 30 iterations. I'm hoping this was an irregular glitch.

Current CPU+RAM graphs:

[Screenshot: CPU and RAM usage graphs during training]