google / seq2seq

A general-purpose encoder-decoder framework for Tensorflow
https://google.github.io/seq2seq/
Apache License 2.0

Memory usage when training on CPU vs GPU #212

Open ghost opened 7 years ago

ghost commented 7 years ago

Hi

I noticed (by checking top on Linux) that when I train a model on the CPU, the CPU is used quite heavily (which is normal, of course), while memory usage only goes up to 3-5% and fluctuates a lot.

When training on the GPU, however, the CPU memory usage rises steadily (I'm training a small model on 2,000,000 training sentences with about 600,000 training steps and batch size 32; CPU memory usage has already reached about 60% while I'm only at training step 300,000 or so). Furthermore, the memory doesn't seem to be released during training, so the process gradually occupies all available memory.

Although this hasn't resulted in any errors yet, I am afraid there will be some kind of out-of-memory error once CPU memory usage gets close to 100%. Does anyone know if this is normal behaviour, or what could cause it?
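
In the meantime, one way to pin down whether host memory really keeps growing step over step is to log the training process's resident set size periodically. A minimal sketch, assuming psutil is installed (it is not part of seq2seq) and that you can call a helper from wherever the training loop already reports progress:

```python
# Minimal host-memory logging sketch (assumes psutil is installed; it is not
# part of seq2seq). Call log_host_memory() from wherever the training loop
# already reports progress, e.g. every few thousand steps.
import os
import psutil

_process = psutil.Process(os.getpid())

def log_host_memory(step):
    # Resident set size of this process, in megabytes.
    rss_mb = _process.memory_info().rss / (1024.0 ** 2)
    print("step %d: resident memory %.1f MB" % (step, rss_mb))

if __name__ == "__main__":
    # Stand-in for a real training loop, just to show the call pattern.
    for step in range(0, 5000, 1000):
        log_host_memory(step)
```

If the logged RSS grows roughly linearly with the step count, that confirms a leak rather than normal buffering.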

Thanks in advance

mohgh commented 7 years ago

I've also got this problem. Excessive RAM use at evaluation time with the GPU causes my process to get killed by the OS. Here is the log:

May 11 20:40:04 paratech-pc kernel: [199450.406160] Out of memory: Kill process 25393 (python3) score 822 or sacrifice child
May 11 20:40:04 paratech-pc kernel: [199450.406266] Killed process 25393 (python3) total-vm:80431720kB, anon-rss:26623524kB, file-rss:58516kB, shmem-rss:263168kB
May 11 20:40:05 paratech-pc kernel: [199451.564611] oom_reaper: reaped process 25393 (python3), now anon-rss:0kB, file-rss:60528kB, shmem-rss:263168kB

Here you can see that it had allocated about 26 gigabytes of RAM (anon-rss) before being killed.
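
One common cause of unbounded host-memory growth in TensorFlow 1.x training and evaluation loops (in general, not confirmed as the cause here) is new ops being added to the graph on every step or every evaluation pass. A way to rule that out is to finalize the graph before entering the loop, so any later op creation fails loudly instead of silently growing the graph. A small self-contained sketch:

```python
# General TensorFlow 1.x diagnostic sketch, not a confirmed fix for this issue:
# finalize the graph before the run loop so that any op accidentally created
# inside the loop raises a RuntimeError instead of silently leaking memory.
import numpy as np
import tensorflow as tf

graph = tf.Graph()
with graph.as_default():
    x = tf.placeholder(tf.float32, shape=[None, 4], name="x")
    w = tf.Variable(tf.random_normal([4, 1]), name="w")
    y = tf.matmul(x, w)
    init = tf.global_variables_initializer()

graph.finalize()  # from here on, adding ops to `graph` raises RuntimeError

with tf.Session(graph=graph) as sess:
    sess.run(init)
    for step in range(3):
        # Only runs existing ops; nothing new is added to the graph.
        out = sess.run(y, feed_dict={x: np.zeros((8, 4), dtype=np.float32)})
        print("step %d, output shape %s" % (step, out.shape))
```

If finalizing triggers an error inside the loop, the graph is growing and that explains the memory climb; if not, the growth more likely comes from the input pipeline or from Python-side buffering of results.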