Open mingxuan opened 7 years ago
Is it CPU or GPU memory? How do you see that 4x difference?
How many GPUs are used in parallel?
Normally, it should not use more memory on the GPU, but it could use more memory on the CPU depending on how you use it. Each process/GPU uses extra CPU memory.
On Thu, Dec 8, 2016 at 3:52 AM, mingxuan notifications@github.com wrote:
I write a neural machine translation system with platoon. The batch size is 80 and sync every 10 mini-batches. I found that the memory cost about 4 times larger than the same system without platoon. Does someone else have the same experience?
It's GPU memory. I use the command "nvidia-smi" to see the GPU memory cost. I found that with platoon, the memory cost is stable and the "GPU-Util" stays very close to 100%. Without platoon, the GPU memory cost changes rapidly during training and the "GPU-Util" also varies from about 30% to 100%. Would Platoon change the default config of Theano? Thanks for your help.
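In case it helps to compare the two setups, one way to watch memory and utilization over time is nvidia-smi's query mode (standard nvidia-smi options, nothing platoon-specific):

```shell
# Poll GPU memory use and utilization once per second while training runs.
nvidia-smi --query-gpu=timestamp,memory.used,utilization.gpu --format=csv -l 1
```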
I am hitting the same problem, and in my case it is worse: I get an "out of memory" error, so my NMT system cannot train with platoon at all. Have you solved this problem in the end?
Thanks for your help.
The problem may come from NCCL and pygpu. I find that Theano built with NCCL and pygpu costs much more memory than the previous version.
Yes, the extra memory cost is caused by the new back-end of Theano. We prefer to use THEANO_FLAGS=gpuarray.preallocate=0.95,...
to pre-allocate GPU memory, where you can set 0.95 to any other value \in (0, 1). See this issue
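A minimal sketch of how that flag might be passed when launching a worker, assuming the new gpuarray back-end; the script name train_nmt.py is a placeholder:

```shell
# Reserve 95% of GPU memory up front instead of growing allocations on demand.
# Any fraction in (0, 1) works; train_nmt.py is a hypothetical training script.
THEANO_FLAGS="device=cuda0,gpuarray.preallocate=0.95" python train_nmt.py
```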
I have also tested the "lstm" example, which costs about 5 GB of memory with batch size 16 and hidden size 1024. Could someone help me find the problem?