nabihach opened this issue 6 years ago
@nabihach I noticed the same behaviour when running the code on my GPU (GTX 1070), i.e. the memory usage did not increase linearly. I suspected it was either a Python version issue or intermediate variables taking up memory, but neither turned out to be the actual problem. Also, since the problem occurs during the training phase, setting volatile=True would not help either. I found this post on the PyTorch forum describing a similar OOM issue when training a Seq2Seq model, which might be of help: https://discuss.pytorch.org/t/high-gpu-memory-demand-for-seq2seq-compared-to-tf/7480/7
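For reference, a common cause of this kind of steady memory growth during training (and a frequent suggestion in threads like the one above) is accumulating the graph-attached loss tensor for logging, which keeps every batch's computation graph alive. Below is a minimal sketch of the pattern and the fix; the loop structure and names (`model`, `criterion`, `batches`) are assumptions for illustration, not the actual code in this repo:

```python
import torch

def train_epoch(model, batches, criterion, optimizer):
    total_loss = 0.0
    for src, tgt in batches:
        optimizer.zero_grad()
        output = model(src, tgt)
        loss = criterion(output, tgt)
        loss.backward()
        optimizer.step()

        # BAD: `total_loss += loss` keeps a reference to the full computation
        # graph of every batch, so memory grows with each iteration/epoch.
        # GOOD: accumulate a plain Python float so the graph can be freed
        # (use loss.item() on PyTorch >= 0.4, or loss.data[0] on older versions).
        total_loss += loss.item()
    return total_loss / len(batches)
```

Whether this is actually what is happening here would need to be confirmed against the training loop in this repo.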
Some other things that I plan to inspect:
I tried training with a much smaller portion of the dataset. Here is the CPU memory usage after each epoch. It seems to increase roughly linearly for the first 17 epochs, stays stable around 4.8 GB until the 25th epoch, and then starts increasing linearly again before crashing. Is this expected behaviour?

Epoch 1: 0.37 GB
Epoch 2: 0.66 GB
Epoch 3: 0.92 GB
Epoch 4: 1.19 GB
Epoch 5: 1.48 GB
Epoch 6: 1.76 GB
Epoch 7: 2.05 GB
Epoch 8: 2.34 GB
Epoch 9: 2.61 GB
Epoch 10: 2.89 GB
Epoch 11: 3.19 GB
Epoch 12: 3.43 GB
Epoch 13: 3.70 GB
Epoch 14: 3.98 GB
Epoch 15: 4.26 GB
Epoch 16: 4.43 GB
Epoch 17: 4.58 GB
Epoch 18: 4.70 GB
Epoch 19: 4.82 GB
Epoch 20: 4.78 GB
Epoch 21: 4.76 GB
Epoch 22: 4.80 GB
Epoch 23: 4.77 GB
Epoch 24: 4.84 GB
Epoch 25: 4.87 GB
Epoch 26: 5.02 GB
Epoch 27: 5.08 GB
Epoch 28: 5.12 GB
Epoch 29: 5.23 GB
Epoch 30: 5.33 GB
Epoch 31: 5.42 GB
Epoch 32: out-of-memory
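For anyone who wants to reproduce these measurements, per-epoch CPU memory can be logged roughly like this (a sketch using psutil; the measurement method used above isn't stated, and `train_one_epoch`, `model`, and `data` are hypothetical names):

```python
import os
import psutil

def log_rss(tag: str) -> None:
    # Print the resident set size (CPU memory) of the current process in GB.
    rss_gb = psutil.Process(os.getpid()).memory_info().rss / 1024 ** 3
    print(f"{tag}: {rss_gb:.2f} GB")

# e.g. inside the training script:
# for epoch in range(1, n_epochs + 1):
#     train_one_epoch(model, data)   # hypothetical training call
#     log_rss(f"Epoch {epoch}")
```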