msc-acse / acse-9-independent-research-project-Garethlomax

acse-9-independent-research-project-Garethlomax created by GitHub Classroom

Issue with large tensors being stored in memory - unsure of origins #24

Open Garethlomax opened 5 years ago

Garethlomax commented 5 years ago

Running a snippet to inspect the garbage collector shows a number of large (4096, 4096) tensors with no good explanation of their origin. These may be temporaries that have not been deleted and are too costly to keep around during training, or they may be due to a memory leak.
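A snippet along these lines (a minimal sketch using gc to list live tensors, in the spirit of the forum thread linked below):

```python
import gc
import torch

# Walk the garbage collector's tracked objects and report any live tensors.
# Large unexplained entries (e.g. (4096, 4096)) point at graphs or
# temporaries that something is still holding a reference to.
for obj in gc.get_objects():
    try:
        if torch.is_tensor(obj) or (hasattr(obj, 'data') and torch.is_tensor(obj.data)):
            print(type(obj), obj.size())
    except Exception:
        pass
```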

Useful: https://discuss.pytorch.org/t/how-to-debug-causes-of-gpu-memory-leaks/6741/12

From the above, Update 2: "Finally I solved the memory problem! I realized that in each iteration I put the input data in a new tensor, and pytorch generates a new computation graph. That causes the used RAM to grow forever. Then I use a placeholder tensor and copy the data to this tensor, and the RAM always stays at a low level :smile:"
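A minimal sketch of that placeholder pattern (the model, sizes, and names here are illustrative, not the project's):

```python
import torch
from torch import nn

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = nn.Linear(64, 1).to(device)        # stand-in for the real model
criterion = nn.MSELoss()
optimiser = torch.optim.SGD(model.parameters(), lr=0.01)

# Preallocate one placeholder tensor and copy each batch into it in-place,
# instead of creating a fresh input tensor every iteration.
input_buffer = torch.empty(32, 64, device=device)

for _ in range(100):
    data = torch.randn(32, 64)             # stand-in for a loader batch
    target = torch.randn(32, 1, device=device)
    input_buffer.copy_(data)               # reuse the same storage
    loss = criterion(model(input_buffer), target)
    optimiser.zero_grad()
    loss.backward()
    optimiser.step()
```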

Garethlomax commented 5 years ago

Relevant: https://github.com/pytorch/pytorch/issues/2198

Garethlomax commented 5 years ago

The issue appears to be a result of PyTorch's computational graph structure for backpropagation. The hidden state tensors need to be detached so the graph from previous iterations is not retained in memory.
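A minimal sketch of the detach pattern (the model and names are illustrative):

```python
import torch
from torch import nn

lstm = nn.LSTM(input_size=10, hidden_size=20, batch_first=True)
optimiser = torch.optim.SGD(lstm.parameters(), lr=0.01)
hidden = None

for _ in range(100):
    x = torch.randn(8, 5, 10)   # stand-in batch
    out, hidden = lstm(x, hidden)
    # Detach so the next backward pass stops here, instead of retaining
    # the graph (and memory) of every previous batch.
    hidden = tuple(h.detach() for h in hidden)
    loss = out.pow(2).mean()    # stand-in loss
    optimiser.zero_grad()
    loss.backward()
    optimiser.step()
```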

Garethlomax commented 5 years ago

For debugging, tools to visualize the computational graph of the LSTM:

https://github.com/szagoruyko/functional-zoo/blob/master/visualize.py

https://github.com/szagoruyko/pytorchviz
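Usage is roughly as follows (a sketch assuming torchviz and graphviz are installed):

```python
import torch
from torch import nn
from torchviz import make_dot

lstm = nn.LSTM(input_size=10, hidden_size=20, batch_first=True)
x = torch.randn(1, 5, 10)
out, (h, c) = lstm(x)

# Render the autograd graph of the output; unexpected extra branches here
# often correspond to hidden state carried over without detaching.
dot = make_dot(out, params=dict(lstm.named_parameters()))
dot.render('lstm_graph', format='pdf')
```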

Garethlomax commented 5 years ago

https://discuss.pytorch.org/t/solved-why-we-need-to-detach-variable-which-contains-hidden-representation/1426

Garethlomax commented 5 years ago

http://www.wildml.com/2015/10/recurrent-neural-networks-tutorial-part-3-backpropagation-through-time-and-vanishing-gradients/

Garethlomax commented 5 years ago

Useful on cuDNN and memory debugging: https://blog.paperspace.com/pytorch-memory-multi-gpu-debugging/
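For pinning down where allocations happen, PyTorch's built-in CUDA memory counters can be wrapped in a small helper (a sketch; `log_gpu_memory` is an illustrative name):

```python
import torch

def log_gpu_memory(tag):
    # Current and peak memory held by tensors on the GPU, in MB.
    print(f"{tag}: allocated={torch.cuda.memory_allocated() / 1e6:.1f} MB, "
          f"peak={torch.cuda.max_memory_allocated() / 1e6:.1f} MB")

# Call around suspect sections of the training loop, e.g.:
# log_gpu_memory('before forward')
# out = model(batch)
# log_gpu_memory('after forward')
```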

Garethlomax commented 5 years ago

Progress made with the cuDNN memory issue - will still explore detach approaches at a later date. Will also look into garbage collection inefficiencies.

Relevant: https://discuss.pytorch.org/t/the-pack-sequence-recurrent-network-unpack-sequence-pattern-in-a-lstm-training-with-nn-dataparallel/23260
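The pack → run → unpack pattern from that thread, as a minimal sketch (shapes and names illustrative):

```python
import torch
from torch import nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

lstm = nn.LSTM(input_size=10, hidden_size=20, batch_first=True)

x = torch.randn(4, 7, 10)             # padded batch: (batch, max_len, features)
lengths = torch.tensor([7, 5, 3, 2])  # true sequence lengths, descending

# pack -> recurrent network -> unpack: the LSTM skips the padded steps,
# avoiding wasted compute and spurious graph nodes for the padding.
packed = pack_padded_sequence(x, lengths, batch_first=True)
packed_out, _ = lstm(packed)
out, out_lengths = pad_packed_sequence(packed_out, batch_first=True)
```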

Garethlomax commented 5 years ago

Gradient averaging is another potential remedy: https://gchlebus.github.io/2018/06/05/gradient-averaging.html
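The idea is to accumulate gradients over several micro-batches before stepping the optimizer, giving a larger effective batch without holding all the activations at once. A minimal sketch (names illustrative):

```python
import torch
from torch import nn

model = nn.Linear(64, 1)
criterion = nn.MSELoss()
optimiser = torch.optim.SGD(model.parameters(), lr=0.01)
accumulation_steps = 4   # effective batch = 4 x micro-batch

optimiser.zero_grad()
for step in range(100):
    x, y = torch.randn(8, 64), torch.randn(8, 1)        # micro-batch
    loss = criterion(model(x), y) / accumulation_steps  # scale so grads average
    loss.backward()                                     # grads accumulate in .grad
    if (step + 1) % accumulation_steps == 0:
        optimiser.step()
        optimiser.zero_grad()
```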

Garethlomax commented 5 years ago

Truncated backprop:

https://docs.chainer.org/en/stable/examples/rnn.html
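The Chainer docs describe the idea; in PyTorch it amounts to splitting the long sequence into chunks and detaching the hidden state between them, so gradients flow at most one chunk back. A minimal sketch:

```python
import torch
from torch import nn

lstm = nn.LSTM(input_size=10, hidden_size=20, batch_first=True)
optimiser = torch.optim.SGD(lstm.parameters(), lr=0.01)

long_seq = torch.randn(1, 1000, 10)   # one long sequence
tbptt_len = 50                        # truncation window
hidden = None

for chunk in long_seq.split(tbptt_len, dim=1):
    out, hidden = lstm(chunk, hidden)
    # Truncate the graph: backprop reaches at most tbptt_len steps back.
    hidden = tuple(h.detach() for h in hidden)
    loss = out.pow(2).mean()          # stand-in loss
    optimiser.zero_grad()
    loss.backward()
    optimiser.step()
```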