ikostrikov / pytorch-a3c

PyTorch implementation of Asynchronous Advantage Actor Critic (A3C) from "Asynchronous Methods for Deep Reinforcement Learning".

Possible memory leak? #11

Closed scientist1642 closed 7 years ago

scientist1642 commented 7 years ago

Training on Breakout goes OK, but memory usage exceeds 25 GB after 4 hours of training on 16 CPU cores. I wonder if it's related to sharing memory between processes.

I'm running Python 3.5 on Scientific Linux.
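For reference, here's a minimal sketch of one way to track the growth (it assumes `psutil` is installed; neither the script nor psutil is part of this repo): poll the resident set size of the main training process plus all of its workers.

```python
# Minimal sketch (assumes psutil, not part of this repo): poll the
# resident set size of the main training process and all of its workers.
import sys
import time
import psutil

def watch_rss(pid, interval=60):
    main = psutil.Process(pid)
    while True:
        total = 0
        procs = [main] + main.children(recursive=True)
        for p in procs:
            try:
                total += p.memory_info().rss
            except psutil.NoSuchProcess:
                pass  # a worker exited between listing and reading
        print('total RSS: %.1f MB over %d processes' % (total / 2**20, len(procs)))
        time.sleep(interval)

if __name__ == '__main__':
    watch_rss(int(sys.argv[1]))  # pass the PID of the main training process
```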

ikostrikov commented 7 years ago

Hmm. At least there isn't one that I can immediately find. How rapidly does the memory grow?

It might make sense to ask this question on discuss.pytorch.org.

scientist1642 commented 7 years ago

OK, thanks. I also couldn't find anything wrong with the code, so I asked on the discuss forum. You don't have a memory problem when you run it for hours, right?

ikostrikov commented 7 years ago

I ran it on a machine with a large amount of RAM, so I didn't even notice the problem :(

Any luck identifying the leak? I've been trying to find it for days.

ethancaballero commented 7 years ago

Were you using the pip install version or a self-compiled version? Has anyone A/B tested the pip version vs. the self-compiled version, as suggested in https://discuss.pytorch.org/t/memory-usage-of-a-python-process-increases-slowly/1355/2 ?

ikostrikov commented 7 years ago

I have tried both the conda and pip versions. Both of them have this problem.

scientist1642 commented 7 years ago

@ikostrikov same here, it's taking quite a while. Haven't found it yet. I hadn't had memory problems in Python before, so at least I'm learning something :)

One note: it doesn't seem related to multiprocessing, as I initially thought. I ran without processes and removed the LSTM cell (feedforward only), and the issue is still there. @ethancaballero I also tried installing the binaries both ways; I haven't tried a self-compiled version yet.
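One way to narrow this down (a hedged sketch with a hypothetical helper, not code from this repo): count the live torch tensors via the garbage collector and diff the count across training iterations; a steadily growing delta means tensors are being retained somewhere.

```python
# Sketch of a leak-localizing helper (hypothetical, not from this repo):
# count live torch tensors tracked by the garbage collector.
import gc
import torch

def count_live_tensors():
    count = 0
    for obj in gc.get_objects():
        try:
            if torch.is_tensor(obj):
                count += 1
        except Exception:
            pass  # some gc-tracked objects raise on inspection
    return count

# Usage: snapshot before and after a step and compare.
# before = count_live_tensors()
# run_one_training_step()
# print('tensors retained this step:', count_live_tensors() - before)
```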

ethancaballero commented 7 years ago

The self-compiled version is supposedly more problematic, so maybe just stick with the pip version for now.

ethancaballero commented 7 years ago

Keep ablating it until it looks like the example PyTorch actor_critic implementation, to see which component is causing the leak: https://github.com/pytorch/examples/blob/master/reinforcement_learning/actor_critic.py (a rough sketch of that update follows below).
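For reference, a rough paraphrase of the update step that example boils down to (placeholder names and present-day API, not the example's actual code): log-probs and values are saved while acting, then one backward pass is run over the discounted returns.

```python
# Rough paraphrase of the example's update (placeholder names, not the
# actual actor_critic.py code). log_probs and values are 0-dim tensors
# saved while acting; rewards is a list of floats.
import torch

def finish_episode(optimizer, log_probs, values, rewards, gamma=0.99):
    # compute discounted returns, back to front
    R, returns = 0.0, []
    for r in reversed(rewards):
        R = r + gamma * R
        returns.insert(0, R)

    policy_loss, value_loss = 0.0, 0.0
    for log_prob, value, R in zip(log_probs, values, returns):
        advantage = R - value.item()       # .item() keeps the critic out of the actor loss
        policy_loss = policy_loss - log_prob * advantage
        value_loss = value_loss + (value - R) ** 2

    optimizer.zero_grad()
    (policy_loss + value_loss).backward()  # the graph is freed once backward runs
    optimizer.step()
```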

ethancaballero commented 7 years ago

Hmm, the example actor_critic.py implementation also suffers from a memory leak.

However, the memory leak in both it and this pytorch-a3c repo is reduced by ~10x if you upgrade PyTorch to the most recent nightly build (version '0.1.10+2fd4d08'); I just tested it. Uninstall the previous PyTorch and then run this command to get the nightly build that reduces the leak: pip3 install git+https://github.com/pytorch/pytorch

I think the main fix in the recent nightly build is this commit: https://github.com/pytorch/pytorch/commit/f531d98341d6c49f859ba21496f446c3189cb29d, made in response to this issue: https://discuss.pytorch.org/t/storing-torch-tensor-for-dqn-memory-issue/916/15

You might want to add a note to the README saying to install version '0.1.10+2fd4d08' or later.
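For context, the user-side pattern discussed in that thread is that storing a network's raw output keeps its whole autograd graph alive. A hedged illustration (hypothetical buffer, present-day `.detach()` API, not code from this repo):

```python
# Illustration of the graph-retention pattern from the linked thread
# (hypothetical buffer, not code from this repo).
import torch

replay = []

def remember(value):
    # Risky: appending the raw output keeps its entire autograd graph alive.
    # replay.append(value)

    # Safer: store a graph-free copy instead.
    replay.append(value.detach().clone())
```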

ikostrikov commented 7 years ago

Thanks! This one definitely helps.

ikostrikov commented 7 years ago

Ok. I have been running it for several hours. No signs of severe leaks anymore. Closing the issue. Thanks a lot!

scientist1642 commented 7 years ago

Great, thanks! Works for me as well.

xmfbit commented 7 years ago

@scientist1642 Hi, which version have you used? My version is 0.1.12+0025e1c and the memory leak problem is still there.

scientist1642 commented 7 years ago

@xmfbit I worked on it before June, and several releases I installed after 0.1.10+2fd4d08 worked for me. Try a slightly older version (like 0.1.11); maybe a bug crept back into PyTorch.
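For anyone comparing builds, you can confirm which version is actually active in the environment from inside Python (`torch.__version__` is a standard attribute):

```python
# Print the PyTorch build active in the current environment.
import torch
print(torch.__version__)  # e.g. '0.1.12+0025e1c'
```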