Closed: scientist1642 closed this issue 7 years ago
Hmm. At least not one that I can immediately find. How rapidly does the memory grow?
It makes sense to ask this question on discuss.pytorch.
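To answer "how rapidly does the memory grow", a stdlib-only sketch (Unix-only; note `ru_maxrss` is kilobytes on Linux but bytes on macOS) that logs peak RSS growth per iteration. The growing list here is a hypothetical stand-in for whatever the training loop accidentally retains:

```python
import resource

def peak_rss_kb():
    # ru_maxrss is the process's peak resident set size so far
    # (kilobytes on Linux, bytes on macOS); it only ever grows.
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

baseline = peak_rss_kb()
growth = []
leaky = []  # stand-in for whatever the training loop accidentally retains
for step in range(5):
    leaky.extend([0.0] * 1_000_000)  # ~8 MB of pointers per "iteration"
    growth.append(peak_rss_kb() - baseline)
    print(f"step {step}: peak RSS +{growth[-1]} since start")
```

If the per-step delta keeps climbing at a steady rate, the leak scales with iterations rather than with some one-time allocation.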
Ok, thanks. I also couldn't find anything wrong with the code, and I asked on discuss. So you don't have a memory problem when you run it for hours, right?
I ran it on a machine with a large amount of RAM, so I didn't even notice the problem :(
Any luck identifying the leak? I've been trying to find it for days.
Were you using the pip install version or a self-compiled version? Has anyone A/B tested the pip install version against a self-compiled version, as suggested in https://discuss.pytorch.org/t/memory-usage-of-a-python-process-increases-slowly/1355/2 ?
I have tried both conda and pip versions. Both of them have this problem.
@ikostrikov same here, it's taking quite some time. I haven't found it yet. I didn't have memory problems in Python before, so at least I'm learning something :)
One note: it seems unrelated to multiprocessing, contrary to what I initially thought. I don't use processes, and with the LSTM cell replaced by a feedforward network the issue is still there. @ethancaballero I tried installing the binaries both ways; I haven't tried a self-compiled version yet.
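For narrowing down a leak on the Python side, the stdlib `tracemalloc` module can diff two snapshots and point at the allocating source line. One caveat: it only sees allocations made through Python's allocator, not memory allocated inside C extensions like libtorch's tensor storage, so it mainly helps rule the pure-Python side in or out. A minimal sketch with a deliberate leak standing in for the training loop:

```python
import tracemalloc

tracemalloc.start()
before = tracemalloc.take_snapshot()

retained = []  # deliberate leak standing in for whatever the loop keeps alive
for _ in range(3):
    retained.append(bytearray(1_000_000))

after = tracemalloc.take_snapshot()
stats = after.compare_to(before, "lineno")
for stat in stats[:3]:
    print(stat)  # biggest growth first: file, line, size diff, count diff
```

The top entry should name the `bytearray` line with roughly 3 MB of growth; in a real run the top entries point at whichever lines are accumulating objects.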
The self-compiled version is supposedly more problematic, so maybe just stick with the pip version for now.
Keep ablating it until it looks like the example pytorch actor_critic implementation to see which component is causing the leak: https://github.com/pytorch/examples/blob/master/reinforcement_learning/actor_critic.py
Hmm, the example actor_critic.py implementation also suffers from a memory leak.
However, the leak in both it and this pytorch-a3c repo is reduced by ~10x if you upgrade PyTorch to the most recent nightly build (version '0.1.10+2fd4d08'); I just tested it.
Uninstall the previous PyTorch and then run this command to get the nightly build that reduces the leak:
pip3 install git+https://github.com/pytorch/pytorch
I think the main fix that occurred in the recent nightly build is this commit: https://github.com/pytorch/pytorch/commit/f531d98341d6c49f859ba21496f446c3189cb29d in response to this issue: https://discuss.pytorch.org/t/storing-torch-tensor-for-dqn-memory-issue/916/15
You might want to add a note to the README saying to install version '0.1.10+2fd4d08' or later.
Thanks! This one definitely helps.
Ok. I have been running it for several hours. No signs of severe leaks anymore. Closing the issue. Thanks a lot!
Great, thanks! Works for me as well.
@scientist1642 Hi, what version have you used? Mine is 0.1.12+0025e1c and the memory leak problem is still here.
@xmfbit I worked on it before June, and the several releases I installed after 0.1.10+2fd4d08 worked for me. Try a slightly older version (like 0.1.11); maybe there is a bug in PyTorch again.
Training Breakout goes OK, but memory usage exceeds 25 GB after 4 hours of training on 16 CPU cores. I wonder if it's related to sharing memory between processes.
I run Python 3.5 on Scientific Linux.
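One caveat when reading a 25 GB figure from tools like top: RSS counts shared pages once in every process that maps them, so summing RSS across 16 workers that share tensors can badly overstate real usage. A Linux-only sketch (assumes a /proc filesystem; the helper name is just for illustration) for reading a process's RSS:

```python
import os

def rss_mb(pid):
    # Resident set size in MB, parsed from /proc/<pid>/status (Linux only).
    # Caveat: shared pages count toward the RSS of every process mapping
    # them, so summing RSS across workers double-counts shared tensors.
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1]) / 1024.0  # value is in kB
    return 0.0

current = rss_mb(os.getpid())
print(f"this process: {current:.1f} MB resident")
```

Comparing this number summed over all workers against total system memory actually in use (e.g. from /proc/meminfo) helps tell a genuine leak apart from double-counted shared pages.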