pytorch / rl

A modular, primitive-first, python-first PyTorch library for Reinforcement Learning.
https://pytorch.org/rl
MIT License

Long GPU idle times in loss forward pass #1954

Open JulianKu opened 8 months ago

JulianKu commented 8 months ago

I have just implemented an RL agent for a custom environment (wrapped into a TorchRL env). I am trying to reimplement the RAPS algorithm using SAC, and for that I use the SACLoss provided by TorchRL. I mainly followed examples/sac for structuring my code and setting everything up.
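
For reference, the loss setup follows roughly this pattern (a condensed sketch with hypothetical dimensions and network architectures, not my exact code):

```python
# Condensed sketch of the SAC setup (hypothetical sizes/architectures).
import torch
from torch import nn
from tensordict.nn import TensorDictModule, NormalParamExtractor
from torchrl.modules import ProbabilisticActor, TanhNormal, ValueOperator
from torchrl.objectives import SACLoss

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
obs_dim, act_dim = 32, 8  # hypothetical dimensions

# Policy network: emits loc/scale for a TanhNormal action distribution
actor_net = nn.Sequential(
    nn.Linear(obs_dim, 256), nn.ReLU(),
    nn.Linear(256, 2 * act_dim), NormalParamExtractor(),
).to(device)
actor = ProbabilisticActor(
    module=TensorDictModule(
        actor_net, in_keys=["observation"], out_keys=["loc", "scale"]
    ),
    in_keys=["loc", "scale"],
    distribution_class=TanhNormal,
)

# Q-value network: scores (observation, action) pairs
qvalue = ValueOperator(
    module=nn.Sequential(
        nn.Linear(obs_dim + act_dim, 256), nn.ReLU(), nn.Linear(256, 1),
    ).to(device),
    in_keys=["observation", "action"],
)

loss_module = SACLoss(actor_network=actor, qvalue_network=qvalue)
```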

However, while training the agent I observed poor GPU utilization. Profiling showed that most of the time is spent in the SACLoss forward pass. I then used nsys profile to investigate this forward pass further. In the attached screenshot, I have recorded a single representative forward pass through the SACLoss (after some warmup iterations). You can see that the GPU is only utilized for short periods at the start and end of the forward pass, plus a slightly longer period in the middle. Is this behavior expected? I also noticed that the CPU core running the Python process sits at 100% utilization. I am not sure what is causing this, as all my networks are on the GPU and not much else should be running during the loss forward pass, right?
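
For context, a single forward pass can be isolated in the Nsight Systems timeline roughly like this (a hedged sketch, not necessarily how my capture was produced; `loss_module` comes from the setup sketch above and the batch is a random stand-in for a replay-buffer sample):

```python
# Hedged sketch: isolating one SACLoss forward pass for nsys. Launch with e.g.:
#   nsys profile --trace=cuda,nvtx --capture-range=cudaProfilerApi python train.py
import torch
from tensordict import TensorDict

batch_size = 256  # hypothetical
batch = TensorDict({
    "observation": torch.randn(batch_size, obs_dim),
    "action": torch.rand(batch_size, act_dim) * 2 - 1,
    ("next", "observation"): torch.randn(batch_size, obs_dim),
    ("next", "reward"): torch.randn(batch_size, 1),
    ("next", "done"): torch.zeros(batch_size, 1, dtype=torch.bool),
    ("next", "terminated"): torch.zeros(batch_size, 1, dtype=torch.bool),
}, batch_size=[batch_size], device=device)

torch.cuda.profiler.start()                     # opens the nsys capture range
torch.cuda.nvtx.range_push("sac_loss_forward")  # labels the region in the timeline
loss_td = loss_module(batch)                    # the forward pass in question
torch.cuda.nvtx.range_pop()
torch.cuda.profiler.stop()
```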

If all this is not expected, how can I proceed to increase utilization (or first find out what is causing the low utilization)?
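
One avenue I have considered (a hedged sketch, continuing the code above) is torch.profiler, which attributes time per operator on CPU vs. GPU and might show where the host-side bottleneck is:

```python
# Hedged sketch: profile one loss forward pass and sort operators by
# host-side time to spot CPU bottlenecks.
from torch.profiler import profile, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    loss_td = loss_module(batch)  # `batch` from the sketch above
    torch.cuda.synchronize()      # ensure queued kernels are included
print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=15))
```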

Screenshots

Nvidia Nsight Systems Screenshot

Environment:

vmoens commented 8 months ago

I can have a look at that! Thanks for pointing it out, I'll keep you posted.