Long GPU idle times in loss forward pass

I have just implemented an RL agent for a custom environment (wrapped into a TorchRL env). I am trying to reimplement the RAPS algorithm using SAC and for that I have used the SACLoss provided by TorchRL. Here, I mainly stuck to the examples/sac for structuring my code and setting everything up.

However, when training the agent, I see poor GPU utilization. Profiling showed that most of the time is spent in the SACLoss forward pass. I then used nsys profile to investigate this forward pass further. The attached screenshot shows a single representative forward pass through the SACLoss (recorded after some warmup iterations). You can see that the GPU is only utilized for short periods at the start and end of the forward pass, and for a slightly longer period in the middle. Is this behavior expected? I also notice that the CPU process running Python is at 100%. I am not sure what is causing this, as all my networks are on the GPU and there shouldn't be much else running during the loss forward pass, right?
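For reference, here is a minimal sketch of how a single forward pass can be isolated after warmup, using torch.profiler rather than nsys. The small MLP and the helper `profile_one_forward` are hypothetical stand-ins for illustration, not my actual SACLoss setup; the script falls back to CPU if no GPU is available.

```python
import torch
from torch import nn
from torch.profiler import ProfilerActivity, profile, record_function


def profile_one_forward(module, batch, warmup=3):
    """Run a few warmup passes, then profile one forward pass (hypothetical helper)."""
    on_cuda = batch.device.type == "cuda"
    activities = [ProfilerActivity.CPU]
    if on_cuda:
        activities.append(ProfilerActivity.CUDA)

    # Warmup so one-time costs (allocator, kernel selection) are not measured.
    for _ in range(warmup):
        module(batch)
    if on_cuda:
        torch.cuda.synchronize()

    with profile(activities=activities) as prof:
        # Label the region so it shows up as a named row in the profiler table.
        with record_function("loss_forward"):
            out = module(batch)
        if on_cuda:
            # Wait for queued kernels so GPU work falls inside the profiled window.
            torch.cuda.synchronize()
    return out, prof


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Stand-in for the loss module: a small MLP, NOT the real SACLoss.
stand_in = nn.Sequential(nn.Linear(32, 256), nn.ReLU(), nn.Linear(256, 1)).to(device)
batch = torch.randn(256, 32, device=device)

out, prof = profile_one_forward(stand_in, batch)
print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=10))
```

Comparing the CPU and CUDA time columns for the "loss_forward" row should indicate whether the pass is dominated by Python/CPU overhead or by actual GPU work.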

If this is not expected, how can I increase utilization (or, as a first step, find out what is causing the low utilization)?

Screenshots

Nvidia Nsight Systems Screenshot

Environment:

GPU: RTX 3070 Mobile
Python inside Conda Environment
- PyTorch 2.1.0
- torchrl 0.3.0
