I found the root of the issue: when I connect `policy_forward` to the collector (`train_kwargs["policy"] = self.policy_forward`), the collector spawns another process for every iteration (step) and runs the policy there. Since my replay buffer and `policy_forward` then live in two separate processes, I get the error where the advantage module's `gamma` parameter is zero.
Environment
- OS: Windows 11
- Python: CPython 3.10.14
- TorchRL version: 0.5.0
- PyTorch version: 2.4.1+cu124
- Gym environment: a custom subclass of `EnvBase` (from `torchrl.envs`)
The project I'm working on is relatively complex, so I only mention the parts of the code that I know are related to the bug described below. Here's the definition of the actor, value (critic), advantage, and loss modules:
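For context, a minimal sketch of how such actor, critic, advantage, and loss modules are commonly wired together in TorchRL (the network sizes, `n_actions`, and the `gamma`/`lmbda` values are assumptions, not this project's actual definitions):

```python
from torch import nn
from tensordict.nn import TensorDictModule, NormalParamExtractor
from torchrl.modules import ProbabilisticActor, TanhNormal, ValueOperator
from torchrl.objectives import ClipPPOLoss
from torchrl.objectives.value import GAE

n_actions = 4  # hypothetical action dimension

# Actor: maps "observation" -> loc/scale of a TanhNormal and samples "action"
policy_net = nn.Sequential(
    nn.LazyLinear(64), nn.Tanh(),
    nn.LazyLinear(2 * n_actions), NormalParamExtractor(),
)
actor_module = ProbabilisticActor(
    module=TensorDictModule(policy_net, in_keys=["observation"], out_keys=["loc", "scale"]),
    in_keys=["loc", "scale"],
    distribution_class=TanhNormal,
    return_log_prob=True,  # writes "sample_log_prob" into the tensordict
)

# Critic: maps "observation" -> "state_value"
value_module = ValueOperator(
    nn.Sequential(nn.LazyLinear(64), nn.Tanh(), nn.LazyLinear(1)),
    in_keys=["observation"],
)

# Advantage (GAE) and PPO loss
advantage_module = GAE(gamma=0.99, lmbda=0.95, value_network=value_module)
loss_module = ClipPPOLoss(actor_network=actor_module, critic_network=value_module)
```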
Training loop
My training loop fetches the batched data from a `MultiSyncDataCollector`, adds it to a replay buffer with `LazyTensorStorage` storage, and then samples from the buffer and passes the sample to the `_optimize_policy` function (sketched below):
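A minimal sketch of that loop, assuming the collector, replay buffer, modules, and optimizer are built elsewhere (the attribute names and the sample size are assumptions based on the description above):

```python
class Trainer:
    # Hypothetical skeleton of the loop described above; self.collector,
    # self.replay_buffer, self.actor_module, self.advantage_module,
    # self.loss_module and self.optimizer are assumed to exist.

    def train(self):
        for batch in self.collector:                      # MultiSyncDataCollector
            self.replay_buffer.extend(batch.reshape(-1))  # LazyTensorStorage-backed buffer
            sample = self.replay_buffer.sample(256)       # assumed sample size
            self._optimize_policy(sample)

    def _optimize_policy(self, sample):
        sample = self.actor_module(sample)    # <- the first error was raised here
        self.advantage_module(sample)         # writes "advantage" / "value_target"
        loss_vals = self.loss_module(sample)
        loss = (loss_vals["loss_objective"]
                + loss_vals["loss_critic"]
                + loss_vals["loss_entropy"])
        loss.backward()
        self.optimizer.step()
        self.optimizer.zero_grad()
```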
In the code above, I got the error below when it called `self.actor_module(sample)`:

So I added `sample["sample_log_prob"] = sample["sample_log_prob"].detach()` to detach `sample_log_prob` from the computation graph, and the issue was solved.

At this stage the model seems to converge, as the objective and critic losses are decreasing:

Figure 1 - Objective/policy loss (exponential moving average, interval 100):
Figure 2 - Critic loss:
The main issue
At this point everything apparently works, but the main issue occurs when I connect the actor (policy) module to the collector, so that data is collected with the current policy rather than with random actions (see the sketch below):
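A sketch of that connection, assuming the collector is built from a `train_kwargs` dict as mentioned in the follow-up comment at the top (the environment factory and worker count are assumptions):

```python
from torchrl.collectors import MultiSyncDataCollector

# Pass the current policy to the collector instead of leaving it to act randomly.
train_kwargs["policy"] = self.policy_forward

collector = MultiSyncDataCollector(
    create_env_fn=[make_env] * num_workers,  # make_env / num_workers are assumptions
    **train_kwargs,                          # includes policy, frames_per_batch, etc.
)
```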
And when I run it, I get the error below (thrown inside `self.advantage_module(sample)`):

I found that in `torchrl\objectives\value\functional.py`, inside the function `vec_generalized_advantage_estimate` at line 307, the `value` variable is a 1-D vector of zeros with the length of the sample batch size, whereas without connecting the actor_module it is the correct matrix of multiplied gammas and lambdas (with one column). I also found that when the collector uses the actor module, the advantage module's buffers for `gamma` and `lmbda` are reset to 0.0 (inside the training loop, `print("Gamma : ", self.advantage_module.get_buffer("gamma"))` outputs `tensor(0.)`).

So I added these two lines after the loss module definition:
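The two lines themselves aren't shown above; a plausible reconstruction that simply writes the desired values back into the advantage module's buffers would look like this (the 0.99/0.95 values are assumptions, not the project's actual hyperparameters):

```python
# Restore the GAE buffers that end up zeroed once the collector uses the policy.
self.advantage_module.get_buffer("gamma").fill_(0.99)
self.advantage_module.get_buffer("lmbda").fill_(0.95)
```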
By adding these two lines of code, the previous error vanished, but a new issue appeared:
That clearly implies that the key "loss_critic" does not exist in the sample tensordict object (but before I connect the actor module to the collector, it is computed properly).
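A quick way to check which loss keys actually come back from the loss module (not from the original code, just a diagnostic sketch):

```python
# Inspect the keys returned by the loss module for one sampled batch.
loss_vals = self.loss_module(sample)
print(sorted(loss_vals.keys()))  # expected to include "loss_objective", "loss_critic", "loss_entropy"
```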