pseudo-rnd-thoughts opened this issue 1 year ago
Thanks for the paper, it is really cool and useful.

On page 22 of the paper, it says:

Is the QDagger loss equal to the actor loss + critic loss + a distillation loss for the actor policy (but not a distillation loss for the critic), for a given sample from the replay buffer? If so, which critic are you using to train the actor in the offline training stage? It would seem that if you use the student critic, you start with a "bad" critic that might mess up the agent, rather than using the teacher's critic. This doesn't seem to be specified anywhere.
Finally, thank you for open-sourcing your code; however, I can't see the code for TD3 -> D4PG. Am I missing it, or has that not been open-sourced?

Reply:

Yeah, the code for TD3 -> D4PG is not open-sourced (it was written in acme, which depended on some internal infra).

The QDagger loss for D4PG is indeed applied only to the actor. The critic is trained with on-policy samples from the actor, so the hope is that it would catch up.
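To make the loss split concrete, here is a minimal, hypothetical sketch in PyTorch (a DDPG/TD3-style deterministic actor standing in for D4PG, whose distributional critic is omitted for brevity): the critic is trained with a plain TD loss and no distillation term, while the actor loss adds an MSE distillation term toward the frozen teacher's actions, weighted by a coefficient `distill_coef`. All names, the MSE form of the distillation loss, and the fixed coefficient are illustrative assumptions on my part, not the paper's actual implementation.

```python
# Hypothetical sketch of a QDagger-style update with a deterministic actor.
# The teacher actor is frozen; only the student's actor loss gets a distillation term.
import torch
import torch.nn as nn
import torch.nn.functional as F

obs_dim, act_dim = 8, 2

def mlp(in_dim, out_dim):
    return nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, out_dim))

student_actor = mlp(obs_dim, act_dim)
student_critic = mlp(obs_dim + act_dim, 1)
teacher_actor = mlp(obs_dim, act_dim)      # in practice: loaded, frozen teacher policy (e.g. the TD3 actor)
target_critic = mlp(obs_dim + act_dim, 1)  # target network for TD bootstrapping

actor_opt = torch.optim.Adam(student_actor.parameters(), lr=3e-4)
critic_opt = torch.optim.Adam(student_critic.parameters(), lr=3e-4)

def update(batch, distill_coef, gamma=0.99):
    # batch: float tensors (obs, act, rew, next_obs, done) sampled from the replay buffer
    obs, act, rew, next_obs, done = batch

    # Critic loss: plain TD error, no distillation term
    # (the critic is trained on the student's own data, as in the reply above).
    with torch.no_grad():
        next_act = student_actor(next_obs)
        target_q = rew + gamma * (1.0 - done) * target_critic(
            torch.cat([next_obs, next_act], dim=-1)).squeeze(-1)
    q = student_critic(torch.cat([obs, act], dim=-1)).squeeze(-1)
    critic_loss = F.mse_loss(q, target_q)

    # Actor loss: usual deterministic policy-gradient term plus a distillation
    # term pulling the student's actions toward the frozen teacher's actions.
    pi = student_actor(obs)
    pg_loss = -student_critic(torch.cat([obs, pi], dim=-1)).mean()
    with torch.no_grad():
        teacher_act = teacher_actor(obs)
    distill_loss = F.mse_loss(pi, teacher_act)
    actor_loss = pg_loss + distill_coef * distill_loss

    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
    return critic_loss.item(), actor_loss.item()
```

As far as I understand the paper, the distillation weight is decayed as the student's performance approaches the teacher's; a fixed `distill_coef` is used above only to keep the sketch short.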