google-research / reincarnating_rl

[NeurIPS 2022] Open source code for reusing prior computational work in RL.
https://agarwl.github.io/reincarnating_rl
Apache License 2.0

Clarification on actor-critic training #3

Open pseudo-rnd-thoughts opened 1 year ago

pseudo-rnd-thoughts commented 1 year ago

Thanks for the paper, it is really cool and useful

On page 22 of the paper, it says

For reincarnating D4PG using QDagger, we minimize a distillation loss between the D4PG’s actor policy and the teacher policy from TD3 jointly with the actor-critic losses.

Is the QDagger loss equal to the actor loss + critic loss + the distillation loss for the actor policy (but no distillation loss for the critic) for a given sample from the replay buffer? If so, which critic are you using to train the actor during the offline training stage? It seems that if you use the student critic, it will be a "bad" critic at the beginning that might mess up the agent, whereas the teacher's critic would not. This doesn't seem to be specified anywhere.

Finally, thank you for open-sourcing your code; however, I can't find the code for the TD3 -> D4PG transfer. Am I missing it, or has it not been open-sourced?

agarwl commented 7 months ago

Yeah, the code for TD3 -> D4PG is not open-sourced (it was written in Acme, which depended on some internal infra).

The QDagger loss for D4PG is indeed applied only to the actor. The critic is trained with on-policy samples from the actor, so the hope is that it would catch up.
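
For concreteness, here is a minimal sketch (not the authors' actual code) of how such an update could look, assuming a scalar critic for simplicity (D4PG's critic is actually distributional); names such as `actor_apply`, `critic_apply`, and `distill_weight` are illustrative assumptions, not identifiers from this repo:

```python
import jax.numpy as jnp


def actor_loss(actor_params, critic_params, states, teacher_actions,
               actor_apply, critic_apply, distill_weight):
  """DPG-style actor loss plus a QDagger-style distillation term."""
  actions = actor_apply(actor_params, states)
  # Usual deterministic policy-gradient objective: maximize the critic's value.
  dpg_loss = -jnp.mean(critic_apply(critic_params, states, actions))
  # Distill the student's actions toward the frozen TD3 teacher's actions
  # on the same states.
  distill_loss = jnp.mean((actions - teacher_actions) ** 2)
  return dpg_loss + distill_weight * distill_loss


def critic_loss(critic_params, states, actions, td_targets, critic_apply):
  """Plain TD regression for the critic; no distillation term here."""
  q_values = critic_apply(critic_params, states, actions)
  return jnp.mean((q_values - td_targets) ** 2)
```

As a usage note, one would typically anneal `distill_weight` over training so the student relies less on the teacher as its own critic improves.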