Closed 4thfever closed 3 years ago
Hi,
Thanks for your interest.
As in most RL algorithms (especially Soft Actor-Critic), our agent basically learns stochastic policy (i.e. learn gaussian policy using sample()) in training phase, whereas it uses deterministic policy (i.e. decide actions only by gaussian mean using mean()) in evaluation phase.
This kind of strategy is broadly used in RL algorithms due to the stability of evaluation performance.
Bests,
Thanks!
Hi,
Thanks for your fantastic job and sharing.
I am wondering what is the meaning under the actor's sample() and mean()? I read your paper but didn't find any explanation.
Thanks in advance.