twni2016 / Meta-SAC

Auto-tune the Entropy Temperature of Soft Actor-Critic via Metagradient - 7th ICML AutoML Workshop, 2020
MIT License
30 stars 4 forks

Several Issues of the Meta-SAC Implementation #2

Closed: Cogito2012 closed this issue 4 years ago

Cogito2012 commented 4 years ago

@twni2016 @yufeiwang63 Thanks for your great work! I am new to reinforcement learning and have some questions about your proposed meta-SAC.

Thanks again for your work! Looking forward to your reply.

yufeiwang63 commented 4 years ago

Hi, thanks for your interest in our work. As per your questions:

1) Value network: we do not use the value network in the final version of Meta-SAC. It is only used in one of the ablation studies, where we use the classic Q-value in the meta-objective. In terms of code usage, the value network is only used when the 'meta-Q' flag is set to true. Please check our paper for more details on the meta-Q ablation.

2) alpha_embedding: we do not actually use alpha_embedding in the final version of Meta-SAC. It is a legacy of early experiments. If I recall correctly, adding it did not improve performance, which is why we eventually discarded it.

3) Two Q losses: this is really just a matter of implementation. In the original implementation of SAC, the critic holds two inner Q networks, while in our code we explicitly maintain two critics, each with one Q network. That is why the losses are used in a different way, but they amount to the same update.
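
To make the third point concrete, here is a minimal PyTorch sketch (not the repository's actual code; class and function names are illustrative) contrasting the two critic layouts: one critic module holding two inner Q networks trained with a single summed loss, versus two separate critics, each with one Q network and its own loss. Both regress toward the same clipped double-Q target, so the resulting updates are equivalent.

```python
# Illustrative sketch only, assuming standard SAC critic training in PyTorch.
import torch
import torch.nn as nn
import torch.nn.functional as F

class QNetwork(nn.Module):
    """One Q(s, a) head: a small MLP over the concatenated state-action pair."""
    def __init__(self, obs_dim, act_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1))

# Layout A (original SAC implementation): one critic module with two inner
# Q networks, trained with a single summed loss.
class TwinCritic(nn.Module):
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.q1 = QNetwork(obs_dim, act_dim)
        self.q2 = QNetwork(obs_dim, act_dim)

    def loss(self, obs, act, target):
        return (F.mse_loss(self.q1(obs, act), target)
                + F.mse_loss(self.q2(obs, act), target))

# Layout B (as described above for this repo): two separate critics, each
# with one Q network and its own loss/optimizer step.
def separate_critic_losses(critic1, critic2, obs, act, target):
    loss1 = F.mse_loss(critic1(obs, act), target)
    loss2 = F.mse_loss(critic2(obs, act), target)
    return loss1, loss2

# In both layouts the regression target is the same clipped double-Q target:
#   target = r + gamma * (min(Q1', Q2')(s', a') - alpha * log pi(a'|s'))
# so summing the two losses or stepping them separately yields the same update.
```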