kashif / firedup

Clone of OpenAI's Spinning Up in PyTorch
MIT License

pi = policy.rsample()? #9

Closed zhan0903 closed 4 years ago

zhan0903 commented 4 years ago

Hi, why does SAC use pi = policy.rsample() while VPG uses pi = policy.sample()? Thanks.

kashif commented 4 years ago

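@zhan0903 SAC uses the reparameterization trick to optimize the policy, so we need rsample to be able to backpropagate through the sampled action. VPG only needs the log-probability of the action it took, so a plain sample is enough there.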

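DDPG also does this, albeit explicitly: its deterministic policy outputs the action directly, so gradients flow through the action without any sampling step. Hope that helps!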
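To make the difference concrete, here is a minimal sketch using `torch.distributions.Normal`. It is not the firedup code: the parameters `mu` and `log_std`, the helper `make_policy`, the toy objective standing in for a Q-function, and the constant `advantage` are all assumptions chosen for illustration.

```python
import torch
from torch.distributions import Normal

torch.manual_seed(0)

# Hypothetical parameters of a 1-D Gaussian policy (illustration only).
mu = torch.tensor([0.5], requires_grad=True)
log_std = torch.tensor([-0.5], requires_grad=True)


def make_policy():
    # Rebuild the distribution each time so every backward pass gets a fresh graph.
    return Normal(mu, log_std.exp())


policy = make_policy()

# sample() draws under torch.no_grad(), so the action is detached from mu and log_std.
a_sample = policy.sample()
sample_has_grad = a_sample.requires_grad

# rsample() computes mu + std * eps with eps ~ N(0, 1), so the action stays in the graph.
a_rsample = policy.rsample()
rsample_has_grad = a_rsample.requires_grad

# SAC-style pathwise gradient: differentiate a function of the action itself.
# toy_q is a stand-in for Q(s, a), not a real critic.
toy_q = -(a_rsample - 1.0).pow(2).sum()
toy_q.backward()
pathwise_grad_mu = mu.grad.clone()

# VPG-style score-function gradient: the action is treated as a constant and
# the gradient reaches the parameters only through log_prob.
mu.grad = None
log_std.grad = None
advantage = 1.0  # hypothetical advantage estimate
vpg_loss = -(make_policy().log_prob(a_sample) * advantage).sum()
vpg_loss.backward()
score_grad_mu = mu.grad.clone()

print("sample requires_grad: ", sample_has_grad)
print("rsample requires_grad:", rsample_has_grad)
```

Calling `backward()` on an objective built from `a_sample` alone would fail, because that tensor carries no gradient history. That is why an objective that differentiates through the action, as SAC's does, needs `rsample()`.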