I greatly admire the SAC algorithm you created.
I have a guess about SAC's policy, and I would like your confirmation:
We assume that the policy obeys a Boltzmann distribution over the Q function, but it seems difficult to implement a policy that actually obeys a Boltzmann distribution. Therefore, in the code, the policy is given the most commonly used Gaussian distribution.
However, when there is more than one good action in the same state, the Q function is multimodal, while the Gaussian distribution tends to become flat and therefore weak, as described in this article: https://bair.berkeley.edu/blog/2017/10/06/soft-q-learning/
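To make my question concrete, here is a small toy sketch (my own hypothetical 1-D example, not from your code) comparing a Boltzmann distribution over a discretized action grid with a moment-matched Gaussian when Q has two equally good actions:

```python
import numpy as np

# Hypothetical bimodal Q(s, a): two equally good actions at a = -1 and a = +1.
actions = np.linspace(-2.0, 2.0, 401)
q = np.exp(-(actions - 1.0) ** 2 / 0.1) + np.exp(-(actions + 1.0) ** 2 / 0.1)

# Boltzmann (softmax) policy over the discretized actions: keeps both modes.
temperature = 0.1
logits = q / temperature
boltzmann = np.exp(logits - logits.max())
boltzmann /= boltzmann.sum()

# Moment-matched Gaussian policy: a single mean/std cannot represent two
# modes, so its mass spreads out between them and the density flattens.
mean = np.sum(boltzmann * actions)
std = np.sqrt(np.sum(boltzmann * (actions - mean) ** 2))
gaussian = np.exp(-(actions - mean) ** 2 / (2 * std ** 2))
gaussian /= gaussian.sum()

# At a good action (a = +1), the Boltzmann policy puts far more probability
# mass than the flattened Gaussian does.
mode_idx = np.argmin(np.abs(actions - 1.0))
print(boltzmann[mode_idx], gaussian[mode_idx])
```

In this toy case the Gaussian mean lands between the two modes, which is exactly the "flat and weak" behavior I mean.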
Is my understanding correct that "it is difficult to implement a policy that obeys a Boltzmann distribution"? Which distributions have shown better performance than the Gaussian distribution? I would like to ask for your opinion.
Sorry, I now see that you wrote about GMM, hierarchical policies, and latent_space_policy. I still need to spend some time mastering them; they may answer my question.
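If I understand the GMM idea correctly, a mixture policy can cover both good actions at once. A minimal sketch of sampling from such a mixture (the weights, means, and stds here are hypothetical placeholders for what a network head would output):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical mixture-of-Gaussians policy for one state: in a real agent,
# these parameters would be produced by the policy network.
weights = np.array([0.5, 0.5])   # mixture weights (softmax of logits)
means = np.array([-1.0, 1.0])    # one component per mode of Q
stds = np.array([0.2, 0.2])

def sample_action(n):
    # Sampling: first pick a component, then sample from its Gaussian.
    comps = rng.choice(len(weights), size=n, p=weights)
    return rng.normal(means[comps], stds[comps])

samples = sample_action(10_000)

# Unlike a single Gaussian, roughly half the samples land near each good
# action instead of piling up in the poor region between them.
near_left = np.mean(np.abs(samples + 1.0) < 0.6)
near_right = np.mean(np.abs(samples - 1.0) < 0.6)
print(near_left, near_right)
```

Is this roughly the motivation behind the GMM policy variant?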
Looking forward to your reply!