Understanding the policy limitations in SAC

I greatly admire the SAC algorithm you created. I have a guess about SAC's policy, and I would like you to confirm it:

We assume that the Q-function defines a Boltzmann distribution over actions, i.e. the target policy is proportional to exp(Q(s, a)), but it seems difficult to implement a policy that follows this distribution exactly. Therefore, in the code the policy is actually modeled with the most commonly used choice, a Gaussian distribution. However, when there is more than one good action in the same state, the Q-function is multimodal, and a single Gaussian cannot represent it: it tends to become flat and is therefore a weak approximation. This is described in the article https://bair.berkeley.edu/blog/2017/10/06/soft-q-learning/
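To make the concern concrete, here is a minimal NumPy sketch on a hypothetical 1-D toy problem; it is not code from the SAC repository. The toy Q-function, the action grid, and the helper names (`log_normal`, `reverse_kl`) are my own assumptions, and the temperature is fixed to 1. It fits a single Gaussian to a bimodal Boltzmann density in two ways: by moment matching, and by a brute-force grid search over the reverse KL divergence, which as far as I understand is the form of the policy objective in the SAC paper.

```python
import numpy as np

# Hypothetical 1-D action grid; two equally good actions near a = -2 and a = +2.
a = np.linspace(-6.0, 6.0, 1201)
da = a[1] - a[0]


def log_normal(x, mu, sigma):
    """Log-density of a univariate Gaussian N(mu, sigma^2)."""
    return -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma) - 0.5 * np.log(2.0 * np.pi)


# Toy Q-function (an assumption for illustration): chosen so that exp(Q)
# is an equal-weight mixture of two Gaussians, i.e. clearly bimodal.
q_values = np.logaddexp(log_normal(a, -2.0, 0.5), log_normal(a, 2.0, 0.5)) + np.log(0.5)

# Boltzmann density p(a) proportional to exp(Q(a)), normalised on the grid.
log_p = q_values - np.log(np.sum(np.exp(q_values)) * da)
p = np.exp(log_p)

# Fit 1: moment matching. The Gaussian takes the mean and variance of p.
mean_mm = np.sum(a * p) * da
std_mm = np.sqrt(np.sum((a - mean_mm) ** 2 * p) * da)
# Boltzmann density at the fitted Gaussian's most likely action.
p_at_mean = np.exp(np.interp(mean_mm, a, log_p))


def reverse_kl(mu, sigma):
    """Riemann-sum estimate of KL(N(mu, sigma^2) || p) on the grid."""
    log_q = log_normal(a, mu, sigma)
    return np.sum(np.exp(log_q) * (log_q - log_p)) * da


# Fit 2: brute-force search for the Gaussian with the smallest reverse KL.
candidates = [
    (reverse_kl(mu, sigma), mu, sigma)
    for mu in np.linspace(-3.0, 3.0, 61)
    for sigma in np.linspace(0.2, 3.0, 57)
]
kl_rev, mu_rev, std_rev = min(candidates)

print("moment matching: mean", mean_mm, "std", std_mm)
print("Boltzmann density at that mean vs. at its peak:", p_at_mean, p.max())
print("reverse KL fit:  mean", mu_rev, "std", std_rev, "KL", kl_rev)
```

In this toy setting the moment-matched Gaussian is wide and centred between the two good actions, where the Boltzmann density is very low, while the reverse-KL fit settles on one of the two modes and ignores the other. Either way, a single Gaussian cannot represent both good actions at once, which is the limitation I am asking about.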

Is my understanding correct that "it is difficult to implement a policy that follows a Boltzmann distribution"? Which distributions have been found to perform better than the Gaussian distribution? I would like to hear your opinion.

Looking forward to your reply!
