AlphaFrank opened 5 years ago
Hi, thanks for your question. The deterministic mode you are referring to is a heuristic and does not correspond to any optimal policy, but it can sometimes yield a higher return when evaluated on the maximum-return objective. In the case of a Gaussian policy, we typically use the mean action. In the case of SQL, you can, for example, freeze the input noise to a value that has high probability (e.g. the zero vector if the input noise is Gaussian with zero mean). There are no guarantees that the policy will work well, but I think that is the most sensible choice.
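A minimal sketch of the two heuristics described above, in plain Python. The function names and the `toy_sampler` network are hypothetical stand-ins, not softlearning's API; the sketch assumes the SQL sampler draws its input noise from a zero-mean unit Gaussian, so the zero vector is the highest-density noise value.

```python
import math
import random
from typing import Callable, List, Optional

Vector = List[float]


def gaussian_policy_action(mean: Vector, log_std: Vector,
                           deterministic: bool = False,
                           rng: Optional[random.Random] = None) -> Vector:
    """Gaussian policy: sample mean + std * eps, or return the mean in deterministic mode."""
    if deterministic:
        # Heuristic deterministic mode: use the mean action.
        return list(mean)
    rng = rng or random.Random()
    return [m + math.exp(ls) * rng.gauss(0.0, 1.0) for m, ls in zip(mean, log_std)]


def sql_sampler_action(sampler: Callable[[Vector, Vector], Vector],
                       state: Vector, noise_dim: int,
                       deterministic: bool = False,
                       rng: Optional[random.Random] = None) -> Vector:
    """Noise-input sampler a = f(s, xi), with xi ~ N(0, I) (assumed)."""
    if deterministic:
        # Freeze the input noise at the mode of N(0, I): the zero vector.
        noise = [0.0] * noise_dim
    else:
        rng = rng or random.Random()
        noise = [rng.gauss(0.0, 1.0) for _ in range(noise_dim)]
    return sampler(state, noise)


def toy_sampler(state: Vector, noise: Vector) -> Vector:
    # Hypothetical stand-in for a trained sampling network
    # (assumes noise_dim equals the state dimension for this toy).
    return [math.tanh(0.5 * s + n) for s, n in zip(state, noise)]


print(gaussian_policy_action([0.3, -1.0], [0.0, 0.0], deterministic=True))  # prints [0.3, -1.0]
print(sql_sampler_action(toy_sampler, [0.0, 2.0], noise_dim=2, deterministic=True))
```

With the noise frozen, the sampler becomes a deterministic function of the state, which is the SQL analogue of taking the Gaussian mean; as noted above, neither choice is guaranteed to perform well.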
Thanks for the reply. I do have another question. It looks like a Gaussian policy is used here for SQL, instead of the stochastic neural network from the original SQL paper. Does switching to a Gaussian policy improve performance?
Not quite sure what you mean. SQL works only with expressive policies, such as the sampling network trained with amortized SVGD. If you use a more restrictive class of policies, like Gaussian policies, then the algorithm actually corresponds to soft actor-critic, which in general has better performance on standard benchmarks.
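A rough sketch of why restricting the policy class changes the algorithm, written in the notation of the soft actor-critic papers (temperature $\alpha$, soft Q-function $Q$, tractable policy family $\Pi$, reparameterized action $a_t = f_\phi(\epsilon_t; s_t)$); these symbols are assumptions and do not come from this thread. SAC projects the Boltzmann distribution of the soft Q-function onto $\Pi$, and the resulting loss needs $\log \pi_\phi$ in closed form. A Gaussian provides that, while SQL's noise-input sampler does not, which is why SQL fits its sampler with amortized SVGD instead.

```latex
\begin{aligned}
\pi_{\mathrm{new}} &= \arg\min_{\pi' \in \Pi} D_{\mathrm{KL}}\left( \pi'(\cdot \mid s_t) \,\middle\|\, \frac{\exp\left(\frac{1}{\alpha} Q^{\pi_{\mathrm{old}}}(s_t, \cdot)\right)}{Z^{\pi_{\mathrm{old}}}(s_t)} \right) \\
J_\pi(\phi) &= \mathbb{E}_{s_t \sim \mathcal{D},\, \epsilon_t \sim \mathcal{N}(0, I)}\left[ \alpha \log \pi_\phi\big(f_\phi(\epsilon_t; s_t) \mid s_t\big) - Q_\theta\big(s_t, f_\phi(\epsilon_t; s_t)\big) \right]
\end{aligned}
```

Multiplying the KL divergence by $\alpha$ and dropping the $\log Z$ term, which does not depend on $\phi$, gives the second line.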
Sorry, I meant that in the code you provided, a Gaussian policy is used for SQL. I added a print statement to check the type of the policy, and it printed a Gaussian policy. Maybe I misunderstood something here? It would be great if I could get clarification on which policy SQL uses in this implementation :). Btw, thanks for making this open source!
Can you point me to the code?
I think @AlphaFrank is right: SQL currently uses the default `GaussianPolicy`. This is my bad. The policy should be changed to the `StochasticNNPolicy` that we used to have in our old repo. Interestingly, things still work pretty well even with `GaussianPolicy` :smile: For the results with the current setup (i.e. using `GaussianPolicy`), see: https://github.com/rail-berkeley/softlearning/pull/23#issuecomment-459182354.
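For contrast with a Gaussian policy, here is a minimal sketch of what a noise-input policy in the spirit of the `StochasticNNPolicy` mentioned above might look like. It is not softlearning's class or API: the name `TinyStochasticNNPolicy`, the layer sizes, and the random initialization are hypothetical. The structural point is that a Gaussian policy outputs distribution parameters (mean and log-std) with a closed-form log-density, whereas this network takes the state concatenated with noise and outputs an action directly, so its density is not available in closed form.

```python
import math
import random
from typing import List, Sequence


class TinyStochasticNNPolicy:
    """Hypothetical sketch: a = tanh(W2 @ relu(W1 @ [s; xi] + b1) + b2), xi ~ N(0, I)."""

    def __init__(self, state_dim: int, action_dim: int, noise_dim: int,
                 hidden: int = 16, seed: int = 0) -> None:
        rng = random.Random(seed)
        in_dim = state_dim + noise_dim
        self.noise_dim = noise_dim
        # Random weights stand in for parameters that SQL would learn.
        self.w1 = [[rng.gauss(0.0, 1.0 / math.sqrt(in_dim)) for _ in range(in_dim)]
                   for _ in range(hidden)]
        self.b1 = [0.0] * hidden
        self.w2 = [[rng.gauss(0.0, 1.0 / math.sqrt(hidden)) for _ in range(hidden)]
                   for _ in range(action_dim)]
        self.b2 = [0.0] * action_dim

    def action(self, state: Sequence[float], noise: Sequence[float]) -> List[float]:
        # The noise is an *input*, concatenated with the state.
        x = list(state) + list(noise)
        h = [max(0.0, sum(w * v for w, v in zip(row, x)) + b)
             for row, b in zip(self.w1, self.b1)]
        # tanh keeps each action dimension in [-1, 1].
        return [math.tanh(sum(w * v for w, v in zip(row, h)) + b)
                for row, b in zip(self.w2, self.b2)]

    def sample(self, state: Sequence[float], rng: random.Random) -> List[float]:
        # Draw fresh noise for every call to get a stochastic action.
        noise = [rng.gauss(0.0, 1.0) for _ in range(self.noise_dim)]
        return self.action(state, noise)


policy = TinyStochasticNNPolicy(state_dim=3, action_dim=2, noise_dim=2)
rng = random.Random(1)
state = [0.1, -0.2, 0.3]
print(policy.sample(state, rng))
print(policy.sample(state, rng))
```

Because no log-density is available for such a sampler, a likelihood-based objective like SAC's cannot be applied to it directly; SQL instead trains it with amortized SVGD.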
Hi, I read the original Soft Q-Learning paper, where the policy in SQL is approximated by a neural network that takes the state and random noise as input and outputs an action. I am wondering: what is the deterministic action mode for SQL? Thanks!