rail-berkeley / softlearning

Softlearning is a reinforcement learning framework for training maximum entropy policies in continuous domains. Includes the official implementation of the Soft Actor-Critic algorithm.
https://sites.google.com/view/sac-and-applications

What does it mean to set 'eval_deterministic' = True for SQL? #63

Open AlphaFrank opened 5 years ago

AlphaFrank commented 5 years ago

Hi, I read the original Soft Q-Learning paper. The policy in SQL is approximated by a neural network that takes the state and random noise as input and outputs an action. What is the deterministic action mode for SQL? Thanks!

haarnoja commented 5 years ago

Hi, thanks for your question. The deterministic mode you are referring to is a heuristic. It does not correspond to any optimal policy, but it can sometimes yield a higher return when evaluated on the maximum-return objective. In the case of a Gaussian policy, we typically use the mean action. In the case of SQL, you can, for example, freeze the input noise to a value that has high probability (e.g. the zero vector if the input noise is Gaussian with zero mean). There are no guarantees that the policy will work well, but I think that is the most sensible choice.
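
A minimal NumPy sketch of the two heuristics described above. It is not the softlearning implementation: the function names, the layer sizes, and the random weights standing in for a trained network are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for this example.
STATE_DIM, NOISE_DIM, ACTION_DIM, HIDDEN = 3, 2, 2, 16

# Random weights standing in for a trained sampling network (assumption).
W1 = rng.normal(scale=0.5, size=(STATE_DIM + NOISE_DIM, HIDDEN))
W2 = rng.normal(scale=0.1, size=(HIDDEN, ACTION_DIM))


def sampling_network_action(state, deterministic=False):
    """SQL-style policy: the action is a function of the state and input noise."""
    if deterministic:
        # Freeze the noise at the mode of a zero-mean Gaussian.
        noise = np.zeros(NOISE_DIM)
    else:
        noise = rng.normal(size=NOISE_DIM)
    hidden = np.tanh(np.concatenate([state, noise]) @ W1)
    return np.tanh(hidden @ W2)


def gaussian_action(mean, log_std, deterministic=False):
    """Gaussian policy: return the mean in deterministic mode, else a sample."""
    if deterministic:
        return mean
    return mean + np.exp(log_std) * rng.normal(size=mean.shape)


state = np.array([0.1, -0.2, 0.3])
eval_action = sampling_network_action(state, deterministic=True)
train_action = sampling_network_action(state)
```

With the noise frozen, repeated calls on the same state return the same action, which is what makes evaluation deterministic. As noted above, nothing guarantees that this action is a good one.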

AlphaFrank commented 5 years ago

Thanks for the reply. I do have another question. It looks like a Gaussian policy is used here for SQL, instead of the stochastic neural network from the original SQL paper. Does switching to a Gaussian policy improve the performance?

haarnoja commented 5 years ago

Not quite sure what you mean. SQL works only with expressive policies, like SVGD. If you use a more restrictive class of policies, like Gaussian, then the algorithm actually corresponds to soft actor-critic, which in general has better performance on standard benchmarks.

AlphaFrank commented 5 years ago

Sorry, I meant that in the code you provided, a Gaussian policy is used for SQL. I added a print statement to check the type of the policy, and it printed a Gaussian policy. Maybe I misunderstood something here? It would be great if I could get clarification on which policy SQL is using in this implementation :). Btw, thanks for making this open-source!

haarnoja commented 5 years ago

Can you point me to the code?

hartikainen commented 5 years ago

I think @AlphaFrank is right: SQL currently uses the default GaussianPolicy. This is my bad. The policy should be changed to the StochasticNNPolicy that we used to have in our old repo. Interestingly, things still work pretty well even with GaussianPolicy :smile: For the results with the current setup (i.e. using GaussianPolicy), see: https://github.com/rail-berkeley/softlearning/pull/23#issuecomment-459182354.