So I fiddled around with this some more, recreated your implementation, and compared it to a Gumbel-Softmax implementation of SAC.
Unfortunately I think the implementation in your code is inaccurate, even though the math in the paper describes it correctly, namely as an expectation. The easiest way to verify this is that setting `self.alpha` to zero should recover the expected DQN results, i.e. the Q-target should be the expected Q-value under the policy. Taking `mean()` on a probability-weighted tensor instead shrinks the Q-values/policy-values by a factor equal to the number of actions, which cannot be what you want.
I believe the minimal change to your code is to replace `.mean()` with `.sum()` here:
and here as well:
Doing this produces `expected_q_values` that match the rewards in the environments, so that's relatively easy to check.
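To make the scaling issue concrete, here is a tiny self-contained check (my own toy snippet, not your code) showing that `.mean(dim=1)` on the probability-weighted tensor is exactly the expectation divided by the number of actions:

```python
import torch

batch_size, n_actions = 4, 6
q_values = torch.randn(batch_size, n_actions)                     # Q(s, a) for each action
action_probabilities = torch.softmax(torch.randn(batch_size, n_actions), dim=1)

weighted = action_probabilities * q_values                        # pi(a|s) * Q(s, a)
expected_q = weighted.sum(dim=1, keepdim=True)                    # E_pi[Q(s, .)] -- what we want
scaled_down = weighted.mean(dim=1, keepdim=True)                  # == expected_q / n_actions

assert torch.allclose(scaled_down * n_actions, expected_q)
```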
I might be totally wrong about all of the above of course, so if anyone has ideas or comments, please let me know.
EDIT: Previously I had a section here on a different approach to the target_entropy, but upon reflection it seems your target alpha loss implementation is somewhat different from mine, so I am going to leave it out of this ticket. If anyone is interested in the future, I can happily paste my version.
Thanks a lot. Do you want to create a pull request with those changes and I will merge it?
On Wed, May 27, 2020 at 04:02, NikEyX notifications@github.com wrote:
Lastly, I believe this changes the target entropy. In your implementation the target entropy is set to 0.98 times the maximum entropy, i.e. `-np.log(1.0 / n_actions) * 0.98`.
I am not sure where the 0.98 comes from - perhaps you want to keep it slightly below the theoretical maximum - but regardless, given the changes above I think it needs to be divided further by the number of actions:
`self.target_entropy = -np.log(1.0 / n_actions) / n_actions`
In my few empirical experiments this seems to stabilize the entropy between 0 and 1.
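For concreteness, here is what the two choices work out to for a 4-action environment (just a quick numerical sketch, not code from the repo):

```python
import numpy as np

n_actions = 4
max_entropy = -np.log(1.0 / n_actions)       # maximum entropy of a 4-action policy, ~1.386
current_target = 0.98 * max_entropy          # the 0.98-scaled target, ~1.359
proposed_target = max_entropy / n_actions    # the scaling suggested above, ~0.347
print(current_target, proposed_target)
```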
I think it's probably better if someone else does that and verifies it independently. Thanks for the offer though :)
Hello,
I had to implement this exact algorithm for a school project. I can confirm that changing
`min_qf_next_target = min_qf_next_target.mean(dim=1).unsqueeze(-1)`
to
`min_qf_next_target = min_qf_next_target.sum(dim=1).unsqueeze(-1)`
as well as
`policy_loss = policy_loss.mean()`
to
`policy_loss = policy_loss.sum(dim=1).mean()`
fixed the whole algorithm on my side. Before the change, I observed the policy converging toward the uniform distribution. These two changes saved my day.
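For context, here is roughly where the two changed lines sit in my version of the discrete SAC update (the surrounding code and names are my own and may differ from the repo's):

```python
import torch

def critic_target(policy, q1_target, q2_target, next_states, rewards, dones, alpha, gamma):
    # Soft Q-learning target for discrete actions, computed as an expectation over actions.
    with torch.no_grad():
        next_probs, next_log_probs = policy(next_states)   # pi(.|s') and log pi(.|s')
        min_qf_next_target = next_probs * (
            torch.min(q1_target(next_states), q2_target(next_states)) - alpha * next_log_probs
        )
        # expectation over actions: sum the probability-weighted terms, don't average them
        min_qf_next_target = min_qf_next_target.sum(dim=1).unsqueeze(-1)
        return rewards + (1.0 - dones) * gamma * min_qf_next_target

def actor_loss(policy, q1, q2, states, alpha):
    probs, log_probs = policy(states)
    policy_loss = probs * (alpha * log_probs - torch.min(q1(states), q2(states)))
    # sum over actions first (expectation under the policy), then average over the batch
    return policy_loss.sum(dim=1).mean()
```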
Thanks!
@NikEyX, could you paste your version of `self.alpha`, please? Actually, I am also unsure about how alpha should be initialized.
Hi.
I made pull request #60, which fixes the actor's and critic's losses. I tested it using CartPole.py and saw that training converges more stably.
Thanks :)
@ku2482, @p-christ,
First, thanks for the PR made by @ku2482. But after testing with CartPole, there is an issue I want to mention:
When training for more than 150 episodes, the learned alpha becomes very large - personally, I've seen it reach 7000 over 500 episodes!
In addition, I found that `target_entropy` is important for the stability of alpha: if it is set to something like -0.1 or 0.2 it is OK, but if it is around 0.68, alpha diverges for this env.
I don't know what is wrong, since the issue doesn't occur with the continuous version. At present I work around it by setting `self.alpha = torch.clamp(self.alpha, min=0.0, max=1)` after updating alpha.
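For reference, this is roughly what my workaround looks like around the temperature update (the alpha-loss form and names like `log_alpha` / `alpha_optim` are from my own code, not necessarily the repo's):

```python
import torch

def update_temperature(log_alpha, alpha_optim, probs, log_probs, target_entropy):
    # probs / log_probs: [batch, n_actions] from the current policy; log_alpha: learnable scalar
    entropy = -(probs * log_probs).sum(dim=1).mean()            # average policy entropy
    alpha_loss = log_alpha * (entropy - target_entropy).detach()
    alpha_optim.zero_grad()
    alpha_loss.backward()
    alpha_optim.step()
    # the workaround: clamp alpha after the update so it cannot blow up
    return torch.clamp(log_alpha.exp(), min=0.0, max=1.0)
```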
I have observed the same behavior. I assume the cause is that the target entropy (0.98 * the maximum) is too large when |A| = 2, so the policy never reaches it and alpha keeps growing.
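A quick calculation for CartPole (my own back-of-the-envelope numbers) makes this concrete:

```python
import numpy as np

max_entropy = np.log(2)          # maximum entropy of a 2-action policy, ~0.693
target = 0.98 * max_entropy      # ~0.679 -- so close to the maximum that a policy which has
                                 # learned anything useful never reaches it, so alpha keeps growing
print(max_entropy, target)
```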
@p-christ What do you think?
Yeah, that is probably why. If you treat the 0.98 as a hyperparameter and make it smaller, then hopefully that will solve your problem. Let us know if it works!
@p-christ Thank you for your reply :)
@dbsxdbsx I've tested CartPole-v0 with various target entropies before. At least for CartPole-v0, 0.8 or 0.7 * the maximum seems fine.
Thanks.
@p-christ @ku2482, thanks for your replies. Hopefully it is not an algorithm issue.
Hi Petros,
in your SAC_discrete code you are using the following in SAC_Discrete.py:

So it looks like you're effectively taking the mean of the probability-weighted Q-values in the last line? However, the weights are already provided by `action_probabilities`, so instead of `.mean(dim=1)` it should be `.sum(dim=1)` in order to give the proper expectation of the next Q-value for each item in the batch. Or what am I missing here?

Running this on LunarLander I also noticed that the automatic entropy adjustment doesn't seem to work well. The original SAC paper mentions there is no "upper bound" in the calculation, since the value should be tightly coupled to the constrained optimization problem. However, in your implementation the `self.alpha` value can go very high! I would have expected it to stay close to, or below, the initial target_entropy level, which in the case of LunarLander is `-log(1/4) * 0.98 ≈ 1.36`, but I've seen it go over 20! Do you have an explanation for that by chance?

Thanks for your work, it's very much appreciated btw.