p-christ / Deep-Reinforcement-Learning-Algorithms-with-PyTorch

PyTorch implementations of deep reinforcement learning algorithms and environments

Mean of expectation in SAC_discrete.py possibly wrong? #54

Closed NikEyX closed 4 years ago

NikEyX commented 4 years ago

Hi Petros,

in your discrete SAC implementation you are using the following in SAC_Discrete.py:

min_qf_next_target = action_probabilities * (torch.min(qf1_next_target, qf2_next_target) - self.alpha * log_action_probabilities)
min_qf_next_target = min_qf_next_target.mean(dim=1).unsqueeze(-1)

So it looks like you're effectively taking the mean of the probability-weighted Q-values in the last line? However, the weights are already provided by action_probabilities, so instead of .mean(dim=1) it should be .sum(dim=1) in order to give the proper expectation of the next Q-value for each item in the batch. Or what am I missing here?
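For illustration, here is a self-contained sketch of the fix I have in mind (made-up shapes and an arbitrary alpha, variable names mirroring the snippet above, not your exact code):

import torch

batch_size, n_actions = 32, 4   # e.g. LunarLander has 4 discrete actions
alpha = 0.2                     # illustrative temperature

# tensors with the shapes used in SAC_Discrete.py
action_probabilities = torch.softmax(torch.randn(batch_size, n_actions), dim=1)
log_action_probabilities = torch.log(action_probabilities)
qf1_next_target = torch.randn(batch_size, n_actions)
qf2_next_target = torch.randn(batch_size, n_actions)

weighted = action_probabilities * (
    torch.min(qf1_next_target, qf2_next_target)
    - alpha * log_action_probabilities
)
# summing over the action dimension gives the expectation under the policy
min_qf_next_target = weighted.sum(dim=1).unsqueeze(-1)   # shape [batch_size, 1]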

Running this on LunarLander, I also noticed that the automatic entropy adjustment doesn't seem to work well. The original SAC paper mentions there is no "upper bound" in the calculation, since the value should be tightly coupled to the constrained optimization problem. However, in your implementation the self.alpha value can go very high! I would have expected it to stay close to or below the initial target_entropy level, which in the case of LunarLander is -log(1/4) * 0.98 ≈ 1.36, but I've seen it go to over 20! Do you have an explanation for that by chance?
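For reference, here is one common way to write the automatic temperature update, sketched roughly from the SAC paper (in terms of the policy entropy; your implementation may differ in details). It only pushes alpha down once the policy entropy exceeds the target, so nothing bounds alpha from above:

import numpy as np
import torch

n_actions = 4                                            # LunarLander
target_entropy = float(-np.log(1.0 / n_actions) * 0.98)  # ≈ 1.36

log_alpha = torch.zeros(1, requires_grad=True)
alpha_optim = torch.optim.Adam([log_alpha], lr=3e-4)

# entropy of the current policy for a batch of states (illustrative values);
# for a discrete policy this is -sum_a pi(a|s) * log pi(a|s)
policy_entropy = torch.full((32, 1), 0.5)

# alpha is pushed up whenever entropy < target_entropy, with no cap
alpha_loss = -(log_alpha * (target_entropy - policy_entropy)).mean()
alpha_optim.zero_grad()
alpha_loss.backward()
alpha_optim.step()
alpha = log_alpha.exp()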

Thanks for your work, it's very appreciated btw

NikEyX commented 4 years ago

So I fiddled around more with this, recreated your implementation and compared it to a Gumbel-Softmax implementation of SAC.

Unfortunately I think the implementation in your code is inaccurate, even though the math in the paper describes it correctly, namely as an expectation. The easiest way to verify this is that setting self.alpha to zero should recover the ordinary DQN-style target, i.e. the expected Q-value under the policy. Taking mean() on a probability-weighted tensor divides the Q-values/policy values by the number of actions, which cannot be what you want.
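A tiny numerical check of this point (made-up numbers): with a uniform policy over 4 actions and every Q-value equal to 1, the expectation should be 1, but mean() gives 0.25.

import torch

probs = torch.tensor([[0.25, 0.25, 0.25, 0.25]])   # uniform policy over 4 actions
q_values = torch.tensor([[1.0, 1.0, 1.0, 1.0]])    # every action worth exactly 1

weighted = probs * q_values
print(weighted.sum(dim=1))    # tensor([1.])     <- the true expectation
print(weighted.mean(dim=1))   # tensor([0.2500]) <- expectation / n_actions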

I believe the minimal change to your code is replacing .mean with .sum here:

https://github.com/p-christ/Deep-Reinforcement-Learning-Algorithms-with-PyTorch/blob/49b5ec054ef34ef77516d2bfa1713b5d254fd84a/agents/actor_critic_agents/SAC_Discrete.py#L72

and here as well:

https://github.com/p-christ/Deep-Reinforcement-Learning-Algorithms-with-PyTorch/blob/49b5ec054ef34ef77516d2bfa1713b5d254fd84a/agents/actor_critic_agents/SAC_Discrete.py#L89

Doing this will create expected_q_values that match the rewards in the environments - so that's relatively easy to check.

I might be totally wrong on everything above of course, so if anyone has ideas/comments on the above, please let me know

EDIT: Previously I had a section here on a different approach to the target_entropy, but upon reflection it seems your target alpha loss implementation is somewhat different from mine, so I am going to leave it out of this ticket. If anyone is interested in the future, I can happily paste my version.

p-christ commented 4 years ago

Thanks a lot. Do you want to create a pull request with those changes and I will merge it?


NikEyX commented 4 years ago

I think it's probably better if someone else does that and verifies it independently. Thanks for the offer though :)

misterguick commented 4 years ago

Hello,

I had to implement this exact algorithm for a school project. I can confirm that changing

min_qf_next_target = min_qf_next_target.mean(dim=1).unsqueeze(-1) to min_qf_next_target = min_qf_next_target.sum(dim=1).unsqueeze(-1)

as well as

policy_loss = policy_loss.mean() to policy_loss = policy_loss.sum(dim=1).mean()

fixed the whole algorithm on my side. Before the change, I observed the policy converging toward the uniform distribution. These two changes saved my day.
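For anyone reading later, a self-contained sketch of what the corrected policy loss looks like with those two changes (made-up tensors and an illustrative alpha, not the exact repo code):

import torch

batch_size, n_actions = 32, 2
alpha = 0.2                                      # illustrative temperature

action_probabilities = torch.softmax(torch.randn(batch_size, n_actions), dim=1)
log_action_probabilities = torch.log(action_probabilities)
min_qf_pi = torch.randn(batch_size, n_actions)   # elementwise min of the two critics

inside_term = alpha * log_action_probabilities - min_qf_pi
# sum over actions gives the per-state expectation, then average over the batch
policy_loss = (action_probabilities * inside_term).sum(dim=1).mean()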

Thanks !

dbsxdbsx commented 4 years ago

@NikEyX, could you paste your version of self.alpha, please? Actually, I am also at a loss about the initialization of alpha.

toshikwa commented 4 years ago

Hi.

I made pull request #60, which fixes the actor's and critic's losses. I tested it using CartPole.py and saw that training converges more stably.

Thanks :)

dbsxdbsx commented 4 years ago

@ku2482, @p-christ,
First, thanks for the PR made by @ku2482. But after testing with CartPole, there is an issue I want to mention: when training for more than 150 episodes, the learned alpha becomes very large; personally, I've seen it exceed 7000 after 500 episodes! In addition, I found that target_entropy is important to the stability of alpha: if it is set to something like -0.1 or 0.2 it is OK, but if it is around 0.68, alpha diverges for this env.

I don't know what is wrong, since the issue doesn't occur with the continuous version. At present, I work around it by setting self.alpha = torch.clamp(self.alpha, min=0.0, max=1) after updating alpha.
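Concretely, the workaround amounts to something like this after each alpha update (just a sketch with a made-up log_alpha value):

import torch

log_alpha = torch.tensor([3.0])               # pretend the optimiser pushed it this high
alpha = log_alpha.exp()                       # ~20.1, the kind of blow-up described above
alpha = torch.clamp(alpha, min=0.0, max=1.0)  # cap the temperature used in the losses at 1
print(alpha)                                  # tensor([1.])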

toshikwa commented 4 years ago

I have observed the same behavior. I suspect the cause is that the target entropy (0.98 * maximum) is too large when |A| = 2, so the policy never reaches it and alpha keeps growing.
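A rough back-of-the-envelope check of this (illustrative numbers):

import numpy as np

n_actions = 2
max_entropy = np.log(n_actions)        # ≈ 0.693 for |A| = 2
target = 0.98 * max_entropy            # ≈ 0.679

# a policy that has become only slightly deterministic, e.g. pi = (0.9, 0.1),
# already has entropy far below the target, so alpha keeps being pushed up
p = np.array([0.9, 0.1])
entropy = -(p * np.log(p)).sum()       # ≈ 0.325 < 0.679
print(max_entropy, target, entropy)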

@p-christ What do you think?

p-christ commented 4 years ago

Yeah, that is probably why. If you treat the 0.98 as a hyperparameter and make it smaller, then hopefully that will solve your problem. Let us know if it works!


toshikwa commented 4 years ago

@p-christ Thank you for your reply :)

@dbsxdbsx I've tested CartPole-v0 with various target entropies before. At least for CartPole-v0, 0.8 or 0.7 * maximum seems fine.

[Screenshot 2020-10-01 21:51:24]

Thanks.

dbsxdbsx commented 4 years ago

@p-christ @ku2482 , thanks for your replies. Hopefully, it is not an algorithm issue.