rail-berkeley / softlearning

Softlearning is a reinforcement learning framework for training maximum entropy policies in continuous domains. Includes the official implementation of the Soft Actor-Critic algorithm.
https://sites.google.com/view/sac-and-applications

Question on initialization of alpha and entropy #149

Closed dbsxdbsx closed 4 years ago

dbsxdbsx commented 4 years ago

Question 1: From heuristic_target_entropy here, I see that the initialization of alpha is related to action_dim, but I can't figure out why it should be related to action_dim. Theoretically, would it also work to just set target_entropy to a hard-coded number, like 0.1? (Practically it seems to work, but I am not sure.)

Question 2: According to your SAC version 2 paper, the entropy of a state-action pair should NEVER be lower than target_entropy, but during training, after each learning round, I found that the entropy of a state-action pair is sometimes lower than target_entropy! The check looks like this (I use PyTorch):

    alpha_loss = torch.tensor(0., device=self.device)
    alpha_tlogs = torch.tensor(self.alpha)  # for TensorBoardX logs

    # Check whether any per-sample entropy (-log_pi) falls below the target.
    for per_sample_entropy in -log_pi:
        if per_sample_entropy < self.target_entropy:
            print("error: -log_pi < target_entropy!")

Does it mean I coded it wrong?

Question 3: Alpha sometimes goes higher than 1 during learning. Is that correct?

hartikainen commented 4 years ago

Regarding question 1: The automatic temperature tuning presented in [1] requires us to choose some entropy lower bound. The value we choose based on the action dimension is actually not the initial value for the temperature, but rather the target value for the entropy. There's no task-independent target entropy value that is always guaranteed to work, but empirically, using the negative of the number of action dimensions seems to work well, which is why we default to such a heuristic. This is basically saying that the higher-dimensional our action space is, the higher the entropy value should be, which hopefully makes sense intuitively. Let me know if it doesn't.

For many tasks, there's quite a wide range of entropy values that work, so in your case, using a number like 0.1 could work well. The heuristic is there just as a default value so that you don't necessarily have to know anything about the task at hand in order to get it running in the first place.
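
For concreteness, the heuristic boils down to roughly the following (a minimal sketch; the actual heuristic_target_entropy helper also handles other space types, but for continuous Box spaces it reduces to this):

    import numpy as np

    def heuristic_target_entropy(action_space_shape):
        # Target entropy = -(number of action dimensions),
        # e.g. -6.0 for a 6-dimensional action space.
        return -float(np.prod(action_space_shape))

    # heuristic_target_entropy((6,)) -> -6.0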

Regarding question 2: The way the temperature is learned considers the expected entropy, i.e. the constraint is taken in expectation over the states. Thus it's not unexpected to see arbitrary entropy values for a single state. The reason we want this to happen is that there might be states where we want the policy to be of very low entropy whereas some other states might permit much higher entropy, and so the objective only considers the expectation. Hopefully that makes sense!
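
To make the "in expectation" part concrete, here is a minimal PyTorch-style sketch of a temperature update (illustrative names, not the exact softlearning code):

    import torch

    def temperature_loss(log_alpha, log_pi, target_entropy):
        # log_pi: log-probabilities of the sampled actions, shape (batch_size,).
        # The loss is averaged over the batch, so the constraint only holds
        # in expectation: individual states can still have entropy (-log_pi)
        # below target_entropy.
        return -(log_alpha * (log_pi + target_entropy).detach()).mean()

The gradient pushes the temperature up when the batch-mean entropy falls below the target and down when it exceeds it; no single state is constrained on its own.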

[1] Haarnoja, T., Zhou, A., Hartikainen, K., Tucker, G., Ha, S., Tan, J., Kumar, V., Zhu, H., Gupta, A., Abbeel, P. and Levine, S., 2018. Soft actor-critic algorithms and applications. arXiv preprint arXiv:1812.05905.

dbsxdbsx commented 4 years ago

@hartikainen, thanks for your answer. What about question 3? Frankly, I got some very high values, like 20 or above, when implementing SAC with discrete actions like here.

hartikainen commented 4 years ago

Ah, apologies, I forgot to address question 3. Reasonable values for the entropy can be anything in (-inf, action_dim * log(2)], and, as the temperature (alpha) depends on the reward scale, it can basically take any positive value.
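
To spell out where that upper bound comes from (assuming, as in softlearning, actions squashed into [-1, 1]^d by a tanh): differential entropy is maximized by the uniform distribution over the support, so

    \max_{\pi} \mathcal{H}\big(\pi(\cdot \mid s)\big) = \log \operatorname{Vol}\big([-1, 1]^{d}\big) = \log 2^{d} = d \log 2,

while an increasingly deterministic policy has differential entropy approaching -inf, which gives the interval (-inf, action_dim * log(2)].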

I don't have much experience with discrete SAC, but in that case the entropy is always non-negative, while the temperature could still be anything depending on the reward.
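
For intuition in the discrete case (a quick sketch, not softlearning code): the entropy of a categorical policy over n actions always lies in [0, log(n)], from a deterministic policy to a uniform one.

    import numpy as np

    def categorical_entropy(probs):
        # Entropy of a categorical distribution; always in [0, log(n)].
        probs = np.asarray(probs, dtype=float)
        nonzero = probs > 0
        return -np.sum(probs[nonzero] * np.log(probs[nonzero]))

    categorical_entropy([0.25, 0.25, 0.25, 0.25])  # log(4) ≈ 1.386 (uniform)
    categorical_entropy([1.0, 0.0, 0.0, 0.0])      # 0.0 (deterministic)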

lidongke commented 3 years ago

First, you said "This is basically saying that the higher-dimensional our action space is, the higher the entropy value should be", but with your heuristic_target_entropy, target_entropy gets smaller and smaller as action_dim grows (target_entropy = -1 for action dimension 1, target_entropy = -2 for action dimension 2)?

Second, you said "Reasonable values for the entropy can be anything in (-inf, action_dim * log(2)]". Why is the upper bound action_dim * log(2)? I thought the differential entropy's upper bound for a continuous space is inf?

@hartikainen @dbsxdbsx

dbsxdbsx commented 3 years ago

@lidongke, I am not quite familiar with the math part of SAC, but practically I can say this. First, regarding "the target_entropy will be getting smaller and smaller": target_entropy does not change during training; it works as a threshold. I set this value like this:

        self.target_entropy = 0.2  #-np.log((1.0 / self.act_dim)) * 0.98

You can also use the one I commented out; just don't let this value be negative, or it would be meaningless.

In addition, what does change is the parameter alpha. And, as @hartikainen explained, because the constraint is only enforced in expectation, the per-state entropy can sometimes be LOWER than target_entropy.

Second, I agree that the differential entropy's upper bound for a continuous space should be inf. And practically, I found that alpha would grow as large as, say, 1000 with some specific initializations of target_entropy in the discrete-action version! For more detail, see here. I couldn't figure it out; it seems to be a math issue. Is it also possible in the continuous-action version? I have no idea. Anyway, what I did to work around it is to clip it, like:

    # Workaround: keep alpha clamped to a fixed range so it cannot blow up.
    self.alpha = torch.clamp(self.log_alpha.exp(),
                             min=self.target_entropy,
                             max=1)

That way, alpha finally ends up in a reasonable range.

One more thing to mention: note that target_entropy is always used as a threshold (for both discrete and continuous actions) below which we do not want the policy entropy to fall (though sometimes it does happen anyway). And when there is no restriction, alpha can be anywhere in [0, inf).

lidongke commented 3 years ago

@dbsxdbsx "the target_entropy will getting smaller and smaller " i mean that different task has different target value, i know the target_entropy is not changeable ,but @hartikainen said "This is basically saying that the higher-dimensional our action space is the higher the entropy value should be".I can't understand this, i found that higher-dimensional aciton space will get smaller target entropy value(target_entropy = -1 for action dimensional size = 1,target_entropy = -2 for action dimensional size = 2)

dbsxdbsx commented 2 years ago

@hartikainen ,

> Ah, apologies, I forgot to address question 3. Reasonable values for the entropy can be anything in (-inf, action_dim * log(2)], and, as the temperature (alpha) depends on the reward scale, it can basically take any positive value.

> I don't have much experience with discrete SAC, but in that case the entropy is always non-negative, while the temperature could still be anything depending on the reward.

What does a negative entropy value mean in the context of continuous actions? Why is the upper bound action_dim * log(2)? And why is the lower bound 0 in the discrete case but -inf in the continuous case?

Sorry for these questions, maybe I still don't get the true meaning of entropy here.