Closing, as all tests were performed successfully. I think this issue gives a clear overview of the differences found between the SAC version of Minghoa et al. 2020 and the other implementations.
This issue highlights some differences I found when comparing the SAC implementation of Minghoa with the implementation of Haarnoja et al. 2019.
Different alpha optimization
During the comparison, I found that Minghoa optimizes the alpha Lagrange multiplier differently than Haarnoja et al. 2019.
Haarnoja et al. 2019
Haarnoja et al. 2019 optimize alpha as follows:
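Up to notation, this corresponds to the temperature objective from their paper, with the target entropy written as \bar{\mathcal{H}}:

J(\alpha) = \mathbb{E}_{a_t \sim \pi}\left[-\alpha \log \pi(a_t \mid s_t) - \alpha \bar{\mathcal{H}}\right]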
Minghoa's version
Minghoa, on the other hand, does the following:
https://github.com/rickstaa/LAC_TF2_TORCH_Translation/blob/8150ef056bcbf97d8651fbcaa1de0f91018b08c2/legacy/LAC_ORIGINAL/LAC_ORIGINAL/LAC/LAC_V1.py#L146-L148
He then optimizes log_alpha based on this loss:
https://github.com/rickstaa/LAC_TF2_TORCH_Translation/blob/8150ef056bcbf97d8651fbcaa1de0f91018b08c2/legacy/LAC_ORIGINAL/LAC_ORIGINAL/LAC/LAC_V1.py#L149-L151
I verified this choice with Minghoa, and he said he copied it from Haarnoja. I therefore think he used an old commit in which Haarnoja et al. 2019 were still using log_alpha in the loss function. They initially used log_alpha since it is more numerically stable (see this issue). They, however, reverted this change in #a187972 because it caused confusion. Reverting it reintroduced a numerical instability (see this issue). This was eventually solved by using alpha in the loss function (to keep it in line with the article) while optimizing log_alpha instead of alpha. The correct implementation should therefore be:
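As a rough PyTorch-style sketch of that idea (names such as `log_alpha`, `logp_pi` and `target_entropy` are placeholders, not the actual repository code):

```python
import torch

# Hedged sketch: alpha appears in the loss (in line with the article), but the
# trainable parameter is log_alpha, which keeps the optimization numerically stable.
log_alpha = torch.zeros(1, requires_grad=True)
alpha_optimizer = torch.optim.Adam([log_alpha], lr=3e-4)

target_entropy = -3.0       # e.g. -dim(action_space)
logp_pi = torch.randn(256)  # log pi(a|s) for a batch of sampled actions

alpha = log_alpha.exp()
alpha_loss = -(alpha * (logp_pi + target_entropy).detach()).mean()

alpha_optimizer.zero_grad()
alpha_loss.backward()  # the gradient flows only to log_alpha
alpha_optimizer.step()
```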
@panweihit I went ahead and changed the code in the new versions; I left the incorrect implementation in the legacy code as a reference. Related to this, I propose a change to the labda_loss. We use the following in our current implementation:
https://github.com/rickstaa/LAC_TF2_TORCH_Translation/blob/8150ef056bcbf97d8651fbcaa1de0f91018b08c2/legacy/LAC_ORIGINAL/LAC_ORIGINAL/LAC/LAC_V1.py#L145
It might, however, be clearer to use lambda instead of log_labda, as this is more in line with the formulas in Han et al. 2020.
If we do, we should add a short note explaining why we optimize log_labda/log_alpha instead of lambda/alpha, pointing to the Haarnoja et al. 2019 issue mentioned above.
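A hedged sketch of what the proposed labda_loss could then look like (all names are placeholders; `constraint_term` stands in for the Lyapunov constraint term from Han et al. 2020, whose exact form is omitted here):

```python
import torch

# Hedged sketch: lambda = exp(log_labda) is used in the loss, in line with
# Han et al. 2020, while the optimizer still updates log_labda.
log_labda = torch.zeros(1, requires_grad=True)
labda_optimizer = torch.optim.Adam([log_labda], lr=3e-4)

constraint_term = torch.randn(256)  # placeholder for the Lyapunov constraint term

labda = log_labda.exp()
labda_loss = -(labda * constraint_term.detach()).mean()

labda_optimizer.zero_grad()
labda_loss.backward()
labda_optimizer.step()
```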
Different Critic optimization
I found two differences in the way Haarnoja and Minghoa optimize the critic.
Critic loss function
Haarnoja et al. 2019, Spinning Up and stable-baselines use the following formula for calculating the critic loss:
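Up to notation, this is the soft Bellman residual with the 0.5 factor in front (the same factor that appears in formula 7 of Han et al. 2020):

J_Q(\theta) = \mathbb{E}_{(s_t, a_t) \sim \mathcal{D}}\left[\tfrac{1}{2}\left(Q_\theta(s_t, a_t) - \left(r(s_t, a_t) + \gamma\, \mathbb{E}_{s_{t+1}}\left[V_{\bar{\theta}}(s_{t+1})\right]\right)\right)^2\right]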
Minghoa, however, omits the 0.5 multipliers when calculating the Critic loss (see Actor-critic-with-stability-guarantee/LAC/SAC_cost.py#L147-L148):
I think omitting the 0.5 does not influence the result, as it is mainly added to make the derivation easier: it cancels the factor of 2 that appears when differentiating the squared error and otherwise only rescales the gradient, which the learning rate absorbs (this is also explained in this StackExchange question). This can further be seen from the fact that the Spinning Up authors include the 0.5 in their TensorFlow implementation but omit it in their PyTorch version. We are therefore free to include or exclude it. @panweihit Maybe it is a good idea to include it to prevent possible confusion as to why the code omits it while the multiplication factor is present in formula 7 of Han et al. 2020. Alternatively, I can add a quick note on the topic while excluding it.
Different Critic Bellman backup calculation
There is a difference in how Minghoa and Haarnoja et al. 2019 calculate the Critic target values.
Minghoa uses a target actor to calculate both the log probabilities and the target Q values (see /LAC/SAC_cost.py#L138-L142, /LAC/SAC_cost.py#L135 and /LAC/SAC_cost.py#L96). Haarnoja et al. 2019, stable-baselines and Spinning Up don't have a target Actor and therefore use the main actor for calculating these values, which is in line with equation 6 of Haarnoja et al. 2019. I therefore think that, while implementing the SAC algorithm, Minghoa re-used the entropy calculation from the LAC algorithm, in which we do need a target network. I do not know whether using the target Actor (which follows the main actor using an exponential moving average with a decay factor of 0.995) influences the results. On the one hand, it could cause the SAC algorithm to explore more and therefore get stuck in local optima less often. On the other hand, it could lead to lower data efficiency. In the end, however, both implementations should converge to the same result. @panweihit Let me know what you think. Nevertheless, I propose we stick with Haarnoja's implementation to keep our code consistent with the literature.
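For comparison, a hedged sketch of how these other implementations compute the Bellman backup (the actor/critic objects and their `sample` method are placeholders, not our code):

```python
import torch


def sac_bellman_backup(actor, target_critic1, target_critic2,
                       reward, next_obs, done, gamma, alpha):
    """Hedged sketch of the target-value calculation used by Haarnoja et al.
    2019, Spinning Up and stable-baselines; all argument names are
    placeholders. The next action and its log-probability come from the
    *main* actor (there is no target actor), while the Q values come from
    the target critics."""
    with torch.no_grad():
        next_action, logp_next = actor.sample(next_obs)
        min_q_targ = torch.min(
            target_critic1(next_obs, next_action),
            target_critic2(next_obs, next_action),
        )
        return reward + gamma * (1.0 - done) * (min_q_targ - alpha * logp_next)
```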
Different optimization order
The optimization order differs between Minghoa and Haarnoja et al. 2019.
Minghoa's optimization order
In Minghoa's version, since he uses TensorFlow 1.0, the optimization order is not defined (see this question), except for the target update, which is executed before optimizing the critic due to the specified control dependency (see /LAC/SAC_cost.py#L136).
Haarnoja et al. 2019
Haarnoja et al. 2019 also don't define the optimization order in their old code. In the new code, however, they use the following order:
Update Critic
Update Actor
Update alpha
Update target network
In contrast to Minghoa, Haarnoja therefore applies the target update after all the other parameters have been optimized (see softlearning/algorithms/sac.py#L292-L298).
If we look at other implementations (spinup/algos/tf1/sac/sac.py#L198-L205 and stable_baselines/sac/sac.py#L465-L467), we see that they use the same convention as Haarnoja et al. 2019. I verified this with Minghoa, and he said the choice he made was arbitrary. @panweihit I therefore propose that we use the convention of Haarnoja et al. 2019 in the new code. I don't think changing the order would matter much, but I will do some quick tests.
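A rough sketch of that convention (the loss callables, optimizers and target-update callable are all placeholders, not actual repository code):

```python
def sac_training_step(critic_loss, actor_loss, alpha_loss,
                      critic_optimizer, actor_optimizer, alpha_optimizer,
                      update_targets):
    """Hedged sketch of the update order used by Haarnoja et al. 2019: critic,
    actor and alpha are updated first, and the target networks are updated
    last. The arguments are placeholders (zero-argument callables returning
    the losses sketched above, their optimizers, and a target-update callable)."""
    for loss_fn, optimizer in ((critic_loss, critic_optimizer),
                               (actor_loss, actor_optimizer),
                               (alpha_loss, alpha_optimizer)):
        optimizer.zero_grad()
        loss_fn().backward()
        optimizer.step()
    update_targets()  # e.g. polyak averaging, applied after all other updates
```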
Other differences
Actor loss function
Some SAC implementations (Minghoa's version, Spinning Up and Haarnoja et al. 2019) use the minimum of the two Q estimates (min Q_target) when calculating the actor loss, while others (stable-baselines) use, for example, only Q1. This was discussed in this issue and did not influence the results. We can therefore stick to using the min Q_target, as this matches the original implementation of Haarnoja et al. 2019.
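For reference, a hedged sketch of such an actor loss (placeholder networks; the key point is the minimum over the two Q estimates):

```python
import torch


def sac_actor_loss(actor, critic1, critic2, obs, alpha):
    """Hedged sketch of the actor loss that uses the minimum of the two Q
    estimates, as in Minghoa's version, Spinning Up and Haarnoja et al. 2019;
    all argument names are placeholders."""
    action, logp_pi = actor.sample(obs)
    min_q = torch.min(critic1(obs, action), critic2(obs, action))
    return (alpha * logp_pi - min_q).mean()
```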
Different network output activation functions
In https://github.com/rickstaa/stable-learning-control/issues/97, I also found a difference between Minghoa and the other implementations regarding the output activation functions of the Gaussian actor and the critic.
TODOS: