Closing, as all tests were performed successfully. I think this issue gives a clear overview of the differences found between the SAC version of Minghoa et al. 2020 and the other implementations.
This issue highlights some differences I found when comparing the SAC implementation of Minghoa with the implementation of Haarnoja et al. 2019.
Different alpha optimization
During the comparison, I found that Minghoa optimizes the alpha Lagrange multiplier differently than Haarnoja et al. 2019.
Haarnoja et al. 2019
Haarnoja et al. 2019 optimize alpha as follows:
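Up to notation, this corresponds to the temperature objective from their paper, with the target entropy written as \bar{\mathcal{H}}:

J(\alpha) = \mathbb{E}_{a_t \sim \pi}\left[-\alpha \log \pi(a_t \mid s_t) - \alpha \bar{\mathcal{H}}\right]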
Minghoa's version
Minghoa, on the other hand, does the following:
https://github.com/rickstaa/LAC_TF2_TORCH_Translation/blob/8150ef056bcbf97d8651fbcaa1de0f91018b08c2/legacy/LAC_ORIGINAL/LAC_ORIGINAL/LAC/LAC_V1.py#L146-L148
He then optimizes log_alpha based on this loss:
https://github.com/rickstaa/LAC_TF2_TORCH_Translation/blob/8150ef056bcbf97d8651fbcaa1de0f91018b08c2/legacy/LAC_ORIGINAL/LAC_ORIGINAL/LAC/LAC_V1.py#L149-L151
I verified this choice with Minghoa, and he said he copied it from Haarnoja. I therefore think he used an old commit in which Haarnoja et al. 2019 were still using log_alpha in the loss function. They initially used log_alpha since it is more numerically stable (see this issue). They, however, reverted this change in #a187972 because it caused confusion. Reverting it reintroduced a numerical instability (see this issue). This was eventually solved by using alpha in the loss function (to keep it in line with the article) while optimizing log_alpha instead of alpha. The correct implementation should therefore be:
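As a rough PyTorch-style sketch of that idea (names such as `log_alpha`, `logp_pi` and `target_entropy` are placeholders, not the actual repository code):

```python
import torch

# Hedged sketch: alpha appears in the loss (in line with the article), but the
# trainable parameter is log_alpha, which keeps the optimization numerically stable.
log_alpha = torch.zeros(1, requires_grad=True)
alpha_optimizer = torch.optim.Adam([log_alpha], lr=3e-4)

target_entropy = -3.0       # e.g. -dim(action_space)
logp_pi = torch.randn(256)  # log pi(a|s) for a batch of sampled actions

alpha = log_alpha.exp()
alpha_loss = -(alpha * (logp_pi + target_entropy).detach()).mean()

alpha_optimizer.zero_grad()
alpha_loss.backward()  # the gradient flows only to log_alpha
alpha_optimizer.step()
```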
@panweihit I went ahead and changed the code in the new versions; I left the incorrect implementation in the legacy code as a reference. Related to this, I propose a change to the labda_loss. We use the following in our current implementation:
https://github.com/rickstaa/LAC_TF2_TORCH_Translation/blob/8150ef056bcbf97d8651fbcaa1de0f91018b08c2/legacy/LAC_ORIGINAL/LAC_ORIGINAL/LAC/LAC_V1.py#L145
It might, however, be clearer to use lambda instead of log_labda, as this is more in line with the formulas in Han et al. 2020.
If we do, we should add a short note explaining why we optimize log_labda/log_alpha instead of lambda/alpha, pointing to the Haarnoja et al. 2019 issue mentioned above.
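A hedged sketch of what the proposed labda_loss could then look like (all names are placeholders; `constraint_term` stands in for the Lyapunov constraint term from Han et al. 2020, whose exact form is omitted here):

```python
import torch

# Hedged sketch: lambda = exp(log_labda) is used in the loss, in line with
# Han et al. 2020, while the optimizer still updates log_labda.
log_labda = torch.zeros(1, requires_grad=True)
labda_optimizer = torch.optim.Adam([log_labda], lr=3e-4)

constraint_term = torch.randn(256)  # placeholder for the Lyapunov constraint term

labda = log_labda.exp()
labda_loss = -(labda * constraint_term.detach()).mean()

labda_optimizer.zero_grad()
labda_loss.backward()
labda_optimizer.step()
```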
Different Critic optimization
I found two differences in the way Haarnoja and Minghoa optimize the critic.
Critic loss function
Haarnoja et al. 2019, Spinning Up and stable-baselines use the following formula for calculating the critic loss:
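Up to notation, this is the soft Bellman residual with the 0.5 factor in front (the same factor that appears in formula 7 of Han et al. 2020):

J_Q(\theta) = \mathbb{E}_{(s_t, a_t) \sim \mathcal{D}}\left[\tfrac{1}{2}\left(Q_\theta(s_t, a_t) - \left(r(s_t, a_t) + \gamma\, \mathbb{E}_{s_{t+1}}\left[V_{\bar{\theta}}(s_{t+1})\right]\right)\right)^2\right]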
Minghoa, however, omits the 0.5 multipliers when calculating the Critic loss (see Actor-critic-with-stability-guarantee/LAC/SAC_cost.py#L147-L148):
I think omitting the 0.5 does not influence the result, as it is mainly added to make the derivation easier: it cancels the factor of 2 that appears when differentiating the squared error and otherwise only rescales the gradient, which the learning rate absorbs (this is also explained in this StackExchange question). This can further be seen from the fact that the Spinning Up authors include the 0.5 in their TensorFlow implementation but omit it in their PyTorch version. We are therefore free to include or exclude it. @panweihit Maybe it is a good idea to include it to prevent possible confusion as to why the code omits it while the multiplication factor is present in formula 7 of Han et al. 2020. Alternatively, I can add a quick note on the topic while excluding it.
Different Critic Bellman backup calculation
There is a difference in how Minghoa and Haarnoja et al. 2019 calculate the Critic target values.
Minghoa uses a target actor to calculate both the log probabilities and the target Q values (see /LAC/SAC_cost.py#L138-L142, /LAC/SAC_cost.py#L135 and /LAC/SAC_cost.py#L96). Haarnoja et al. 2019, stable-baselines and Spinning Up don't have a target Actor and therefore use the main actor for calculating these values, which is in line with equation 6 of Haarnoja et al. 2019. I therefore think that, while implementing the SAC algorithm, Minghoa re-used the entropy calculation from the LAC algorithm, in which we do need a target network. I do not know whether using the target Actor (which follows the main actor using an exponential moving average with a decay factor of 0.995) influences the results. On the one hand, it could cause the SAC algorithm to explore more and therefore get stuck in local optima less often. On the other hand, it could lead to lower data efficiency. In the end, however, both implementations should converge to the same result. @panweihit Let me know what you think. Nevertheless, I propose we stick with Haarnoja's implementation to keep our code consistent with the literature.
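For comparison, a hedged sketch of how these other implementations compute the Bellman backup (the actor/critic objects and their `sample` method are placeholders, not our code):

```python
import torch


def sac_bellman_backup(actor, target_critic1, target_critic2,
                       reward, next_obs, done, gamma, alpha):
    """Hedged sketch of the target-value calculation used by Haarnoja et al.
    2019, Spinning Up and stable-baselines; all argument names are
    placeholders. The next action and its log-probability come from the
    *main* actor (there is no target actor), while the Q values come from
    the target critics."""
    with torch.no_grad():
        next_action, logp_next = actor.sample(next_obs)
        min_q_targ = torch.min(
            target_critic1(next_obs, next_action),
            target_critic2(next_obs, next_action),
        )
        return reward + gamma * (1.0 - done) * (min_q_targ - alpha * logp_next)
```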
Different optimization order
The optimization order differs between Minghoa and Haarnoja et al. 2019.
Minghoa's optimization order
In Minghoa's version, since he uses TensorFlow 1.0, the optimization order is not defined (see this question), except for the target update, which is executed before optimizing the critic due to the specified control dependency (see /LAC/SAC_cost.py#L136).
Haarnoja et al. 2019
Haarnoja et al. 2019 also don't define the optimization order in their old code. In the new code, however, they use the following order:
Update Critic
Update Actor
Update alpha
Update target network
In contrast to Minghoa, Haarnoja therefore applies the target update after all the other parameters have been optimized (see softlearning/algorithms/sac.py#L292-L298).
If we look at other implementations (spinup/algos/tf1/sac/sac.py#L198-L205 and stable_baselines/sac/sac.py#L465-L467), we see that they use the same convention as Haarnoja et al. 2019. I verified this with Minghoa, and he said the choice he made was arbitrary. @panweihit I therefore propose that we use the convention of Haarnoja et al. 2019 in the new code. I don't think changing the order would matter much, but I will do some quick tests.
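A rough sketch of that convention (the loss callables, optimizers and target-update callable are all placeholders, not actual repository code):

```python
def sac_training_step(critic_loss, actor_loss, alpha_loss,
                      critic_optimizer, actor_optimizer, alpha_optimizer,
                      update_targets):
    """Hedged sketch of the update order used by Haarnoja et al. 2019: critic,
    actor and alpha are updated first, and the target networks are updated
    last. The arguments are placeholders (zero-argument callables returning
    the losses sketched above, their optimizers, and a target-update callable)."""
    for loss_fn, optimizer in ((critic_loss, critic_optimizer),
                               (actor_loss, actor_optimizer),
                               (alpha_loss, alpha_optimizer)):
        optimizer.zero_grad()
        loss_fn().backward()
        optimizer.step()
    update_targets()  # e.g. polyak averaging, applied after all other updates
```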
Other differences
Actor loss function
Some SAC implementations (Minghoa's version, Spinning Up and Haarnoja et al. 2019) use the minimum of the two Q estimates (min Q_target) when calculating the actor loss, while others (stable-baselines) use, for example, only Q1. This was discussed in this issue and did not influence the results. We can therefore stick to using the min Q_target, as this matches the original implementation of Haarnoja et al. 2019.
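For reference, a hedged sketch of such an actor loss (placeholder networks; the key point is the minimum over the two Q estimates):

```python
import torch


def sac_actor_loss(actor, critic1, critic2, obs, alpha):
    """Hedged sketch of the actor loss that uses the minimum of the two Q
    estimates, as in Minghoa's version, Spinning Up and Haarnoja et al. 2019;
    all argument names are placeholders."""
    action, logp_pi = actor.sample(obs)
    min_q = torch.min(critic1(obs, action), critic2(obs, action))
    return (alpha * logp_pi - min_q).mean()
```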
Different network output activation functions
In https://github.com/rickstaa/stable-learning-control/issues/97, I also found a difference between Minghoa and the other implementations regarding the output activation functions of the Gaussian actor and the critic.
TODOS: