Now let's add an additional Lagrange multiplier for the contribution of the Q networks.
When I implemented the temperature variable (sigma) for the value component of the actor loss function, I noticed that this Lagrange multiplier (sigma) sometimes increases and sometimes decreases.
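To make this behaviour concrete, below is a minimal PyTorch sketch of how such a multiplier can be updated with a dual-gradient step. The variable names, learning rate, and sign convention of the constraint are my assumptions here and not the exact LAC implementation.

```python
import torch

# Illustrative sketch (not the exact LAC code): a log-parameterised Lagrange
# multiplier sigma that is updated with a dual-gradient step. It grows when
# the (assumed) constraint is violated and shrinks when it is satisfied,
# which would explain the observed increase/decrease behaviour.
log_sigma = torch.zeros(1, requires_grad=True)
sigma_optimizer = torch.optim.Adam([log_sigma], lr=3e-4)  # lr is an assumption


def update_sigma(constraint_value: torch.Tensor) -> float:
    """Dual-ascent step; constraint_value > 0 means the constraint is violated."""
    sigma = log_sigma.exp()
    sigma_loss = -(sigma * constraint_value.detach()).mean()
    sigma_optimizer.zero_grad()
    sigma_loss.backward()
    sigma_optimizer.step()
    return log_sigma.exp().item()
```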
Next, let's minimize the following equality constraint:
Good policy
Bad policy
Closed as this is not on the immediate roadmap.
In this issue, the results of two new architectures, DLAC and LSAC, are compared with the original LAC algorithm. To do this, I will use the oscillator environment. I will also set the environment and algorithm seeds to 0.
LAC results
The original LAC algorithm gives the following results:
Good policy
Bad policy
Conclusion
DLAC results
In the double-Lyapunov Actor-Critic, two Lyapunov critics are used instead of one. The maximum of the two Lyapunov values is then used when calculating the actor loss. This is similar to the double-Q trick used in the original SAC algorithm.
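As a rough sketch of what this trick looks like in code (the critic modules `lc1`/`lc2` and the function name are illustrative, not the actual implementation):

```python
import torch
import torch.nn as nn


def pessimistic_lyapunov_value(lc1: nn.Module, lc2: nn.Module,
                               obs: torch.Tensor, act: torch.Tensor) -> torch.Tensor:
    """Illustrative double-Lyapunov trick: evaluate both Lyapunov critics and
    keep the maximum (most pessimistic) estimate, analogous to taking the
    minimum of the two Q estimates in SAC's double-Q trick."""
    return torch.max(lc1(obs, act), lc2(obs, act))


# In the actor loss, this pessimistic estimate would then replace the single
# Lyapunov value used by the original LAC algorithm.
```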
Results
In its current form, the double-Lyapunov Actor-Critic is not able to train. I, however, think this is due to an error in the implementation. I will postpone researching this architecture until after the PyTorch version is fully ready, as it is easier to debug there.
LSAC
The Lyapunov Soft Actor-Critic (I couldn't think of a better name) contains both a Lyapunov critic and a normal soft (Q) critic. The results of both critics are then combined in the loss function for the policy:
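The exact combination used is not reproduced above, but a minimal sketch of one possible way to combine the two critic terms in the actor loss could look as follows (the weighting and all variable names are assumptions on my part):

```python
import torch


def lsac_actor_loss(l_value: torch.Tensor, q_value: torch.Tensor,
                    log_pi: torch.Tensor, alpha: float, sigma: float) -> torch.Tensor:
    """Illustrative combined actor loss: a SAC-style soft-Q/entropy term plus a
    Lyapunov term weighted by the Lagrange multiplier sigma. The weighting used
    in the actual experiments may differ."""
    sac_term = (alpha * log_pi - q_value).mean()     # standard SAC objective
    lyapunov_term = (sigma * l_value).mean()         # Lyapunov critic contribution
    return sac_term + lyapunov_term
```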
Results