Now let's add an additional Lagrange multiplier for the contribution of the Q networks.
When I implemented the temperature variable (sigma) for the value component of the actor loss function, I noticed that this Lagrange multiplier (sigma) sometimes increases and sometimes decreases.
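To make this behaviour concrete, below is a minimal PyTorch sketch of how such a multiplier can be updated with a dual-gradient step. The variable names, learning rate, and sign convention of the constraint are my assumptions here and not the exact LAC implementation.

```python
import torch

# Illustrative sketch (not the exact LAC code): a log-parameterised Lagrange
# multiplier sigma that is updated with a dual-gradient step. It grows when
# the (assumed) constraint is violated and shrinks when it is satisfied,
# which would explain the observed increase/decrease behaviour.
log_sigma = torch.zeros(1, requires_grad=True)
sigma_optimizer = torch.optim.Adam([log_sigma], lr=3e-4)  # lr is an assumption


def update_sigma(constraint_value: torch.Tensor) -> float:
    """Dual-ascent step; constraint_value > 0 means the constraint is violated."""
    sigma = log_sigma.exp()
    sigma_loss = -(sigma * constraint_value.detach()).mean()
    sigma_optimizer.zero_grad()
    sigma_loss.backward()
    sigma_optimizer.step()
    return log_sigma.exp().item()
```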
Next, let's minimize the following equality constraint:
Good policy
Bad policy
Closed as this is not on the immediate roadmap.
In this issue, the results of two new architectures, DLAC and LSAC, are compared with the original LAC algorithm. To do this, I will use the oscillator environment. I will also set the environment and algorithm seeds to 0.
LAC results
The original LAC algorithm gives the following results:
Good policy
Bad policy
Conclusion
DLAC results
In the double-Lyapunov Actor-Critic, two Lyapunov critics are used instead of one. The maximum of the two Lyapunov values is then used when calculating the actor loss. This is similar to the double-Q trick used in the original SAC algorithm.
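As a rough sketch of what this trick looks like in code (the critic modules `lc1`/`lc2` and the function name are illustrative, not the actual implementation):

```python
import torch
import torch.nn as nn


def pessimistic_lyapunov_value(lc1: nn.Module, lc2: nn.Module,
                               obs: torch.Tensor, act: torch.Tensor) -> torch.Tensor:
    """Illustrative double-Lyapunov trick: evaluate both Lyapunov critics and
    keep the maximum (most pessimistic) estimate, analogous to taking the
    minimum of the two Q estimates in SAC's double-Q trick."""
    return torch.max(lc1(obs, act), lc2(obs, act))


# In the actor loss, this pessimistic estimate would then replace the single
# Lyapunov value used by the original LAC algorithm.
```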
Results
In its current form, the double-Lyapunov Actor-Critic is not able to train. I, however, think this is due to an error in the implementation. I will postpone researching this architecture until after the PyTorch version is fully ready, as it is easier to debug there.
LSAC
The Lyapunov Soft Actor-Critic (I couldn't think of a better name) contains both a Lyapunov critic and a normal soft (Q) critic. The results of both critics are then combined in the loss function for the policy:
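The exact combination used is not reproduced above, but a minimal sketch of one possible way to combine the two critic terms in the actor loss could look as follows (the weighting and all variable names are assumptions on my part):

```python
import torch


def lsac_actor_loss(l_value: torch.Tensor, q_value: torch.Tensor,
                    log_pi: torch.Tensor, alpha: float, sigma: float) -> torch.Tensor:
    """Illustrative combined actor loss: a SAC-style soft-Q/entropy term plus a
    Lyapunov term weighted by the Lagrange multiplier sigma. The weighting used
    in the actual experiments may differ."""
    sac_term = (alpha * log_pi - q_value).mean()     # standard SAC objective
    lyapunov_term = (sigma * l_value).mean()         # Lyapunov critic contribution
    return sac_term + lyapunov_term
```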
Results