rickstaa / stable-learning-control

A framework for training theoretically stable (and robust) Reinforcement Learning control algorithms.
https://rickstaa.dev/stable-learning-control
MIT License

Validate LAC/SAC pytorch translation #15

Closed. rickstaa closed this issue 4 years ago.

rickstaa commented 4 years ago

User story

To ship the LAC/SAC PyTorch implementation to the team, we need to validate that it gives the same results as the LAC/SAC TensorFlow version.

Considerations

Validate SAC (LAC use_lyapunov=False)

Validate LAC

Acceptance criteria

rickstaa commented 4 years ago

Validate SAC (LAC use_lyapunov=False)

The results appear to be equal. We can therefore safely assume that the Spinning Up SAC implementation is equivalent to the LAC implementation with use_lyapunov disabled.

Learning parameters

Inference

[image: inference results]

Spinning up

[image: Spinning Up results]

LAC (use_lyapunov disabled)

[image: LAC (use_lyapunov disabled) results]

rickstaa commented 4 years ago

Validate LAC

It appears that the LAC PyTorch implementation has a higher offset than the SAC and LAC TensorFlow implementations:

The performance becomes worse after more training steps:

[image: training curves]

Differences between the PyTorch implementation and Minghao's

  1. Minghao uses log_alpha in the alpha_loss formula (see L116). I use alpha, since this is in line with how Haarnoja et al. 2019 (see L254) perform automatic temperature tuning. I don't think this should matter much, since both are monotonically increasing functions, but maybe there is a good reason Minghao uses log_alpha.

  2. In the loss_lambda formula, Minghao also uses log_lambda where I would expect him to use lambda (see L115). A minimal sketch of both loss variants is included after this list.
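The following PyTorch sketch illustrates the two loss variants side by side. The tensor names (log_pi, target_entropy, lyapunov_constraint) and the example values are illustrative assumptions, not identifiers taken from either codebase.

import torch

# Learnable log-scale parameters (both implementations parameterise the temperatures this way;
# "labda" is used because lambda is a reserved word in Python).
log_alpha = torch.zeros(1, requires_grad=True)
log_labda = torch.zeros(1, requires_grad=True)

# Illustrative placeholder tensors, not values from either codebase.
log_pi = torch.tensor([-1.2, -0.8, -1.5])                # log-probabilities of sampled actions
target_entropy = -3.0                                    # e.g. -act_dim
lyapunov_constraint = torch.tensor([0.05, -0.02, 0.10])  # Lyapunov-constraint term per sample

# Variant used here (Haarnoja et al. 2019 style): the loss is built from alpha = exp(log_alpha).
alpha = log_alpha.exp()
alpha_loss = -(alpha * (log_pi + target_entropy).detach()).mean()

# Variant in Minghao's code: the loss uses log_alpha (and log_lambda) directly.
alpha_loss_log = -(log_alpha * (log_pi + target_entropy).detach()).mean()
labda_loss_log = -(log_labda * lyapunov_constraint.detach()).mean()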

rickstaa commented 4 years ago

Since we have not yet found what causes the difference between the two implementations, I will do the following:

After that I can also:

rickstaa commented 4 years ago

Double-check hyperparameters

Hyperparameter translation Han vs Mine

Differences that still exist between the two codebases

rickstaa commented 4 years ago

Differences that still exist between the (translated) PyTorch LAC and TensorFlow LAC

The SquashedGaussian actor is too complex for a one-to-one translation. I therefore had to use the nn.Module class instead of the nn.Sequential class. When doing this, however, I found a small difference between the LAC code and the SAC (Spinning Up) class.

LAC returns a squashed deterministic action during inference.

In both Minghao's code and the code of Haarnoja et al. 2019 ([see L125](https://github.com/haarnoja/sac/blob/8258e33633c7e37833cc39315891e77adfbe14b2/sac/policies/gaussian_policy.py#L125)), the (deterministic) clipped_mu that comes from the mu network is squashed with the tanh function. In the Spinning Up version, this is not done.

SAC version (L49)

mu = self.mu_layer(net_out)
clipped_mu = mu

LAC version (L244)

mu = tf.layers.dense(net_1, self.a_dim, activation=None, name='a', trainable=trainable)
clipped_mu = squash_bijector.forward(mu)
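
To make the difference concrete, here is a minimal PyTorch sketch of the two behaviours described above. The function and layer names are illustrative assumptions, not code from either implementation.

import torch
import torch.nn as nn

def deterministic_action(mu_layer: nn.Linear, net_out: torch.Tensor, squash: bool) -> torch.Tensor:
    """Return the deterministic (inference-time) action.

    squash=True mimics the TensorFlow LAC behaviour (tanh squash of mu);
    squash=False mimics the unsquashed mu in the SAC snippet above.
    """
    mu = mu_layer(net_out)
    return torch.tanh(mu) if squash else mu

# Example: the same features give a bounded action for LAC and an unbounded one for SAC.
mu_layer = nn.Linear(64, 3)
net_out = torch.randn(1, 64)
a_lac = deterministic_action(mu_layer, net_out, squash=True)   # values in (-1, 1)
a_sac = deterministic_action(mu_layer, net_out, squash=False)  # raw network output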
rickstaa commented 4 years ago

This issue was fixed and will be shipped with the next release. See #18 for the release report.

rickstaa commented 1 year ago

(Quoting the earlier comment above about the squashed deterministic action difference between the Spinning Up SAC and the TensorFlow LAC actors.)

I double-checked this again, and Spinning Up does squash mu (see https://github.com/openai/spinningup/blob/038665d62d569055401d91856abb287263096178/spinup/algos/pytorch/sac/core.py#L64).
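
For completeness, a simplified, paraphrased sketch of the logic around the linked Spinning Up line (not a verbatim copy; the standalone setup and layer sizes are illustrative) shows why mu ends up squashed: the tanh is applied after the deterministic branch.

import torch
from torch.distributions.normal import Normal

# Illustrative standalone setup (not the actual Spinning Up module).
mu_layer, log_std_layer = torch.nn.Linear(64, 3), torch.nn.Linear(64, 3)
net_out, deterministic = torch.randn(1, 64), True

mu = mu_layer(net_out)
std = torch.exp(torch.clamp(log_std_layer(net_out), -20, 2))
pi_distribution = Normal(mu, std)

# The deterministic action starts out as the raw mu ...
pi_action = mu if deterministic else pi_distribution.rsample()

# ... but the tanh squash is applied after the branch, so mu is squashed as well.
pi_action = torch.tanh(pi_action)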