Closed rickstaa closed 4 years ago
The results seem to be equal. We can therefore safely assume the Spinning Up SAC implementation is equal to the LAC implementation with `use_lyapunov` disabled.
It appears that the LAC PyTorch implementation has a higher offset than the SAC and LAC TensorFlow implementations:
The performance becomes worse after more training steps:
Minghao uses `log_alpha` in the `alpha_loss` formula (see L116). I use `alpha`, since this is in line with how Haarnoja et al. 2019 (see L254) performs automatic temperature tuning. I don't think this should matter much, since both are monotonically increasing functions of x, but maybe there is a good reason Minghao uses `log_alpha`.
In the `loss_lambda` formula, Minghao also uses `log_lambda` where I would expect him to use `lambda` (see L115).
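To illustrate why this choice should not matter much for the update direction, here is a small sketch (plain Python with finite differences; the function and variable names are mine, not taken from either codebase). Because `log` is monotonically increasing, the gradient with respect to `alpha` has the same sign whether the loss is written in `alpha` or in `log_alpha`:

```python
import math

def alpha_loss(alpha, entropy_error):
    # Haarnoja-style temperature loss: alpha * (-log_pi - target_entropy)
    return alpha * entropy_error

def log_alpha_loss(log_alpha, entropy_error):
    # Minghao-style variant: the same loss written in log_alpha
    return log_alpha * entropy_error

def grad(f, x, eps=1e-6):
    # Central finite difference; exact enough for these linear 1-D losses.
    return (f(x + eps) - f(x - eps)) / (2 * eps)

alpha = 0.5
log_alpha = math.log(alpha)
for err in (-1.3, 0.7):  # the entropy error can have either sign
    g_alpha = grad(lambda a: alpha_loss(a, err), alpha)
    # Chain rule: gradient w.r.t. alpha when the loss is written in log_alpha.
    g_from_log = grad(lambda la: log_alpha_loss(la, err), log_alpha) / alpha
    # Same sign -> gradient descent moves alpha in the same direction.
    assert g_alpha * g_from_log > 0
```

The magnitudes differ (the `log_alpha` form rescales the step by `1/alpha`), so the two variants can still converge at different speeds, just not in different directions.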
Since we did not yet find what causes the difference between the two implementations, I will first check whether `use_lyapunov` disables the right parts. After that I can also compare the hyperparameters:
- `max_global_steps`: `steps_per_epoch * epochs`.
- `num_of_trials`: the number of random seeds (agents) you train. Does not exist in my implementation; I only train one agent for `steps_per_epoch * epochs` steps.
- `start_of_trail`: the start index of the folder in which the random seeds are saved. Example (`start_of_trail=4`): `./LAC20200827_0046/4/`, `./LAC20200827_0046/5/`, etc. Does not exist in my implementation.
- `num_of_evaluation_paths`: the number of rollouts/trajectories that are used in the test run. In my implementation `num_test_episodes`.
- `num_of_training_paths`: does not exist in the PyTorch implementation. In addition to the ReplayBuffer, Han also stores the trajectories of the rollouts. It is these trajectories that are used during the performance evaluation (the values that are printed to the).
- `steps_per_cycle`: the number of steps taken before performing the SGD update. In my implementation `update_every`.
- `train_per_cycle`: the number of SGD passes to perform with every SGD cycle. Doesn't exist in my implementation; I simply locked the ratio of env steps to gradient steps to 1, meaning that after `update_every` steps the SGD will run `update_every` times. In Minghao's code this means that after `steps_per_cycle` steps the SGD will be performed `train_per_cycle` times.
- `evaluation_frequency`: end-of-epoch handling (save model, test performance and log data). In my script `steps_per_epoch`.
!!! Might be confusing !!!

- `adaptive_alpha`: whether we want to also train the alpha or keep it fixed. In my implementation `target_entropy`. This variable can have three values: if you supply `"auto"`, the algorithm automatically determines the required `alpha_target` based on the action-space size; if you supply a float, `alpha_target` is set equal to that float; if you supply `None`, the alpha is not trained.

The SquashedGaussian actor is too complex for a one-to-one translation. I therefore had to use the `nn.Module` class instead of the `nn.Sequential` class. When doing this, however, I found a small difference between the LAC code and the SAC (Spinning Up) class.
In both Minghao's code and the code of Haarnoja et al. 2019 ([see L125](https://github.com/haarnoja/sac/blob/8258e33633c7e37833cc39315891e77adfbe14b2/sac/policies/gaussian_policy.py#L125)) the (deterministic) `clipped_mu`, which comes from the `mu` network, is squashed with the Tanh function. In the Spinning Up version, this is not done.
```python
# Spinning Up SAC (PyTorch): the deterministic action is not squashed
mu = self.mu_layer(net_out)
clipped_mu = mu
```

```python
# Minghao's LAC (TensorFlow): the deterministic action is squashed
mu = tf.layers.dense(net_1, self.a_dim, activation=None, name='a', trainable=trainable)
clipped_mu = squash_bijector.forward(mu)
```
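To make the flagged difference concrete, here is a small sketch of what the squash does to the deterministic action (plain Python stand-ins for the network output; the function names are mine, not from either codebase):

```python
import math

def deterministic_action_unsquashed(mu):
    # Without the squash, the deterministic action is the raw network mean,
    # which is unbounded and can leave the action space.
    return mu

def deterministic_action_squashed(mu):
    # With the Tanh squash (LAC / Haarnoja et al. style), the deterministic
    # action is guaranteed to lie in (-1, 1), matching the stochastic path.
    return math.tanh(mu)

mu = 2.5  # an unbounded network output
assert deterministic_action_unsquashed(mu) > 1.0        # outside the bounds
assert -1.0 < deterministic_action_squashed(mu) < 1.0   # always within bounds
```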
This issue was fixed and will be shipped with the next release. See #18 for the release report.
Differences that still exist between the (translated) PyTorch LAC and TensorFlow LAC
The SquashedGaussian actor is too complex for a one-to-one translation. I therefore had to use the `nn.Module` class instead of the `nn.Sequential` class. When doing this, however, I found a small difference between the LAC code and the SAC (Spinning Up) class.
LAC returns a squashed deterministic action during inference.
In both Minghao's code and the code of Haarnoja et al. 2019 ([see L125](https://github.com/haarnoja/sac/blob/8258e33633c7e37833cc39315891e77adfbe14b2/sac/policies/gaussian_policy.py#L125)) the (deterministic) `clipped_mu`, which comes from the `mu` network, is squashed with the Tanh function. In the Spinning Up version, this is not done:

```python
# Spinning Up SAC (PyTorch)
mu = self.mu_layer(net_out)
clipped_mu = mu
```

```python
# Minghao's LAC (TensorFlow)
mu = tf.layers.dense(net_1, self.a_dim, activation=None, name='a', trainable=trainable)
clipped_mu = squash_bijector.forward(mu)
```
I double-checked this again and Spinning Up does squash the mu (see https://github.com/openai/spinningup/blob/038665d62d569055401d91856abb287263096178/spinup/algos/pytorch/sac/core.py#L64).
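The reason the earlier snippet comparison was misleading, as I read the linked code: in the Spinning Up module the Tanh squash is applied after the branch that selects between the mean and a sample, so the deterministic mean is squashed as well. Roughly (a plain-Python paraphrase, not the actual Spinning Up code):

```python
import math
import random

def forward(mu, std, deterministic=False):
    # Paraphrase of the Spinning Up actor forward pass: pick the raw action
    # first (mean or Gaussian sample), then squash it through tanh.
    pi_action = mu if deterministic else random.gauss(mu, std)
    # The squash is applied AFTER the branch, so both paths are bounded.
    return math.tanh(pi_action)

assert forward(3.0, 0.1, deterministic=True) == math.tanh(3.0)
assert -1.0 < forward(0.0, 1.0) < 1.0
```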
User story
In order to be able to ship the LAC/SAC PyTorch implementation to the team, we need to validate whether it gives the same results as the LAC/SAC TensorFlow version.
Considerations
Validate SAC (LAC with `use_lyapunov=False`)
Validate LAC
Acceptance criteria