The test debug scripts were added in 018164d. They can be found in the sandbox folder:
Below you will find the results of comparing the `learn` function of the Graph test script with that of the Eager test script.
Let's first introduce the symbols I use in the tables of this bug report:
When we compare the forward passes through all the networks we get the following result:
Variable | Eager | Graph |
---|---|---|
a | :heavy_check_mark: | :heavy_check_mark: |
a_ | :heavy_check_mark: | :heavy_check_mark: |
lyaa | :heavy_check_mark: | :heavy_check_mark: |
a_det | :heavy_check_mark: | :heavy_check_mark: |
adet | :heavy_check_mark: | :heavy_check_mark: |
lya_adet | :heavy_check_mark: | :heavy_check_mark: |
log_pis | :heavy_check_mark: | :heavy_check_mark: |
logpis | :heavy_check_mark: | :heavy_check_mark: |
lya_logpis | :heavy_check_mark: | :heavy_check_mark: |
l | :heavy_check_mark: | :heavy_check_mark: |
l_ | :heavy_check_mark: | :heavy_check_mark: |
lyal | :heavy_check_mark: | :heavy_check_mark: |
In the Graph script, we can use multiple points to compare the values. First, let's compare the result of running the diagnostics command:
After this, we can look at what happens when we run the `self.opt` command:
Variable | Eager | Graph |
---|---|---|
l_delta | :heavy_check_mark: | :heavy_check_mark: |
labda_loss | -0.00041095208 | -0.00041095208 |
alpha_loss | -0.03361463 | -0.033614628 |
l_target | :heavy_check_mark: | :heavy_check_mark: |
l_error | 0.27748555 | 0.27748555 |
a_loss | :x: -1.2882228 | :x: -1.3716649 |
entropy | 1.2604759 | 1.2715712 |
l_ | :white_check_mark: | :white_check_mark: |
l | :heavy_check_mark: | :heavy_check_mark: |
a_ | :heavy_check_mark: | :heavy_check_mark: |
lyaa | :heavy_check_mark: | :heavy_check_mark: |
The numbers are very similar, meaning that both the values and the formulas are correct.
:bug: While trying to do this using the code of Minghoa, I noticed that all the numbers are off. This is, however, caused by the way Minghoa retrieves the diagnostics. He currently uses the following commands:
We therefore get the loss values etc. after the network has already been optimized. It might make more sense to retrieve the values before and after the optimization:
```python
# Compute diagnostics before the optimization step
diagnostics_before = self.sess.run(self.diagnostics, feed_dict)

# Run the optimization and retrieve the diagnostics in the same session call
diagnostics_during = self.sess.run([self.opt, self.diagnostics], feed_dict)[1]

# Compute diagnostics after the optimization step
diagnostics_after = self.sess.run(self.diagnostics, feed_dict)

# Return the retrieved diagnostic variables
return diagnostics_before, diagnostics_during, diagnostics_after
```
It would be even better to retrieve them during the optimization, but this is not possible since the order of graph execution is not predetermined in TensorFlow 1.0:

```python
return self.sess.run([self.opt, self.diagnostics], feed_dict)[1]
```
After I did this I got the following results:
Variable | Eager | Graph |
---|---|---|
l_delta | :heavy_check_mark: | :heavy_check_mark: |
labda_loss | :heavy_check_mark: | :heavy_check_mark: |
alpha_loss | -0.03361463 | -0.033614628 |
l_target | :heavy_check_mark: | :heavy_check_mark: |
l_error | -0.033614628 | 0.27748424 |
a_loss | :x: -1.2882228 | :x: -1.3716649 |
entropy | :x: 1.2604759 | :x: 1.3446307 |
log_pis | :heavy_check_mark: | :heavy_check_mark: |
logpis | :heavy_check_mark: | :heavy_check_mark: |
lya_logpis | :heavy_check_mark: | :heavy_check_mark: |
l | :heavy_check_mark: | :heavy_check_mark: |
l_ | :white_check_mark: | :white_check_mark: |
lyal | :heavy_check_mark: | :heavy_check_mark: |
a | :white_check_mark: | :white_check_mark: |
a_ | :white_check_mark: | :white_check_mark: |
lyaa | :heavy_check_mark: | :heavy_check_mark: |
Most of the values look good. The a_loss difference looks small, but we cannot yet say whether it is significant. To determine this, we need to make the network smaller to reduce the contribution of small numerical differences.
Let's compare the gradients after one epoch for the following simple network structure:
```python
NETWORK_STRUCTURE = {
    # "critic": [128, 128],
    "critic": [6, 6],
    # "actor": [64, 64],
    "actor": [6, 6],
}  # The network structure of the agent.
```
Diagnostics before
Variable | Eager | Graph |
---|---|---|
l_delta | -0.9528182 | -0.9528182 |
labda_before | 0.99 | 0.99 |
alpha_before | 0.99 | 0.99 |
labda_loss | -0.009576134 | -0.009576134 |
alpha_loss | -0.032539286 | -0.032539286 |
l_target | :wavy_dash: | :wavy_dash: |
a_loss | -2.1685486 | -2.1685486 |
l_error | 3.0604205 | 3.0604205 |
entropy | 1.2376348 | 1.2376348 |
From this we can conclude the following:
Diagnostics during
The order of execution is not determined, so we cannot say at which point in the optimization process the variables were retrieved.
Variable | Eager | Graph |
---|---|---|
l_delta | -1.0152079 | -1.0152079 |
labda_new | 0.989901 | 0.989901 |
alpha_new | 0.989901 | 0.989901 |
labda_loss | -0.010203171 | -0.010203171 |
alpha_loss | -0.033133637 | -0.033133633 |
l_target | :heavy_check_mark: | :heavy_check_mark: |
l_error | 3.1166785 | 3.1166794 |
a_loss | :x: -2.2905588 | :x: -2.2888603 |
entropy | 1.2987194 | 1.2987194 |
Diagnostics after
Variable | Eager | Graph |
---|---|---|
l_delta | -0.9528182 | -1.1385257 |
labda_new | 0.989901 | 0.989901 |
alpha_new | 0.989901 | 0.989901 |
labda_loss | -0.011554972 | -0.011556407 |
alpha_loss | -0.032539286 | -0.03346716 |
l_target | :heavy_check_mark: | :heavy_check_mark: |
l_error | 3.248716 | 3.2487123 |
a_loss | :x: -2.4128022 | :x: -2.411079 |
entropy | 1.2990335 | 1.2987194 |
Eager:
1.0152079
Graph:
1.0152079
Eager:
3.296772
Graph:
3.2967722
Eager:
[[ 5.2500919e-02 1.1620135e-04 -3.0815847e-02 1.6012463e-01
0.0000000e+00 4.8186995e-02]
[ 8.9347973e-02 -4.9473706e-04 -3.4921229e-02 2.1411180e-01
0.0000000e+00 3.2286722e-02]]
Graph:
[[ 0.0754362 0.00113005 -0.04106913 0.22241642 0. 0.06535622]
[ 0.11783987 -0.00162632 -0.05778477 0.19093093 0. 0.03070358]]
Eager:
[ 0.11513847 -0.00067456 -0.04952521 0.30863664 0. 0.0613342 ]
Graph:
[ 0.16016103 0.00188379 -0.07841549 0.37294954 0. 0.09540796]
Eager:
print(a_grads[2].numpy())
[[-0.02456507 0. -0.00893248 0.04531651 0. -0.00417774]
[-0.00039744 0. 0. 0.00834919 0. 0. ]
[-0.13620663 0. -0.01176214 0.21431652 0. -0.00474387]
[-0.05777049 0. -0.00197002 0.09348982 0. -0.00054915]
[ 0. 0. 0. 0. 0. 0. ]
[-0.00427895 0. 0. 0.02306459 0. 0. ]]
Graph:
[[-0.02465731 0. 0.00749758 0.02818175 0. 0.00208641]
[-0.00069051 0. 0. 0.02741862 0. 0. ]
[-0.17271051 0. 0.00901626 0.23800313 0. 0.0023673 ]
[-0.0758779 0. 0.0012326 0.12511091 0. 0.00027334]
[ 0. 0. 0. 0. 0. 0. ]
[-0.00482392 0. 0. 0.06110027 0. 0. ]]
Eager:
[-0.19133787 0. -0.02154562 0.30950645 0. -0.01008495]
Graph:
[-0.24081628 0. 0.02007396 0.40998226 0. 0.00488109]
Eager:
print(a_grads[4].numpy())
[[ 1.09453360e-03 -9.82757844e-03]
[ 0.00000000e+00 0.00000000e+00]
[-8.12139129e-04 -1.01391226e-03]
[ 3.15092951e-02 -5.21462001e-02]
[ 0.00000000e+00 0.00000000e+00]
[ 1.35928085e-05 -1.19535880e-05]]
Graph:
[[ 3.5226092e-02 2.2917608e-02]
[ 0.0000000e+00 0.0000000e+00]
[ 2.5648137e-03 7.8825658e-04]
[ 2.1228375e-01 9.6085809e-02]
[ 0.0000000e+00 0.0000000e+00]
[-2.2931804e-06 -1.1150328e-06]]
Eager:
[ 0.10589228 -0.1454632 ]
Graph:
[0.76495534 0.40272963]
Eager:
[[ 1.6550789e-02 -9.0415258e-04]
[ 0.0000000e+00 0.0000000e+00]
[ 1.5210815e-03 4.6642643e-04]
[ 1.1167094e-01 1.3123514e-02]
[ 0.0000000e+00 0.0000000e+00]
[ 2.0352798e-05 7.2834946e-06]]
Graph:
[[ 2.1130934e-02 5.6321733e-03]
[ 0.0000000e+00 0.0000000e+00]
[ 1.3668446e-03 1.6134559e-03]
[ 1.5186840e-01 3.3842545e-02]
[ 0.0000000e+00 0.0000000e+00]
[-5.4714778e-06 -6.8443464e-06]]
Eager:
[0.40362728 0.07397197]
Graph:
[0.5506741 0.15411618]
Eager:
[[ 2.9867558 2.2883155 2.1625774 -0.90700424 0. 0.0518655 ]
[ 3.1527486 2.3180034 2.3671355 -1.0203922 0. 0.005802 ]]
Graph:
[[ 2.9867563 2.2883155 2.1625774 -0.90700424 0. 0.05186551]
[ 3.1527488 2.3180037 2.3671355 -1.0203921 0. 0.005802 ]]
Eager:
[[ 3.1506193 2.4131808 2.281676 -0.94762385 0. 0.05612363]
[ 3.3178866 2.4652028 2.4696414 -1.0560501 0. 0.012898 ]]
Graph:
[[ 3.150619 2.4131804 2.2816758 -0.94762385 0. 0.05612364]
[ 3.3178868 2.4652026 2.4696414 -1.05605 0. 0.01289801]]
Eager:
[[ 4.8441653 3.64106 3.5664701 -1.4816256 0. 0.0633439]]
Graph:
[[ 4.8441663 3.6410596 3.5664706 -1.4816254 0. 0.06334392]]
Eager:
[[5.6584263e+00 2.8719792e+00 3.6909237e+00 4.2060099e+00 2.4260844e-03
2.5478518e-03]
[3.3237767e+00 1.6944232e+00 2.1745667e+00 2.4543619e+00 1.6249337e-03
1.7065422e-03]
[7.8306751e+00 3.9162765e+00 5.0567374e+00 5.9484262e+00 1.7874738e-03
1.8767815e-03]
[1.4201217e+00 7.5646693e-01 9.5715249e-01 9.7712606e-01 1.6036956e-03
1.6844459e-03]
[0.0000000e+00 0.0000000e+00 0.0000000e+00 0.0000000e+00 0.0000000e+00
0.0000000e+00]
[3.0412067e-02 1.9115413e-02 1.7548755e-02 6.4070537e-03 1.0267831e-03
1.0789779e-03]]
Graph:
[[5.6584263e+00 2.8719795e+00 3.6909235e+00 4.2060099e+00 2.4260848e-03
2.5478525e-03]
[3.3237765e+00 1.6944231e+00 2.1745663e+00 2.4543619e+00 1.6249340e-03
1.7065426e-03]
[7.8306746e+00 3.9162760e+00 5.0567374e+00 5.9484258e+00 1.7874741e-03
1.8767819e-03]
[1.4201216e+00 7.5646687e-01 9.5715231e-01 9.7712600e-01 1.6036959e-03
1.6844464e-03]
[0.0000000e+00 0.0000000e+00 0.0000000e+00 0.0000000e+00 0.0000000e+00
0.0000000e+00]
[3.0412070e-02 1.9115414e-02 1.7548757e-02 6.4070541e-03 1.0267833e-03
1.0789782e-03]]
Eager:
[4.9451823e+00 2.4973552e+00 3.2077954e+00 3.6949959e+00 2.5034877e-03
2.6265378e-03]
Graph:
[4.9451823e+00 2.4973550e+00 3.2077959e+00 3.6949949e+00 2.5034882e-03
2.6265385e-03]
There are two strange things to notice about the above results:

`a_loss` is different while the variables from which the loss is computed are the same given the forward pass. They are, however, different during the optimization. This might be because in graph mode the order in which the optimizations are executed differs from eager mode, which might have caused the differences between the current implementations. When looking at the TensorFlow graph, it looks like the execution order was as follows:

a_loss --> target update --> l_loss

Unfortunately, this is not certain, since I did not find out whether the Tensorboard direction is the execution direction. Further, the order of execution in TensorFlow 1.0 is not predetermined when control dependencies are not specified, meaning that even if the graph order is shown in Tensorboard, the order in which the diagnostic variables are retrieved is unknown.
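For illustration, here is a minimal TF1-style sketch (toy variables and hypothetical stand-in ops, not the actual LAC graph code) of how `tf.control_dependencies` can pin the execution order when a deterministic order is wanted:

```python
import tensorflow.compat.v1 as tf

tf.disable_v2_behavior()

# Toy stand-ins for the real update ops in the Graph script (hypothetical names).
w = tf.Variable(1.0)
a_update = tf.assign_add(w, 1.0)                 # stand-in for the actor update
with tf.control_dependencies([a_update]):        # force: actor update runs first
    target_update = tf.assign(w, w * 0.5)        # stand-in for the target update
with tf.control_dependencies([target_update]):   # force: critic update runs last
    l_update = tf.assign_sub(w, 0.1)             # stand-in for the critic update

train_op = tf.group(a_update, target_update, l_update)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(train_op)
    print(sess.run(w))  # always (1.0 + 1.0) * 0.5 - 0.1 = 0.9
```

Without such control dependencies, TensorFlow 1.x is free to run the update ops in any order, which is exactly why the "during" diagnostics above cannot be tied to a specific point in the update.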
Currently, the order in which the gradients of the actor are most similar is:
1. Calculate l_delta based on current weights
2. Update Critic.
3. Update Labda
4. Update Alpha
5. Update Actor
We cannot draw conclusions yet since the differences are too small. I think it is wiser to first check whether the gradients are working.
As stated above, we can validate whether the gradients are working by creating a small test script in which we feed a known fixed loss to the optimizer while using the same random seeds in the Eager and Graph versions. I added these scripts in 35b6cb2. Open sandbox/validate_grad_tf2_eager.py and sandbox/validate_grad_tf2.py to compare the gradients.
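As a reference for what such a validation script boils down to, here is a minimal sketch of the Eager side (a toy seeded dense layer and a made-up fixed loss, not the actual LAC actor):

```python
import numpy as np
import tensorflow as tf

SEED = 0
np.random.seed(SEED)
tf.random.set_seed(SEED)

# Toy network with seeded initial weights so the Eager and Graph runs start identically
layer = tf.keras.layers.Dense(
    6, kernel_initializer=tf.keras.initializers.GlorotUniform(seed=SEED)
)
inputs = tf.constant(np.random.rand(2, 8), dtype=tf.float32)

with tf.GradientTape() as tape:
    out = layer(inputs)
    loss = 10.0 * tf.reduce_mean(out)  # known, fixed loss expression

grads = tape.gradient(loss, layer.trainable_variables)
for var, grad in zip(layer.trainable_variables, grads):
    print(var.name, grad.numpy())
```

The Graph counterpart builds the same layer and loss with `tf.compat.v1` and evaluates `tf.gradients(loss, ...)` in a session; with identical seeds, weights and inputs, the printed arrays should match.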
:exclamation: The computed gradients are different although the a_loss is equal.
Variable | Eager | Graph |
---|---|---|
batch | :wavy_dash: | :wavy_dash: |
l_delta | -0.9528182 | -0.9528182 |
labda | 0.99 | 0.99 |
alpha | 0.99 | 0.99 |
lyal | :wavy_dash: | :wavy_dash: |
l | :wavy_dash: | :wavy_dash: |
l_error | :wavy_dash: | :wavy_dash: |
entropy | :wavy_dash: | :wavy_dash: |
a_loss | -1084.2743 | -1084.2743 |
Eager:
grad/l1/weights:
[[ 13.408794 2.178104 -11.570398 108.1129 0. 46.277092 ]
[ 30.369755 -0.23067771 -16.505516 87.53128 0. 22.498222]]
grad/l1/bias:
[ 47.802944 0.4118477 -26.826544 185.92018 0. 60.14121]
grad/l2/weights:
[[-12.058103 0. -0.33448434 14.862488 0. -1.8673693 ]
[ -0.30301327 0. 0. 18.134262 0. 0. ]
[-58.740616 0. -1.113349 119.78081 0. -2.120421 ]
[-24.779123 0. -0.4045168 66.25852 0. -0.24545981]
[ 0. 0. 0. 0. 0. 0. ]
[ -3.3655543 0. 0. 41.697414 0. 0.]]
grad/l2/bias:
[-98.14556 0. 1.5422362 206.91672 0. -4.5077815]
grad/mu/weights:
[[ 4.81568050e+00 -2.37027216e+00]
[ 0.00000000e+00 0.00000000e+00]
[-1.84061453e-01 1.09151885e-01]
[ 3.29189644e+01 -2.93549042e+01]
[ 0.00000000e+00 0.00000000e+00]
[-4.64032125e-03 6.64837938e-03]]
grad/mu/bias:
[110.85972 -87.11501]
grad/log_sigma/weights:
[[5.8719816e+00 5.4135699e+00]
[0.0000000e+00 0.0000000e+00]
[2.2099029e-01 3.2953596e-01]
[7.5656143e+01 6.9250507e+00]
[0.0000000e+00 0.0000000e+00]
[6.7091000e-04 9.1812750e-03]]
grad/log_sigma/bias:
[238.21657 47.906483]
Graph:
grad/l1/weights:
[[ 1.1570635 2.2042375 -11.598429 88.867676 0. 48.9553 ]
[ 11.854488 -0.2453581 -16.115313 49.29876 0. 24.385675 ]]
grad/l1/bias:
[ 29.577682 0.16548407 -29.669703 144.31357 0. 64.56108 ]
grad/l2/weights:
[[ -9.905338 0. 4.7838783 3.8737621 0. -3.4614413 ]
[ -0.30002874 0. 0. 18.411116 0. 0.]
[-53.221275 0. 5.242466 79.01278 0. -3.936808]
[-23.196867 0. 0.53559065 51.048943 0. -0.45809162]
[ 0. 0. 0. 0. 0. 0. ]
[ -4.0506964 0. 0. 42.10633 0. 0.]]
grad/l2/bias:
[-98.10234 0. 16.95784 155.67645 0. -8.705088]
grad/mu/weights:
[[ 1.62482891e+01 1.15455017e+01]
[ 0.00000000e+00 0.00000000e+00]
[ 2.01459303e-01 6.31435871e-01]
[ 1.06661575e+02 4.60808792e+01]
[ 0.00000000e+00 0.00000000e+00]
[-3.92461056e-03 9.14150383e-03]]
grad/mu/bias:
[372.89713 178.99115]
grad/log_sigma/weights:
[[7.4654112e+00 1.1838960e+01]
[0.0000000e+00 0.0000000e+00]
[3.2749507e-01 6.5140498e-01]
[8.7392433e+01 4.8770226e+01]
[0.0000000e+00 0.0000000e+00]
[1.4014873e-03 9.9075940e-03]]
grad/log_sigma/bias:
[289.77673 198.32103]
Eager:
grad/l1/w1_s:
[[1595.987 1227.7441 1151.4717 -478.28015 0.
26.529884 ]
[1678.313 1235.5564 1258.6863 -540.0067 0.
3.2418227]]
grad/l1/w1_a:
[[1687.625 1297.5898 1218.1921 -505.27942 0.
27.287476 ]
[1734.5205 1291.3756 1288.8228 -548.12964 0.
5.6431007]]
grad/l1/b1:
[[2585.8762 1950.8157 1897.8624 -787.2867 0. 32.832127]]
grad/l2/weights:
[[3.0088884e+03 1.5339800e+03 1.9696803e+03 2.2246345e+03 1.3444366e+00
1.4125073e+00]
[1.7697002e+03 9.0638995e+02 1.1622632e+03 1.2993762e+03 9.0058529e-01
9.4620943e-01]
[4.1461562e+03 2.0810303e+03 2.6844573e+03 3.1366470e+03 9.8965234e-01
1.0395525e+00]
[7.6887640e+02 4.1175214e+02 5.2082928e+02 5.2538739e+02 8.8927078e-01
9.3442780e-01]
[0.0000000e+00 0.0000000e+00 0.0000000e+00 0.0000000e+00 0.0000000e+00
0.0000000e+00]
[1.5589552e+01 9.7931461e+00 8.8503208e+00 3.0852060e+00 5.7044661e-01
5.9966487e-01]]
grad/l2/bias:
[2.6395024e+03 1.3400250e+03 1.7193615e+03 1.9598086e+03 1.3959280e+00
1.4653397e+00]
Graph:
grad/l1/w1_s:
[[1595.987 1227.7441 1151.4717 -478.28015 0.
26.529884 ]
[1678.313 1235.5564 1258.6863 -540.0067 0.
3.2418227]]
grad/l1/w1_a:
[[1687.625 1297.5898 1218.1921 -505.27942 0.
27.287476 ]
[1734.5205 1291.3756 1288.8228 -548.12964 0.
5.6431007]]
grad/l1/b1:
[[2585.8762 1950.8157 1897.8624 -787.2867 0. 32.832127]]
grad/l2/weights:
[[3.0088884e+03 1.5339800e+03 1.9696803e+03 2.2246345e+03 1.3444366e+00
1.4125073e+00]
[1.7697002e+03 9.0638995e+02 1.1622632e+03 1.2993762e+03 9.0058529e-01
9.4620943e-01]
[4.1461562e+03 2.0810303e+03 2.6844573e+03 3.1366470e+03 9.8965234e-01
1.0395525e+00]
[7.6887640e+02 4.1175214e+02 5.2082928e+02 5.2538739e+02 8.8927078e-01
9.3442780e-01]
[0.0000000e+00 0.0000000e+00 0.0000000e+00 0.0000000e+00 0.0000000e+00
0.0000000e+00]
[1.5589552e+01 9.7931461e+00 8.8503208e+00 3.0852060e+00 5.7044661e-01
5.9966487e-01]]
grad/l2/bias:
[2.6395024e+03 1.3400250e+03 1.7193615e+03 1.9598086e+03 1.3959280e+00
1.4653397e+00]
As we can see from the above results, the gradients that are computed in the Eager version are wrong. We have to find out why this is the case.
Moving all the code inside the gradient tape and removing the `tf.stop_gradient()` calls doesn't change the gradients. This is as expected, since the gradients of the actor should not be computed w.r.t. alpha, lambda, l and l_. It does not solve the problem.
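A toy illustration of why this is expected (not the agent code): wrapping a term in `tf.stop_gradient` only matters for variables we actually differentiate with respect to, and the actor gradients are only taken w.r.t. the actor weights:

```python
import tensorflow as tf

actor_w = tf.Variable(2.0)   # stands in for an actor weight
alpha = tf.Variable(0.99)    # stands in for alpha / lambda / l / l_
with tf.GradientTape() as tape:
    loss = alpha * actor_w ** 2        # no stop_gradient around alpha
grad = tape.gradient(loss, [actor_w])  # differentiate only w.r.t. the actor weight
print(grad[0].numpy())                 # 3.96 = 0.99 * 2 * 2.0, same as with tf.stop_gradient(alpha)
```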
@panweihit I opened an issue on the TensorFlow GitHub, as it is very unexpected that the gradients are different when the loss functions, random seeds, weights/biases and inputs are equal.
I think I just found something. This problem is fixed when I change the loss function from:

```python
a_loss = GRAD_SCALE_FACTOR * (
    labda * l_delta + alpha * tf.reduce_mean(input_tensor=log_pis)
)
```

to:

```python
a_loss = GRAD_SCALE_FACTOR * (alpha * tf.reduce_mean(input_tensor=log_pis))
```

Meaning that something goes wrong with the `lya_ga_`.
@panweihit After more debugging using the very helpful TensorBoard debugger, I found the problem. The gradients of the l_delta were `None`, and thus the agent was only training based on the log probabilities. This was solved by making sure that all the required operations were recorded on the gradient tape. The current working version is as follows. This version makes total sense with the formulas in the article.
```python
@tf.function
def learn(self, LR_A, LR_L, LR_lag, batch):
    """Runs the SGD to update all the optimizable parameters.

    Args:
        LR_A (float): Current actor learning rate.
        LR_L (float): Lyapunov critic learning rate.
        LR_lag (float): Lyapunov constraint Lagrange multiplier learning rate.
        batch (numpy.ndarray): The batch of experiences.

    Returns:
        tuple: The lambda and alpha Lagrange multipliers, the Lyapunov critic
            error, the policy entropy and the actor loss.
    """
    # Retrieve state, action and reward from the batch
    bs = batch["s"]  # state
    ba = batch["a"]  # action
    br = batch["r"]  # reward
    bterminal = batch["terminal"]
    bs_ = batch["s_"]  # next state

    # Update target networks
    self.update_target()

    # Get Lyapunov target
    a_, _, _ = self.ga_(bs_)
    l_ = self.lc_([bs_, a_])
    l_target = br + ALG_PARAMS["gamma"] * (1 - bterminal) * tf.stop_gradient(l_)

    # Lyapunov candidate constraint function graph
    with tf.GradientTape() as l_tape:
        # Calculate current Lyapunov value
        l = self.lc([bs, ba])

        # Calculate L_backup
        l_error = tf.compat.v1.losses.mean_squared_error(
            labels=l_target, predictions=l
        )

    # Actor loss and optimizer graph
    with tf.GradientTape() as a_tape:
        # Calculate current value and target Lyapunov multiplier value
        lya_a_, _, _ = self.ga(bs_)
        lya_l_ = self.lc([bs_, lya_a_])

        # Calculate Lyapunov constraint function
        self.l_delta = tf.reduce_mean(lya_l_ - l + (ALG_PARAMS["alpha3"]) * br)

        # Calculate log probability of a_input based on current policy
        _, _, log_pis = self.ga(bs)

        # Calculate actor loss
        a_loss = self.labda * self.l_delta + self.alpha * tf.reduce_mean(log_pis)

    # Lagrange multiplier loss functions and optimizers graphs
    with tf.GradientTape() as lambda_tape:
        labda_loss = -tf.reduce_mean(self.log_labda * self.l_delta)

    # Calculate alpha loss
    with tf.GradientTape() as alpha_tape:
        alpha_loss = -tf.reduce_mean(
            self.log_alpha * tf.stop_gradient(log_pis + self.target_entropy)
        )

    # Apply lambda gradients
    lambda_grads = lambda_tape.gradient(labda_loss, [self.log_labda])
    self.lambda_train.apply_gradients(zip(lambda_grads, [self.log_labda]))

    # Apply alpha gradients
    alpha_grads = alpha_tape.gradient(alpha_loss, [self.log_alpha])
    self.alpha_train.apply_gradients(zip(alpha_grads, [self.log_alpha]))

    # Apply actor gradients
    a_grads = a_tape.gradient(a_loss, self.ga.trainable_variables)
    self.a_train.apply_gradients(zip(a_grads, self.ga.trainable_variables))

    # Apply critic gradients
    l_grads = l_tape.gradient(l_error, self.lc.trainable_variables)
    self.l_train.apply_gradients(zip(l_grads, self.lc.trainable_variables))

    # Return diagnostics (the entropy is the mean negative log probability)
    return (
        self.labda,
        self.alpha,
        l_error,
        tf.reduce_mean(tf.stop_gradient(-log_pis)),
        a_loss,
    )
```
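The `None` gradient is easy to reproduce in isolation (toy example): when a tensor is computed outside the `tf.GradientTape` context, the tape has no recorded path back to the variable and `tape.gradient` returns `None`, which is what happened to the l_delta term before the fix:

```python
import tensorflow as tf

x = tf.Variable(2.0)
y = x * 3.0                    # computed OUTSIDE the tape, so never recorded
with tf.GradientTape() as tape:
    z = y + 1.0                # only this op is recorded on the tape
print(tape.gradient(z, x))     # None: the path through y was not taped
```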
In the end, this error, combined with the mistake explained in #7, prevented the agent from learning and resulted in me having to perform a lot of debugging in the last three weeks. On a positive note, I now feel very confident in tf1, tf2, and PyTorch and have a good understanding of both the SAC and LAC algorithms :). More importantly, we now have a clean tf1, tf2, and PyTorch solution! :tada:
Fixed in 4b18fb3761bf4f76d66299659e88eb8d756ee69e :tada:
Problem description
The translated code is not working when eager execution (the default in tf2) is enabled. It has similar behaviour to the PyTorch code. I will therefore need to compare the two versions to see where they differ.
Possible causes
Debug steps
- `learn` function: compare the results for similar input.

Problems that were encountered
- I wasn't able to get the same output from the gaussian actor while the random seed was set.