rickstaa / LAC-TF2-TORCH-translation

Temporary repository to debug what goes wrong during the translation of the LAC algorithm from TF1 to Torch.

Compare tf eager with tf Graph #9

Closed rickstaa closed 4 years ago

rickstaa commented 4 years ago

Problem description

The translated code is not working when eager execution (the default in TF2) is enabled. It shows similar behaviour to the PyTorch code. I will, therefore, need to compare the two versions to see where they differ.
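
To be explicit about what is being compared: both test scripts run on the same TF2 installation, and the graph-mode script is assumed to switch back to TF1-style graph execution roughly as follows (a sketch, not the actual repository code):

import tensorflow as tf

# TF2 default: eager execution, operations run immediately.
print(tf.executing_eagerly())  # True

# Switch back to TF1-style graph/session execution. This must be called
# before any other TensorFlow operations are created.
tf.compat.v1.disable_eager_execution()
print(tf.executing_eagerly())  # False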

Possible causes

Debug steps

Problems that were encountered

rickstaa commented 4 years ago

The test debug scripts were added in 018164d. They can be found in the sandbox folder.

rickstaa commented 4 years ago

Differences found between eager and graph versions

rickstaa commented 4 years ago

Simple debug script comparison

Below you will find the results of the comparison between the learn function of the Graph test script and the Eager test script.

Nomenclature

Let's first introduce the symbols I use in the tables of this bug report:

Pre-tests

rickstaa commented 4 years ago

Compare the output of the learn function (big network)

Compare results using forward pass through the networks

When we compare the forward passes through all the networks, we get the following result (a short sketch of how such an element-wise comparison can be scripted is included after the table):

| Variable | Eager | Graph |
| --- | --- | --- |
| a | :heavy_check_mark: | :heavy_check_mark: |
| a_ | :heavy_check_mark: | :heavy_check_mark: |
| lyaa | :heavy_check_mark: | :heavy_check_mark: |
| a_det | :heavy_check_mark: | :heavy_check_mark: |
| adet | :heavy_check_mark: | :heavy_check_mark: |
| lya_adet | :heavy_check_mark: | :heavy_check_mark: |
| log_pis | :heavy_check_mark: | :heavy_check_mark: |
| logpis | :heavy_check_mark: | :heavy_check_mark: |
| lya_logpis | :heavy_check_mark: | :heavy_check_mark: |
| l | :heavy_check_mark: | :heavy_check_mark: |
| l_ | :heavy_check_mark: | :heavy_check_mark: |
| lyal | :heavy_check_mark: | :heavy_check_mark: |
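
A minimal sketch of how such an element-wise comparison can be scripted (the .npz file names and tolerances are hypothetical; each test script would first have to dump its tensors, e.g. with np.savez):

import numpy as np

eager_out = np.load("eager_forward.npz")  # tensors dumped by the eager test script
graph_out = np.load("graph_forward.npz")  # tensors dumped by the graph test script

for name in eager_out.files:
    match = np.allclose(eager_out[name], graph_out[name], rtol=1e-5, atol=1e-7)
    print(f"{name}: {'match' if match else 'MISMATCH'}")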

Compare learning variables after one epoch

Here in the Graph script, we can use multiple points to compare the values. First, let's compare the result of running the diagnostic command:

https://github.com/rickstaa/filter_LAC_tf_rewrite/blob/c64feb4205b50d40d7b36f8d72cbd31e2e8ca202/sandbox/compare_test_tf2.py#L590

After this, we can look at what happens if we run the self.opt command:

https://github.com/rickstaa/filter_LAC_tf_rewrite/blob/c64feb4205b50d40d7b36f8d72cbd31e2e8ca202/sandbox/compare_test_tf2.py#L587

Compare diagnostics output

| Variable | Eager | Graph |
| --- | --- | --- |
| l_delta | :heavy_check_mark: | :heavy_check_mark: |
| labda_loss | -0.00041095208 | -0.00041095208 |
| alpha_loss | -0.03361463 | -0.033614628 |
| l_target | :heavy_check_mark: | :heavy_check_mark: |
| l_error | 0.27748555 | 0.27748555 |
| a_loss | :x: -1.2882228 | :x: -1.3716649 |
| entropy | 1.2604759 | 1.2715712 |
| l_ | :white_check_mark: | :white_check_mark: |
| l | :heavy_check_mark: | :heavy_check_mark: |
| a_ | :heavy_check_mark: | :heavy_check_mark: |
| lyaa | :heavy_check_mark: | :heavy_check_mark: |

The numbers are very similar, meaning that both the values and the formulas are correct.

Compare optimization command output

:bug: While trying to do this using the code of Minghoa, I noticed that all the numbers are off. This is, however, caused by the way Minghoa's code retrieves the diagnostics. He currently uses the following commands:

https://github.com/rickstaa/filter_LAC_tf_rewrite/blob/c64feb4205b50d40d7b36f8d72cbd31e2e8ca202/sandbox/compare_test_tf2.py#L587-L590

We therefore get the loss values, etc., after the network has already been optimised. It might make more sense to retrieve the values before and after the optimization:

# Compute diagnostics before the optimization step
diagnostics_before = self.sess.run(self.diagnostics, feed_dict)

# Run the optimization and retrieve the diagnostics in the same session call
diagnostics_during = self.sess.run([self.opt, self.diagnostics], feed_dict)[1]

# Compute diagnostics after the optimization step
diagnostics_after = self.sess.run(self.diagnostics, feed_dict)

# Return the diagnostic variables from around the optimization
return diagnostics_before, diagnostics_during, diagnostics_after

It would be even better to retrieve them during the optimization, but this is not possible since the order of graph execution is not predetermined in TensorFlow 1.0:

return self.sess.run([self.opt, self.diagnostics], feed_dict)[1]

After I did this I got the following results:

| Variable | Eager | Graph |
| --- | --- | --- |
| l_delta | :heavy_check_mark: | :heavy_check_mark: |
| labda_loss | :heavy_check_mark: | :heavy_check_mark: |
| alpha_loss | -0.03361463 | -0.033614628 |
| l_target | :heavy_check_mark: | :heavy_check_mark: |
| l_error | -0.033614628 | 0.27748424 |
| a_loss | :x: -1.2882228 | :x: -1.3716649 |
| entropy | :x: 1.2604759 | :x: 1.3446307 |
| log_pis | :heavy_check_mark: | :heavy_check_mark: |
| logpis | :heavy_check_mark: | :heavy_check_mark: |
| lya_logpis | :heavy_check_mark: | :heavy_check_mark: |
| l | :heavy_check_mark: | :heavy_check_mark: |
| l_ | :white_check_mark: | :white_check_mark: |
| lyal | :heavy_check_mark: | :heavy_check_mark: |
| a | :white_check_mark: | :white_check_mark: |
| a_ | :white_check_mark: | :white_check_mark: |
| lyaa | :heavy_check_mark: | :heavy_check_mark: |

Conclusion

Most of the values look good. The a_loss difference looks small, but we cannot yet say whether it is significant. To check this, we need to make the network smaller to reduce the contribution of small numerical (floating-point) differences.

rickstaa commented 4 years ago

Compare gradients (small network)

Let's compare the gradients after one epoch for the following simple network structure:

NETWORK_STRUCTURE = {
    # "critic": [128, 128],
    "critic": [6, 6],
    # "actor": [64, 64],
    "actor": [6, 6],
}  # The network structure of the agent.

Pre-tests

Main train variables

Diagnostics before

| Variable | Eager | Graph |
| --- | --- | --- |
| l_delta | -0.9528182 | -0.9528182 |
| labda_before | 0.99 | 0.99 |
| alpha_before | 0.99 | 0.99 |
| labda_loss | -0.009576134 | -0.009576134 |
| alpha_loss | -0.032539286 | -0.032539286 |
| l_target | :wavy_dash: | :wavy_dash: |
| a_loss | -2.1685486 | -2.1685486 |
| l_error | 3.0604205 | 3.0604205 |
| entropy | 1.2376348 | 1.2376348 |

From this we can conclude the following:

Diagnostics during

The order of execution is not determined, so we cannot say at which point in the optimization process these variables were retrieved.

| Variable | Eager | Graph |
| --- | --- | --- |
| l_delta | -1.0152079 | -1.0152079 |
| labda_new | 0.989901 | 0.989901 |
| alpha_new | 0.989901 | 0.989901 |
| labda_loss | -0.010203171 | -0.010203171 |
| alpha_loss | -0.033133637 | -0.033133633 |
| l_target | :heavy_check_mark: | :heavy_check_mark: |
| l_error | 3.1166785 | 3.1166794 |
| a_loss | :x: -2.2905588 | :x: -2.2888603 |
| entropy | 1.2987194 | 1.2987194 |

Diagnostics after

| Variable | Eager | Graph |
| --- | --- | --- |
| l_delta | -0.9528182 | -1.1385257 |
| labda_new | 0.989901 | 0.989901 |
| alpha_new | 0.989901 | 0.989901 |
| labda_loss | -0.011554972 | -0.011556407 |
| alpha_loss | -0.032539286 | -0.03346716 |
| l_target | :heavy_check_mark: | :heavy_check_mark: |
| l_error | 3.248716 | 3.2487123 |
| a_loss | :x: -2.4128022 | :x: -2.411079 |
| entropy | 1.2990335 | 1.2987194 |

Lambda grads

Eager:

1.0152079

Graph:

1.0152079

Alpha grads

Eager:

3.296772

Graph:

3.2967722

Actor grads

l1/weights (grads)

Eager:

[[ 5.2500919e-02  1.1620135e-04 -3.0815847e-02  1.6012463e-01
   0.0000000e+00  4.8186995e-02]
 [ 8.9347973e-02 -4.9473706e-04 -3.4921229e-02  2.1411180e-01
   0.0000000e+00  3.2286722e-02]]

Graph:

[[ 0.0754362   0.00113005 -0.04106913  0.22241642  0.          0.06535622]
 [ 0.11783987 -0.00162632 -0.05778477  0.19093093  0.          0.03070358]]

l1/bias (grads)

Eager:

[ 0.11513847 -0.00067456 -0.04952521  0.30863664  0.          0.0613342 ]

Graph:

[ 0.16016103  0.00188379 -0.07841549  0.37294954  0.          0.09540796]

l2/weights (grads)

Eager:

print(a_grads[2].numpy())
[[-0.02456507  0.         -0.00893248  0.04531651  0.         -0.00417774]
 [-0.00039744  0.          0.          0.00834919  0.          0.        ]
 [-0.13620663  0.         -0.01176214  0.21431652  0.         -0.00474387]
 [-0.05777049  0.         -0.00197002  0.09348982  0.         -0.00054915]
 [ 0.          0.          0.          0.          0.          0.        ]
 [-0.00427895  0.          0.          0.02306459  0.          0.        ]]

Graph:

[[-0.02465731  0.          0.00749758  0.02818175  0.          0.00208641]
 [-0.00069051  0.          0.          0.02741862  0.          0.        ]
 [-0.17271051  0.          0.00901626  0.23800313  0.          0.0023673 ]
 [-0.0758779   0.          0.0012326   0.12511091  0.          0.00027334]
 [ 0.          0.          0.          0.          0.          0.        ]
 [-0.00482392  0.          0.          0.06110027  0.          0.        ]]

l2/bias (grads)

Eager:

[-0.19133787  0.         -0.02154562  0.30950645  0.         -0.01008495]

Graph:

[-0.24081628  0.          0.02007396  0.40998226  0.          0.00488109]

mu/weights (grads)

Eager:

print(a_grads[4].numpy())
[[ 1.09453360e-03 -9.82757844e-03]
 [ 0.00000000e+00  0.00000000e+00]
 [-8.12139129e-04 -1.01391226e-03]
 [ 3.15092951e-02 -5.21462001e-02]
 [ 0.00000000e+00  0.00000000e+00]
 [ 1.35928085e-05 -1.19535880e-05]]

Graph:

[[ 3.5226092e-02  2.2917608e-02]
 [ 0.0000000e+00  0.0000000e+00]
 [ 2.5648137e-03  7.8825658e-04]
 [ 2.1228375e-01  9.6085809e-02]
 [ 0.0000000e+00  0.0000000e+00]
 [-2.2931804e-06 -1.1150328e-06]]

mu/bias (grads)

Eager:

[ 0.10589228 -0.1454632 ]

Graph:

[0.76495534 0.40272963]

log_sigma/weights (grads)

Eager:

[[ 1.6550789e-02 -9.0415258e-04]
 [ 0.0000000e+00  0.0000000e+00]
 [ 1.5210815e-03  4.6642643e-04]
 [ 1.1167094e-01  1.3123514e-02]
 [ 0.0000000e+00  0.0000000e+00]
 [ 2.0352798e-05  7.2834946e-06]]

Graph:

[[ 2.1130934e-02  5.6321733e-03]
 [ 0.0000000e+00  0.0000000e+00]
 [ 1.3668446e-03  1.6134559e-03]
 [ 1.5186840e-01  3.3842545e-02]
 [ 0.0000000e+00  0.0000000e+00]
 [-5.4714778e-06 -6.8443464e-06]]

log_sigma/bias (grads)

Eager:

[0.40362728 0.07397197]

Graph:

[0.5506741  0.15411618]

Critic grads

l1/w1_s (grads)

Eager:

[[ 2.9867558   2.2883155   2.1625774  -0.90700424  0.          0.0518655 ]
 [ 3.1527486   2.3180034   2.3671355  -1.0203922   0.          0.005802  ]]

Graph:

[[ 2.9867563   2.2883155   2.1625774  -0.90700424  0.          0.05186551]
 [ 3.1527488   2.3180037   2.3671355  -1.0203921   0.          0.005802  ]]

l1/w1_a (grads)

Eager:

[[ 3.1506193   2.4131808   2.281676   -0.94762385  0.          0.05612363]
 [ 3.3178866   2.4652028   2.4696414  -1.0560501   0.          0.012898  ]]

Graph:

[[ 3.150619    2.4131804   2.2816758  -0.94762385  0.          0.05612364]
 [ 3.3178868   2.4652026   2.4696414  -1.05605     0.          0.01289801]]

l1/b1 (grads)

Eager:

[[ 4.8441653  3.64106    3.5664701 -1.4816256  0.         0.0633439]]

Graph:

[[ 4.8441663   3.6410596   3.5664706  -1.4816254   0.          0.06334392]]

l2/weights (grads)

Eager:

[[5.6584263e+00 2.8719792e+00 3.6909237e+00 4.2060099e+00 2.4260844e-03
  2.5478518e-03]
 [3.3237767e+00 1.6944232e+00 2.1745667e+00 2.4543619e+00 1.6249337e-03
  1.7065422e-03]
 [7.8306751e+00 3.9162765e+00 5.0567374e+00 5.9484262e+00 1.7874738e-03
  1.8767815e-03]
 [1.4201217e+00 7.5646693e-01 9.5715249e-01 9.7712606e-01 1.6036956e-03
  1.6844459e-03]
 [0.0000000e+00 0.0000000e+00 0.0000000e+00 0.0000000e+00 0.0000000e+00
  0.0000000e+00]
 [3.0412067e-02 1.9115413e-02 1.7548755e-02 6.4070537e-03 1.0267831e-03
  1.0789779e-03]]

Graph:

[[5.6584263e+00 2.8719795e+00 3.6909235e+00 4.2060099e+00 2.4260848e-03
  2.5478525e-03]
 [3.3237765e+00 1.6944231e+00 2.1745663e+00 2.4543619e+00 1.6249340e-03
  1.7065426e-03]
 [7.8306746e+00 3.9162760e+00 5.0567374e+00 5.9484258e+00 1.7874741e-03
  1.8767819e-03]
 [1.4201216e+00 7.5646687e-01 9.5715231e-01 9.7712600e-01 1.6036959e-03
  1.6844464e-03]
 [0.0000000e+00 0.0000000e+00 0.0000000e+00 0.0000000e+00 0.0000000e+00
  0.0000000e+00]
 [3.0412070e-02 1.9115414e-02 1.7548757e-02 6.4070541e-03 1.0267833e-03
  1.0789782e-03]]

l2/bias (grads)

Eager:

[4.9451823e+00 2.4973552e+00 3.2077954e+00 3.6949959e+00 2.5034877e-03
 2.6265378e-03]

Graph:

[4.9451823e+00 2.4973550e+00 3.2077959e+00 3.6949949e+00 2.5034882e-03
 2.6265385e-03]

Conclusion

There are two strange things to notice about the above results:

rickstaa commented 4 years ago

Validate optimization order

The optimization order might have caused the differences between the current implementations. When looking at the TensorFlow graph, it looks like the execution order was as follows:

a_loss --> target update --> l_loss

Unfortunately, this is not certain, since I could not find out whether the direction shown in TensorBoard corresponds to the execution direction. Further, the order of execution in TensorFlow 1.0 is not predetermined when control dependencies are not specified, meaning that even if the graph order is shown in TensorBoard, the order in which the diagnostic variables are retrieved is unknown.
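
For completeness, control dependencies are the mechanism that could pin this order. A sketch (hypothetical tensor and variable names, assuming the TF1-style graph of the test script) in which the optimizer op is created inside the control_dependencies block, so that it can only run after the listed diagnostics have been evaluated:

import tensorflow as tf

diagnostics = [l_delta, labda_loss, alpha_loss, l_error, a_loss, entropy]  # existing graph tensors
with tf.control_dependencies(diagnostics):
    # Created inside the block, so this op waits for all diagnostic tensors.
    a_train_op = tf.compat.v1.train.AdamOptimizer(LR_A).minimize(
        a_loss, var_list=a_params
    )

# sess.run([diagnostics, a_train_op], feed_dict) then returns the diagnostics
# as they were before this optimization step.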

Try different orders

Currently, the order in which the gradients of the actor are most similar is:

1. Calculate l_delta based on current weights.
2. Update Critic.
3. Update Labda.
4. Update Alpha.
5. Update Actor.

Conclusion

We cannot draw conclusions yet since the changes are too small. I think it is wiser to first check whether the gradients are working.

rickstaa commented 4 years ago

Validate whether the backpropagation is working

As stated above, we can validate whether the gradients are working by creating a small test script in which we feed a known fixed loss to the optimizer while using the same random seeds in the Eager and Graph versions. I added these scripts in 35b6cb2. Open sandbox/validate_grad_tf2_eager.py and sandbox/validate_grad_tf2.py to compare the gradients.
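
The idea in sketch form (not the actual validate_grad_tf2*.py scripts; the network size, seed, and loss below are made up for illustration): seed everything, build a tiny actor, compute a deterministic loss on a fixed batch, and print the gradients so the eager output can be diffed against the graph script's tf.gradients() output.

import numpy as np
import tensorflow as tf

# Fix all seeds so both versions start from identical weights and inputs.
tf.random.set_seed(0)
np.random.seed(0)

# Tiny stand-in actor network.
actor = tf.keras.Sequential(
    [
        tf.keras.layers.Dense(6, activation="relu", input_shape=(2,)),
        tf.keras.layers.Dense(6, activation="relu"),
        tf.keras.layers.Dense(2),
    ]
)
bs = np.random.rand(256, 2).astype(np.float32)  # fixed input batch

with tf.GradientTape() as tape:
    a_loss = tf.reduce_mean(tf.square(actor(bs)))  # known, deterministic loss

a_grads = tape.gradient(a_loss, actor.trainable_variables)
for var, grad in zip(actor.trainable_variables, a_grads):
    print(var.name, grad.numpy())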

Results gradient compare

:exclamation: The computed gradients are different although the a_loss is equal.

Input variables

| Variable | Eager | Graph |
| --- | --- | --- |
| batch | :wavy_dash: | :wavy_dash: |
| l_delta | -0.9528182 | -0.9528182 |
| labda | 0.99 | 0.99 |
| alpha | 0.99 | 0.99 |
| lyal | :wavy_dash: | :wavy_dash: |
| l | :wavy_dash: | :wavy_dash: |
| l_error | :wavy_dash: | :wavy_dash: |
| entropy | :wavy_dash: | :wavy_dash: |
| a_loss | -1084.2743 | -1084.2743 |

Gradients

Actor grads

Eager:

grad/l1/weights:
[[ 13.408794     2.178104   -11.570398   108.1129       0.      46.277092  ]
 [ 30.369755    -0.23067771 -16.505516    87.53128      0.      22.498222]]

grad/l1/bias:
[ 47.802944    0.4118477 -26.826544  185.92018     0.         60.14121]

grad/l2/weights:
[[-12.058103     0.          -0.33448434  14.862488     0.      -1.8673693 ]
 [ -0.30301327   0.           0.          18.134262     0.      0.        ]
 [-58.740616     0.          -1.113349   119.78081      0.      -2.120421  ]
 [-24.779123     0.          -0.4045168   66.25852      0.      -0.24545981]
 [  0.           0.           0.           0.           0.      0.        ]
 [ -3.3655543    0.           0.          41.697414     0.      0.]]

grad/l2/bias:
[-98.14556     0.          1.5422362 206.91672     0.         -4.5077815]

grad/mu/weights:
[[ 4.81568050e+00 -2.37027216e+00]
 [ 0.00000000e+00  0.00000000e+00]
 [-1.84061453e-01  1.09151885e-01]
 [ 3.29189644e+01 -2.93549042e+01]
 [ 0.00000000e+00  0.00000000e+00]
 [-4.64032125e-03  6.64837938e-03]]

grad/mu/bias:
[110.85972 -87.11501]

grad/log_sigma/weights:
[[5.8719816e+00 5.4135699e+00]
 [0.0000000e+00 0.0000000e+00]
 [2.2099029e-01 3.2953596e-01]
 [7.5656143e+01 6.9250507e+00]
 [0.0000000e+00 0.0000000e+00]
 [6.7091000e-04 9.1812750e-03]]

grad/log_sigma/bias:
[238.21657   47.906483]

Graph:

grad/l1/weights:
[[  1.1570635   2.2042375 -11.598429   88.867676    0.         48.9553   ]
 [ 11.854488   -0.2453581 -16.115313   49.29876     0.         24.385675 ]]

grad/l1/bias:
[ 29.577682     0.16548407 -29.669703   144.31357      0.      64.56108   ]

grad/l2/weights:
[[ -9.905338     0.           4.7838783    3.8737621    0.      -3.4614413 ]
 [ -0.30002874   0.           0.          18.411116     0.      0.]
 [-53.221275     0.           5.242466    79.01278      0.      -3.936808]
 [-23.196867     0.           0.53559065  51.048943     0.      -0.45809162]
 [  0.           0.           0.           0.           0.      0. ]
 [ -4.0506964    0.           0.          42.10633      0.      0.]]

grad/l2/bias:
[-98.10234    0.        16.95784  155.67645    0.        -8.705088]

grad/mu/weights:
[[ 1.62482891e+01  1.15455017e+01]
 [ 0.00000000e+00  0.00000000e+00]
 [ 2.01459303e-01  6.31435871e-01]
 [ 1.06661575e+02  4.60808792e+01]
 [ 0.00000000e+00  0.00000000e+00]
 [-3.92461056e-03  9.14150383e-03]]

grad/mu/bias:
[372.89713 178.99115]

grad/log_sigma/weights:
[[7.4654112e+00 1.1838960e+01]
 [0.0000000e+00 0.0000000e+00]
 [3.2749507e-01 6.5140498e-01]
 [8.7392433e+01 4.8770226e+01]
 [0.0000000e+00 0.0000000e+00]
 [1.4014873e-03 9.9075940e-03]]

grad/log_sigma/bias:
[289.77673 198.32103]

Critic grads

Eager:

grad/l1/w1_s:
[[1595.987     1227.7441    1151.4717    -478.28015      0.
    26.529884 ]
 [1678.313     1235.5564    1258.6863    -540.0067       0.
     3.2418227]]

grad/l1/w1_a:
[[1687.625     1297.5898    1218.1921    -505.27942      0.
    27.287476 ]
 [1734.5205    1291.3756    1288.8228    -548.12964      0.
     5.6431007]]

grad/l1/b1:
[[2585.8762   1950.8157   1897.8624   -787.2867      0.         32.832127]]

grad/l2/weights:
[[3.0088884e+03 1.5339800e+03 1.9696803e+03 2.2246345e+03 1.3444366e+00
  1.4125073e+00]
 [1.7697002e+03 9.0638995e+02 1.1622632e+03 1.2993762e+03 9.0058529e-01
  9.4620943e-01]
 [4.1461562e+03 2.0810303e+03 2.6844573e+03 3.1366470e+03 9.8965234e-01
  1.0395525e+00]
 [7.6887640e+02 4.1175214e+02 5.2082928e+02 5.2538739e+02 8.8927078e-01
  9.3442780e-01]
 [0.0000000e+00 0.0000000e+00 0.0000000e+00 0.0000000e+00 0.0000000e+00
  0.0000000e+00]
 [1.5589552e+01 9.7931461e+00 8.8503208e+00 3.0852060e+00 5.7044661e-01
  5.9966487e-01]]

grad/l2/bias:
[2.6395024e+03 1.3400250e+03 1.7193615e+03 1.9598086e+03 1.3959280e+00
 1.4653397e+00]

Graph:

grad/l1/w1_s:
[[1595.987     1227.7441    1151.4717    -478.28015      0.
    26.529884 ]
 [1678.313     1235.5564    1258.6863    -540.0067       0.
     3.2418227]]

grad/l1/w1_a:
[[1687.625     1297.5898    1218.1921    -505.27942      0.
    27.287476 ]
 [1734.5205    1291.3756    1288.8228    -548.12964      0.
     5.6431007]]

grad/l1/b1:
[[2585.8762   1950.8157   1897.8624   -787.2867      0.         32.832127]]

grad/l2/weights:
[[3.0088884e+03 1.5339800e+03 1.9696803e+03 2.2246345e+03 1.3444366e+00
  1.4125073e+00]
 [1.7697002e+03 9.0638995e+02 1.1622632e+03 1.2993762e+03 9.0058529e-01
  9.4620943e-01]
 [4.1461562e+03 2.0810303e+03 2.6844573e+03 3.1366470e+03 9.8965234e-01
  1.0395525e+00]
 [7.6887640e+02 4.1175214e+02 5.2082928e+02 5.2538739e+02 8.8927078e-01
  9.3442780e-01]
 [0.0000000e+00 0.0000000e+00 0.0000000e+00 0.0000000e+00 0.0000000e+00
  0.0000000e+00]
 [1.5589552e+01 9.7931461e+00 8.8503208e+00 3.0852060e+00 5.7044661e-01
  5.9966487e-01]]

grad/l2/bias:
[2.6395024e+03 1.3400250e+03 1.7193615e+03 1.9598086e+03 1.3959280e+00
 1.4653397e+00]

Conclusion

As we can see from the above results, the gradients computed in the eager version are wrong. We have to find out why this is the case.

rickstaa commented 4 years ago

Debug gradients difference

Debug steps

Are all required gradients computed?

Moving all the code inside the gradient tape and removing the tf.stop_gradient() calls doesn't change the gradients. This is as expected since the gradients of the actor should not be computed w.r.t. alpha, lambda, l and l_.
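
A check that can be added right after the actor tape in the eager learn() function (using the names from the learn() function shown later in this thread) is to print which actor variables receive no gradient at all:

a_grads = a_tape.gradient(a_loss, self.ga.trainable_variables)
for var, grad in zip(self.ga.trainable_variables, a_grads):
    if grad is None:
        print(f"No gradient flows to {var.name}")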

Different Squashed Gaussian actor

Does not solve the problem.

rickstaa commented 4 years ago

@panweihit I opened an issue on the TensorFlow GitHub, as it is very unexpected that the gradients are different when the loss functions, random seeds, weights/biases, and inputs are equal.

rickstaa commented 4 years ago

I think I just found something. This problem is fixed when I change the loss function from:

a_loss = GRAD_SCALE_FACTOR * (
        labda * l_delta + alpha * tf.reduce_mean(input_tensor=log_pis)
    ) 

To:

a_loss = GRAD_SCALE_FACTOR * (alpha * tf.reduce_mean(input_tensor=log_pis))

This means that something goes wrong with the lya_ga_.

rickstaa commented 4 years ago

@panweihit After more debugging using the very helpful TensorBoard debugger, I found the problem. The gradients of the l_delta were None, and thus the agent was only training based on the log probabilities. This was solved by making sure that all the required gradients were taped. The current working version is as follows (a minimal standalone illustration of the None-gradient failure mode is included after the code); this version makes total sense with the formulas in the article.

    @tf.function
    def learn(self, LR_A, LR_L, LR_lag, batch):
        """Runs the SGD to update all the optimize parameters.

        Args:
            LR_A (float): Current actor learning rate.
            LR_L (float): Lyapunov critic learning rate.
            LR_lag (float): Lyapunov constraint langrance multiplier learning rate.
            batch (numpy.ndarray): The batch of experiences.

        Returns:
            [type]: [description]
        """

        # Retrieve state, action and reward from the batch
        bs = batch["s"]  # state
        ba = batch["a"]  # action
        br = batch["r"]  # reward
        bterminal = batch["terminal"]
        bs_ = batch["s_"]  # next state

        # Update target networks
        self.update_target()

        # Get Lyapunov target
        a_, _, _ = self.ga_(bs_)
        l_ = self.lc_([bs_, a_])
        l_target = br + ALG_PARAMS["gamma"] * (1 - bterminal) * tf.stop_gradient(l_)

        # Lyapunov candidate constraint function graph
        with tf.GradientTape() as l_tape:

            # Calculate current lyapunov value
            l = self.lc([bs, ba])

            # Calculate L_backup
            l_error = tf.compat.v1.losses.mean_squared_error(
                labels=l_target, predictions=l
            )

        # Actor loss and optimizer graph
        with tf.GradientTape() as a_tape:

            # Calculate the Lyapunov value of the next state under the current policy
            lya_a_, _, _ = self.ga(bs_)
            lya_l_ = self.lc([bs_, lya_a_])

            # Calculate Lyapunov constraint function
            self.l_delta = tf.reduce_mean(lya_l_ - l + (ALG_PARAMS["alpha3"]) * br)

            # Calculate log probability of a_input based on current policy
            _, _, log_pis = self.ga(bs)

            # Calculate actor loss
            a_loss = self.labda * self.l_delta + self.alpha * tf.reduce_mean(log_pis)

        # Lagrange multiplier loss functions and optimizer graphs
        with tf.GradientTape() as lambda_tape:
            labda_loss = -tf.reduce_mean(self.log_labda * self.l_delta)

        # Calculate alpha loss
        with tf.GradientTape() as alpha_tape:
            alpha_loss = -tf.reduce_mean(
                self.log_alpha * tf.stop_gradient(log_pis + self.target_entropy)
            )  # Trim down

        # Apply lambda gradients
        lambda_grads = lambda_tape.gradient(labda_loss, [self.log_labda])
        self.lambda_train.apply_gradients(zip(lambda_grads, [self.log_labda]))

        # Apply alpha gradients
        alpha_grads = alpha_tape.gradient(alpha_loss, [self.log_alpha])
        self.alpha_train.apply_gradients(zip(alpha_grads, [self.log_alpha]))

        # Apply actor gradients
        a_grads = a_tape.gradient(a_loss, self.ga.trainable_variables)
        self.a_train.apply_gradients(zip(a_grads, self.ga.trainable_variables))

        # Apply critic gradients
        l_grads = l_tape.gradient(l_error, self.lc.trainable_variables)
        self.l_train.apply_gradients(zip(l_grads, self.lc.trainable_variables))

        # Return results
        return (
            self.labda,
            self.alpha,
            l_error,
            tf.reduce_mean(tf.stop_gradient(-log_pis)),
            a_loss,
        )
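
For reference, the failure mode itself is easy to reproduce in isolation (a standalone sketch, unrelated to the repository code): anything computed outside a GradientTape context is treated as a constant, so its gradient with respect to the variables is None.

import tensorflow as tf

w = tf.Variable(2.0)

y_outside = w * 3.0  # computed before the tape starts recording
with tf.GradientTape(persistent=True) as tape:
    y_inside = w * 3.0  # computed while the tape is recording

print(tape.gradient(y_outside, w))  # None -> no gradient flows (what happened to l_delta)
print(tape.gradient(y_inside, w))   # tf.Tensor(3.0, ...)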

In the end, this error, combined with the mistake explained in #7, prevented the agent from learning and resulted in me having to perform a lot of debugging over the last three weeks. On a positive note, I now feel very confident in TF1, TF2, and PyTorch and have a good understanding of both the SAC and LAC algorithms :). More importantly, we now have a clean TF1, TF2, and PyTorch solution! :tada:

rickstaa commented 4 years ago

Fixed in 4b18fb3761bf4f76d66299659e88eb8d756ee69e :tada: