To 1: copying the matrices into the model is not trivial; the internal structure of Stable Baselines is quite complex. The proper way to implement it would be to write a custom policy like the following: https://stable-baselines3.readthedocs.io/en/master/guide/custom_policy.html
However, I don't think it is of much use for the investigation. Therefore I will call this point "wontimplement" and continue with behavioral cloning of the LQR controller.
To understand the SB3 code better, I implemented it nevertheless. Stable Baselines involves multiple levels of networks to produce an action.
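For illustration, here is a minimal sketch of what such an initialization could look like. This is not the code from this repo: the env (`Pendulum-v1`), the gain matrix `K`, and the `net_arch` choice are placeholder assumptions, but the attribute names (`features_extractor`, `mlp_extractor`, `action_net`, `value_net`) are the actual layers SB3's `ActorCriticPolicy` chains together to produce an action:

```python
# Minimal sketch, NOT the code from this repo: copies a hypothetical LQR gain
# matrix K into an SB3 PPO policy. With an empty `pi` net_arch, the latent
# passed to action_net is just the (flattened) observation, so the policy
# mean becomes u = -K x, i.e. a linear LQR-style controller.
import gym
import numpy as np
import torch
from stable_baselines3 import PPO

env = gym.make("Pendulum-v1")     # placeholder env (3 obs dims, 1 action dim)
K = np.array([[1.0, 0.5, 0.1]])   # hypothetical LQR gain, shape (n_act, n_obs)

model = PPO(
    "MlpPolicy",
    env,
    # net_arch format of SB3 1.x; the value branch keeps a default MLP
    policy_kwargs=dict(net_arch=[dict(pi=[], vf=[64, 64])]),
)

# Action pipeline in SB3: features_extractor -> mlp_extractor -> action_net
# (value_net branches off the mlp_extractor's vf head).
with torch.no_grad():
    model.policy.action_net.weight.copy_(torch.as_tensor(-K, dtype=torch.float32))
    model.policy.action_net.bias.zero_()
```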
Finally, I achieved decent behavioral cloning results.
The training after the cloning step even decreases performance a little, but not severely:
|       | count | mean       | std       | min        | 25%        | 50%        | 75%        | max        |
|-------|-------|------------|-----------|------------|------------|------------|------------|------------|
| lqr   | 100.0 | 949.047623 | 49.595922 | 850.259757 | 915.872177 | 966.129322 | 992.482517 | 999.999787 |
| bc    | 100.0 | 949.046080 | 49.596808 | 850.256670 | 915.869798 | 966.127579 | 992.481915 | 999.999776 |
| final | 100.0 | 943.078343 | 55.069499 | 835.439175 | 905.530398 | 961.190409 | 991.048588 | 999.813298 |
"bc_expert_eps": 10000,
"bc_train_eps": 1,
"eps_steps": 1000,
"eval_eps": 100,
"train_steps": 100000
With fewer training trajectories to learn from, almost no learning is visible, even though the number of BC update steps (expert_eps · train_eps) is exactly the same:
|       | count | mean        | std        | min          | 25%          | 50%          | 75%         | max        |
|-------|-------|-------------|------------|--------------|--------------|--------------|-------------|------------|
| lqr   | 100.0 | 947.379945  | 48.888214  | 839.002495   | 904.725817   | 968.144561   | 991.483306  | 999.996128 |
| bc    | 100.0 | -862.053960 | 505.050367 | -1001.961472 | -1000.791883 | -1000.000000 | -999.151462 | 994.238404 |
| final | 100.0 | 108.964674  | 988.968313 | -1003.393488 | -1000.840948 | 957.294230   | 989.254138  | 997.798086 |
"bc_expert_eps": 100,
"bc_train_eps": 100,
"eps_steps": 1000,
"eval_eps": 100,
"train_steps": 100000
In this scenario, the continued training after BC allows for quite a significant improvement.
In fact, the cloned controller can even outperform the LQR, although it must be noted that the cloned controller is not linear anymore (note that the episodes are shorter and the number of expert episodes is increased 10-fold):
|       | count | mean      | std       | min         | 25%       | 50%       | 75%       | max       |
|-------|-------|-----------|-----------|-------------|-----------|-----------|-----------|-----------|
| lqr   | 100.0 | 54.489127 | 42.283219 | -39.246091  | 17.597744 | 72.448360 | 92.633946 | 99.996651 |
| bc    | 100.0 | 54.491743 | 42.278817 | -39.236334  | 17.603381 | 72.446586 | 92.632861 | 99.996672 |
| final | 100.0 | 54.175285 | 41.523940 | -100.000000 | 20.922805 | 70.192927 | 91.138021 | 97.088276 |
"bc_expert_eps": 100000,
"bc_train_eps": 1,
"eps_steps": 100,
"eval_eps": 100,
"train_steps": 100000
Unfortunately, with the given random seeds the LQR is also unsafe.
BC definitely reduced the number of training steps needed; the following runs without the cloning step illustrate this:
|       | count | mean        | std        | min          | 25%          | 50%          | 75%         | max        |
|-------|-------|-------------|------------|--------------|--------------|--------------|-------------|------------|
| lqr   | 100.0 | 949.047623  | 49.595922  | 850.259757   | 915.872177   | 966.129322   | 992.482517  | 999.999787 |
| final | 100.0 | -831.020940 | 459.530430 | -1098.463165 | -1067.507492 | -1027.049562 | -830.107261 | 597.252085 |
"eps_steps": 1000,
"eval_eps": 100,
"train_steps": 2*10**5
|       | count | mean       | std        | min          | 25%          | 50%        | 75%        | max        |
|-------|-------|------------|------------|--------------|--------------|------------|------------|------------|
| lqr   | 100.0 | 949.047623 | 49.595922  | 850.259757   | 915.872177   | 966.129322 | 992.482517 | 999.999787 |
| final | 100.0 | 416.952582 | 847.711829 | -1004.903101 | -1000.000000 | 876.078586 | 956.963459 | 993.041752 |
"eps_steps": 1000,
"eval_eps": 100,
"train_steps": 10**6
Transfer LQR controller to PPO
Variants: