stheid / safety_guarded_rl

Safe Initial policy #1

Closed · stheid closed this 3 years ago

stheid commented 3 years ago

Transfer the LQR controller to PPO.

Variants (the underlying LQR law is sketched below):

  1. make the PPO policy a linear model and directly copy the LQR law into the policy
  2. behaviourally clone the LQR into a "standard NN"
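
For reference, a minimal sketch (not the repository's code) of the LQR law that both variants start from, assuming the continuous-time plant matrices A, B and cost matrices Q, R of the environment are available:

```python
import numpy as np
from scipy.linalg import solve_continuous_are


def lqr_gain(A, B, Q, R):
    """Solve the continuous-time algebraic Riccati equation and return the gain K."""
    P = solve_continuous_are(A, B, Q, R)
    return np.linalg.solve(R, B.T @ P)  # K = R^-1 B^T P


def lqr_act(K, x):
    """Linear control law u = -K x, i.e. the expert that should seed the PPO policy."""
    return -K @ x
```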
stheid commented 3 years ago

To 1: copying the matrices into the model is not trivial; the internal structure of Stable Baselines is quite complex. The proper way to implement it would be to write a custom policy, as described here: https://stable-baselines3.readthedocs.io/en/master/guide/custom_policy.html

However, I don't think it is of much use for the investigation. Therefore I will mark this point as "wontimplement" and continue with behavioural cloning of the LQR controller.
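
A rough sketch of that behavioural cloning step under stated assumptions: `env` follows the old gym reset/step API, `K` is the LQR gain from the sketch above, and `net` is any small PyTorch regressor. This is an illustration, not the repository's implementation.

```python
import numpy as np
import torch as th
from torch import nn


def collect_expert_data(env, K, episodes):
    """Roll out the LQR expert and record (observation, action) pairs."""
    obs_buf, act_buf = [], []
    for _ in range(episodes):
        obs, done = env.reset(), False
        while not done:
            act = -K @ obs
            obs_buf.append(obs)
            act_buf.append(act)
            obs, _, done, _ = env.step(act)
    return (th.as_tensor(np.array(obs_buf), dtype=th.float32),
            th.as_tensor(np.array(act_buf), dtype=th.float32))


def behavioural_cloning(net, observations, actions, epochs=1, lr=1e-3):
    """Regress the network onto the expert actions with an MSE loss."""
    opt = th.optim.Adam(net.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.mse_loss(net(observations), actions)
        loss.backward()
        opt.step()
    return net
```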

stheid commented 3 years ago

To understand the SB code better, I implemented it nevertheless. Stable Baselines has multiple levels of networks involved in creating an action.
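
The weight copy itself can then be quite compact. A hedged sketch, assuming a flat Box observation, a continuous action space and SB3's default ActorCriticPolicy: with net_arch=[] the action mean becomes a single linear layer (action_net), so the LQR gain from above can be written into it directly. Attribute names follow current SB3 and may differ between versions; `env` and `K` are assumed from the earlier sketches.

```python
import torch as th
from stable_baselines3 import PPO

# net_arch=[] removes the hidden layers, so the action mean is a linear
# function of the (flattened) observation: exactly the shape of u = -K x.
model = PPO("MlpPolicy", env, policy_kwargs=dict(net_arch=[]))

with th.no_grad():
    # action_net is the final linear layer producing the Gaussian mean;
    # sampled actions still carry exploration noise unless deterministic=True.
    model.policy.action_net.weight.copy_(th.as_tensor(-K, dtype=th.float32))
    model.policy.action_net.bias.zero_()
```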

stheid commented 3 years ago

Finally I achieved decent behavioural cloning results.

The training after the cloning step even decreases performance a little, but not severely:

       count        mean        std         min         25%         50%         75%         max
lqr    100.0  949.047623  49.595922  850.259757  915.872177  966.129322  992.482517  999.999787
bc     100.0  949.046080  49.596808  850.256670  915.869798  966.127579  992.481915  999.999776
final  100.0  943.078343  55.069499  835.439175  905.530398  961.190409  991.048588  999.813298

  "bc_expert_eps": 10000,
  "bc_train_eps": 1,
  "eps_steps": 1000,
  "eval_eps": 100,
  "train_steps": 100000

With fewer training trajectories to learn from, there is almost no learning visible, even though the number of BC update steps (expert_eps · train_eps) is exactly the same:

       count        mean         std          min          25%          50%         75%         max
lqr    100.0  947.379945   48.888214   839.002495   904.725817   968.144561  991.483306  999.996128
bc     100.0 -862.053960  505.050367 -1001.961472 -1000.791883 -1000.000000 -999.151462  994.238404
final  100.0  108.964674  988.968313 -1003.393488 -1000.840948   957.294230  989.254138  997.798086

  "bc_expert_eps": 100,
  "bc_train_eps": 100,
  "eps_steps": 1000,
  "eval_eps": 100,
  "train_steps": 100000

In this scenario the continued training after BC allows for a quite significant improvement.

In fact, the cloned controller can even outperform the LQR, although it must be noted that the cloned controller is not linear anymore. (Note that the episodes are shorter and the number of expert episodes is increased 10-fold.)


       count       mean        std         min        25%        50%        75%        max
lqr    100.0  54.489127  42.283219  -39.246091  17.597744  72.448360  92.633946  99.996651
bc     100.0  54.491743  42.278817  -39.236334  17.603381  72.446586  92.632861  99.996672
final  100.0  54.175285  41.523940 -100.000000  20.922805  70.192927  91.138021  97.088276

  "bc_expert_eps": 100000,
  "bc_train_eps": 1,
  "eps_steps": 100,
  "eval_eps": 100,
  "train_steps": 100000
stheid commented 3 years ago

Unfortunately, with the given random seeds the LQR is also unsafe.

stheid commented 3 years ago

BC definitely reduced the number of training steps needed:

       count        mean         std          min          25%          50%         75%         max
lqr    100.0  949.047623   49.595922   850.259757   915.872177   966.129322  992.482517  999.999787
final  100.0 -831.020940  459.530430 -1098.463165 -1067.507492 -1027.049562 -830.107261  597.252085

  "eps_steps": 1000,
  "eval_eps": 100,
  "train_steps": 2*10**5
       count        mean         std          min          25%         50%         75%         max
lqr    100.0  949.047623   49.595922   850.259757   915.872177  966.129322  992.482517  999.999787
final  100.0  416.952582  847.711829 -1004.903101 -1000.000000  876.078586  956.963459  993.041752

  "eps_steps": 1000,
  "eval_eps": 100,
  "train_steps": 10**6