stheid / safety_guarded_rl


Experiment with a DDPG #4

Closed stheid closed 3 years ago

stheid commented 3 years ago

Similar to #1, I want to do behavioral cloning and also training from scratch, but with DDPG instead of PPO.
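For the training-from-scratch part, a minimal sketch of what such a run could look like with stable-baselines3, assuming Gaussian exploration noise and using "Pendulum-v1" only as a stand-in for the actual environment in this repo:

```python
# Minimal sketch: DDPG trained from scratch with stable-baselines3.
# "Pendulum-v1" is a placeholder environment, not the one used in this repo.
import numpy as np
import gym
from stable_baselines3 import DDPG
from stable_baselines3.common.noise import NormalActionNoise

env = gym.make("Pendulum-v1")
n_actions = env.action_space.shape[0]

# Gaussian exploration noise added to the deterministic actor output.
action_noise = NormalActionNoise(mean=np.zeros(n_actions),
                                 sigma=0.1 * np.ones(n_actions))

model = DDPG("MlpPolicy", env, action_noise=action_noise, verbose=1)
model.learn(total_timesteps=200_000)  # corresponds to "train_steps": 200000 below
```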

stheid commented 3 years ago

Unfortunately this is not easy. The BC class of the imitation library expects the policy class to implement evaluate_actions(), which TD3Policy does not, so I cannot run behavioral cloning with DDPG out of the box.

Since the DDPG class is again structured somewhat differently from PPO, this is not doable in a couple of hours. For the time being it does not seem worth the effort.

stheid commented 3 years ago

Initially I did some experiments with DDPG trained from scratch. Compared to #1, DDPG learns faster, but the policy after 1M update steps has severely diverged. Again, note that no hyperparameter tuning was done, so the results should be taken with a large grain of salt.

       count        mean         std          min         25%         50%         75%         max
lqr    100.0  949.047623   49.595922   850.259757  915.872177  966.129322  992.482517  999.999787
final  100.0  893.705969  210.657620 -1014.590534  851.942839  942.820218  987.292857  999.273984
  "eps_steps": 1000,
  "eval_eps": 100,
  "train_steps": 200000

↑ This result is significantly better than PPO with five times as many updates.

       count        mean         std         min         25%         50%         75%         max
lqr    100.0  949.047623   49.595922  850.259757  915.872177  966.129322  992.482517  999.999787
final  100.0  415.414463  126.001124  212.520792  292.063602  438.576660  496.112191  662.169255
  "eps_steps": 1000,
  "eval_eps": 100,
  "train_steps": 1000000

Behavioral cloning works much worse than with PPO, although I again need to point out that BC is only carried out with a dummy implementation of the missing evaluate_actions(). I am not sure this function can even be implemented properly for the way BC is done right now: the actions are supposed to stem from a distribution, and the evaluate function queries the likelihood of an action. Since DDPG is deterministic plus noise, I am not sure this is in perfect alignment with the concept, but perhaps it is. I have not spent much time on the concept.
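As an illustration only, a dummy evaluate_actions() of this kind could look roughly as follows. Everything here is an assumption, not the implementation used in this repo: the BC loss is taken to need only (values, log_prob, entropy), and the deterministic actor output is wrapped in a fixed-variance Gaussian purely to obtain a likelihood at all.

```python
# Hedged sketch of a dummy evaluate_actions() for a TD3/DDPG policy.
# The fixed-variance Gaussian around the deterministic action is an assumption.
import torch as th
from torch.distributions import Normal
from stable_baselines3.td3.policies import TD3Policy


class BCCompatibleTD3Policy(TD3Policy):
    """TD3/DDPG policy with a pseudo evaluate_actions() for behavioral cloning."""

    bc_noise_std = 0.1  # hypothetical fixed std of the surrogate Gaussian

    def evaluate_actions(self, obs: th.Tensor, actions: th.Tensor):
        # Deterministic actor output acts as the mean of a surrogate Gaussian.
        mean_actions = self.actor(obs)
        dist = Normal(mean_actions, self.bc_noise_std * th.ones_like(mean_actions))
        log_prob = dist.log_prob(actions).sum(dim=-1)
        entropy = dist.entropy().sum(dim=-1)
        # Minimum critic Q-value of the given action, in place of a state value.
        values = th.cat(self.critic(obs, actions), dim=1).min(dim=1).values
        return values, log_prob, entropy
```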

All in all, behavioral cloning does work a little with the dummy implementation.

       count        mean         std          min          25%          50%         75%         max
lqr    100.0  949.047623   49.595922   850.259757   915.872177   966.129322  992.482517  999.999787
bc     100.0 -644.003221  763.665800 -1002.845810 -1000.609645 -1000.000000 -997.487669  994.701999
final  100.0  342.191169  496.339388  -481.609238     9.336983   244.834932  860.677560  902.735116
  "bc_expert_eps": 10000,
  "bc_train_eps": 1,
  "eps_steps": 1000,
  "eval_eps": 100,
  "train_steps": 100000

       count        mean         std          min          25%          50%         75%         max
lqr    100.0  949.047623   49.595922   850.259757   915.872177   966.129322  992.482517  999.999787
bc     100.0 -685.579418  699.962086 -1002.156745 -1000.626813 -1000.000000 -998.139410  992.454471
final  100.0  519.881668  644.837050 -1099.319561   377.265681   716.106798  963.949821  999.607001
  "bc_expert_eps": 100,
  "bc_train_eps": 100,
  "eps_steps": 1000,
  "eval_eps": 100,
  "train_steps": 100000