Let's first test the training performance of the following LAC versions in the CartPoleCost environment:
Let's also quickly investigate the following SAC versions:
Experiment file: experiments/gpl_2021/lac_cart_pole_cost.yml
As we already know, LAC works.
Experiment file: experiments/gpl_2021/sac_cart_pole_cost.yml
As we already know, SAC also performs well on the CartPoleCost environment.
Experiment file: experiments/gpl_2021/sac2_cart_pole_cost.yml
Seems to work fine.
Experiment file: experiments/gpl_2021/lac2_cart_pole_cost.yml
Also works.
Experiment file: experiments/gpl_2021/lac3_cart_pole_cost.yml
Also works.
Experiment file: experiments/gpl_2021/lac4_cart_pole_cost.yml
Also works, but after this first test the performance looks worse. This could also be due to random factors.
Experiment file: experiments/gpl_2021/lac5_cart_pole_cost.yml
Works as expected.
Experiment file: experiments/gpl_2021/lac6_cart_pole_cost.yml
Works as expected.
All algorithms are able to train. For simplicity, let's first work with LAC4; we can make the other changes later. For this algorithm, we should compare its robustness against disturbances with that of the original LAC algorithm.
Seems to work fine.
Seems to give the same results as the original LAC.
As in Han et al. 2020, the robustness is lower than that of the LAC algorithm. Related to this, the algorithm also has a higher death rate.
Seems to give the same results as the original LAC.
The alpha3*R term can be dropped and a simple alpha3 term used instead. This results in a softer version of Lyapunov stability (the derivative is less negative), but this version can be used to make any cost function stable in the sense of Lyapunov (more practical).

@panweihit Let's evaluate the new LAC4 and compare it with SAC for multiple environments, but now let it train for 1e6 steps:
Good performance; it looks better than SAC but worse than LAC4.
Performance and robustness look better than LAC (this could still be a seeding effect). It also looks better than SAC.
Performance and robustness look worse than both LAC versions.
Training for 3e5 steps looks to be enough in the Oscillator-v1 environment.

@dds0117 I had a meeting with @panweihit yesterday to discuss the results of the tests above and the continuation of our research. Below you will find the notes from that meeting.
The Lyapunov constraint was changed from

l_delta = torch.mean(lya_l_ - l1.detach() + self._alpha3 * r)  # See Han eq. 11

to

l_delta = torch.mean(lya_l_.min() - l1.detach() + self._alpha3 * r)  # See Han eq. 11
In this new version, only the minimum Lyapunov value of the current best action is used to check if the constraint is violated.
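For clarity, here is a minimal, self-contained sketch of this constraint term as a helper function. The tensor names follow the snippet above; the function wrapper and the shape/meaning comments are assumptions on my part, not the repository's actual implementation.

```python
import torch


def lyapunov_constraint(lya_l_, l1, r, alpha3):
    """Sketch of the modified Lyapunov constraint term (cf. Han et al. 2020, eq. 11).

    Assumed inputs (names follow the snippet above):
      lya_l_ : Lyapunov critic values for the next state under the current policy.
      l1     : Lyapunov critic values for the current state-action pair.
      r      : cost received for the current transition.
      alpha3 : small positive constant weighting the cost term.
    """
    # `.min()` keeps only the smallest Lyapunov target, so the required decrease is
    # checked against the best candidate instead of the batch mean.
    # `l1` is detached so gradients only flow through the Lyapunov targets.
    return torch.mean(lya_l_.min() - l1.detach() + alpha3 * r)
```

In LAC4 this term would then enter the critic loss (presumably weighted by the Lagrange multiplier) rather than the actor loss, per the legend at the bottom of this issue.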
We replaced the alpha_3 * r term in the equation with a small alpha_3 term (0.0001). This was done to check if this term is vital. From the results above, we can see that the algorithm achieves the same performance even without this term.

The alpha_3*r term makes sure that the algorithm is stable in mean cost. It incorporates some extra information about the system, which we try to exploit by making our Lyapunov stability definition more strict. Since the algorithm is also robust without this information, dropping it increases the practical relevance of our algorithm: such information might not yet be available for all systems, or the problem might be too hard when using this stricter Lyapunov stability. As researchers, we can use any of the Lyapunov stability criteria for our algorithm: (strict) asymptotic stability, exponential stability, (strict) asymptotic stability in mean cost, etc.

@panweihit pointed me to a very insightful MIT course given by Dr. Russ Tedrake. This course explains that, as long as your reward is Lyapunov stable (i.e. has a decreasing derivative), the system also learns stable and robust behaviour. I haven't watched the full lecture yet, so I will update the explanation below later. But here is my current understanding:
This conclusion implies that we don't need to design very complicated stability measures for our robot tasks. A reward that makes sure the robot doesn't fall is good enough to ensure stability and robustness. Take Boston Dynamics' Spot robot as an example. In this case, we don't need a cost function that exploits complicated theoretical stability measures, like the zero-moment point or the COM lying vertically inside the convex hull of the contact points, to achieve stable behaviour. According to Dr. Russ Tedrake, using such knowledge is merely a bonus. A simpler cost function, like the perpendicular distance between the robot's COM and the reference path, already implicitly encodes the stability: if the robot cannot track this path, it has fallen, so it is learning stable behaviour when our Lyapunov values are always decreasing. This greatly increases how practical our algorithm is, since we can now use it to learn stable/robust behaviour even when theoretical knowledge about the system's stability is not available. For systems where we do have such knowledge, we can use it to get an additional bonus.
Currently, I'm finishing several experiments to:
I am further adding a value network to the LAC algorithm so that we can later replace it with a Gaussian process. Replacing it with a Gaussian process makes sense since this allows some stochasticity in the value function, making it easier for the agent to learn stable behaviour. The argument is similar in nature to why SAC uses a Gaussian actor instead of a deterministic one; here we use a stochastic value function instead of a deterministic one. We use a Gaussian process instead of a Gaussian network since the value function is convex in nature. Because of this, @panweihit and I agreed that a Gaussian process would be well able to capture this behaviour while keeping the algorithm simple. Your Gaussian process will replace the value network of the new LAC algorithm (I will create this algorithm based on the second version of SAC).
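To make this concrete, below is a minimal sketch of what adding such a value network could look like, loosely following the value-function formulation of the original SAC paper (value target V(s) = Q(s, a~π) − α·log π(a|s)). The class name, layer sizes, and helper function are my assumptions and not the actual implementation in the repository.

```python
import torch
import torch.nn as nn


class ValueNetwork(nn.Module):
    """Simple MLP state-value function V(s) (architecture is an assumption)."""

    def __init__(self, obs_dim, hidden=(256, 256)):
        super().__init__()
        layers, in_dim = [], obs_dim
        for h in hidden:
            layers += [nn.Linear(in_dim, h), nn.ReLU()]
            in_dim = h
        layers.append(nn.Linear(in_dim, 1))
        self.v = nn.Sequential(*layers)

    def forward(self, obs):
        return self.v(obs).squeeze(-1)


def value_loss(value_net, obs, q_vals, log_pi, alpha):
    """MSE loss towards the soft value target V_target = Q(s, a~pi) - alpha * log pi(a|s)."""
    v = value_net(obs)
    v_target = (q_vals - alpha * log_pi).detach()  # no gradient through the target
    return ((v - v_target) ** 2).mean()
```

The Gaussian process would then take the place of this ValueNetwork, with the added benefit of an uncertainty estimate.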
The next steps for creating the GPL algorithm, therefore, are as follows:
@panweihit slightly modified the Lyapunov constraint such that the minimum Lyapunov value is now used in the Lyapunov constraint:
l_delta = torch.mean(lya_l_.min() - l1.detach() + self._alpha3 * r) # See Han eq. 11
We removed the alpha_3*r term from the Lyapunov constraint:
self._alpha3 = 0.000001 # Small quadratic regulator term to ensure negative definiteness. Without it the derivative can be negative semi definite.
l_delta = torch.mean(lya_l_.min() - l1.detach() + self._alpha3) # See Han eq. 11
The LAC algorithm trains fine without this.
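For reference, here is a hedged side-by-side sketch of the three constraint variants discussed above (the original Han et al. 2020 constraint, the minimum-target variant, and the constant-alpha3 variant). The placeholder tensors and the alpha3 values are only there to make the comparison runnable and are not the values used in the experiments.

```python
import torch

# Placeholder tensors, purely to make the comparison below runnable.
lya_l_ = torch.rand(64)  # Lyapunov targets for the next state under the current policy
l1 = torch.rand(64)      # Lyapunov values for the current state-action pair
r = torch.rand(64)       # cost of the current transition
alpha3 = 0.1             # example weight for the cost term

# 1. Original constraint (Han et al. 2020, eq. 11): asymptotic stability in mean cost.
l_delta = torch.mean(lya_l_ - l1.detach() + alpha3 * r)

# 2. Minimum-target variant (cf. LAC7 in the legend): only the smallest
#    Lyapunov target is used to check for a constraint violation.
l_delta_min = torch.mean(lya_l_.min() - l1.detach() + alpha3 * r)

# 3. Constant variant (cf. LAC8 in the legend): the alpha3 * r term is dropped
#    and a small constant keeps the required decrease strictly negative,
#    giving a softer but more generally applicable stability notion.
alpha3_small = 1e-6
l_delta_const = torch.mean(lya_l_.min() - l1.detach() + alpha3_small)
```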
Yes, I agree with you. The Gaussian process value function is finished, but I ran into a problem with using the GP value function directly in place of the value network. Because the Gaussian process depends on the temporal sequence seen during training, it would need a Monte-Carlo update instead of a temporal-difference (TD) update. I am having trouble with this; we can talk about it tomorrow.
@dds0117 Good point. I wasn't aware that it was a Monte-Carlo method.
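To illustrate the Monte-Carlo point, here is a minimal sketch of fitting a GP value function on full-episode (Monte-Carlo) cost-to-go targets instead of bootstrapped TD targets. It uses scikit-learn's GaussianProcessRegressor purely for illustration; the data, kernel choice, and function names are assumptions and not part of the repository.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel


def monte_carlo_returns(costs, gamma=0.99):
    """Discounted cost-to-go for one finished episode (Monte-Carlo targets)."""
    returns, g = np.zeros(len(costs)), 0.0
    for t in reversed(range(len(costs))):
        g = costs[t] + gamma * g
        returns[t] = g
    return returns


# Hypothetical episode data: observations and per-step costs.
states = np.random.randn(200, 4)  # e.g. CartPoleCost-like observations
costs = np.random.rand(200)

# The GP is (re)fitted on complete trajectories, which is why a Monte-Carlo
# update is needed instead of a temporal-difference update.
gp_value = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), normalize_y=True)
gp_value.fit(states, monte_carlo_returns(costs))

# The GP provides both a value estimate and an uncertainty for new states.
v_mean, v_std = gp_value.predict(states[:5], return_std=True)
```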
@panweihit, @dds0117 Here is the new model that was trained for the robustness eval of the cart_pole.
See also https://rickstaa.github.io/bayesian-learning-control/control/eval_robustness.html.
pip install -e .
python -m bayesian_learning_control.run eval_robustness ~/Development/work/bayesian-learning-control/data/lac4_cart_pole_cost/lac4_cart_pole_cost_s1250 --disturbance_type=input
To change the disturbance, change the magnitude inside the DISTURBER_CFG variable in the https://github.com/rickstaa/simzoo/blob/c0f32230f68b7f0353412a848d8b8598cd82d21c/simzoo/common/disturber.py#L61 file.
@panweihit, @dds0117 For future reference, here is a small summary of what we found out in our experiments yesterday:
As we discussed, I think the main takeaway is that, when we implement the Gaussian version of the LAC algorithm, it should be able to work as long as the function approximator, a (deep) Gaussian process, is big enough to capture the complexity of the system.
Closed since there are more important things to do first.
User story
As discussed in the meeting, we want to implement the GPL agent. @panweihit @dds0117 In this report, I will track the progress of this new algorithm.
Steps
[x] 1. Check if the regular SAC and LAC algorithms train on the CartPole environments.
[x] 2. Remove the Lyapunov constraint and check if the agent can train on the CartPoleCost environment (lac2).
[x] 3. Move Lyapunov constraint from the actor loss to the critic loss function (lac4).
[ ] 4. Replace the Q values in the critic loss with the actual value function.
[ ] 5. Approximate the value function with a Gaussian Process.
LAC versions legend
LAC: Regular LAC.
LAC2: LAC without any Lyapunov constraint (similar to SAC but with a squared output activation).
LAC3: LAC but now with the double-Q trick added.
LAC4: LAC but now the Lyapunov constraint is added to the critic loss instead of the actor loss.
LAC5: LAC but now we also add the entropy regularization term to the critic (more theoretically correct).
LAC6: LAC but now the Lagrange multipliers are optimized before they are used to optimize the critic and actor.
LAC7: LAC but now we use the minimum Lyapunov target in the Lyapunov constraint.
LAC8: LAC but now we replace the strict asymptotic stability in mean cost with general asymptotic stability.
SAC: Regular SAC.
SAC2: SAC but without the double-Q trick.
SAC3: SAC but now it uses v1 of Haarnoja et al. 2019.
SAC4: SAC but now it uses v2 of Haarnoja et al. 2019.