Let's first test the training performance of the following LAC versions in the CartPoleCost environment:
Let's also quickly investigate the following SAC versions:
Experiment file: experiments/gpl_2021/lac_cart_pole_cost.yml
As we already know, LAC works.
Experiment file: experiments/gpl_2021/sac_cart_pole_cost.yml
As we already know, SAC also performs well on the CartPoleCost environment.
Experiment file: experiments/gpl_2021/sac2_cart_pole_cost.yml
Seems to work fine.
Experiment file: experiments/gpl_2021/lac2_cart_pole_cost.yml
Also works.
Experiment file: experiments/gpl_2021/lac3_cart_pole_cost.yml
Also works.
Experiment file: experiments/gpl_2021/lac4_cart_pole_cost.yml
Also works, but after this first test the performance looks worse. This could also be due to random factors.
Experiment file: experiments/gpl_2021/lac5_cart_pole_cost.yml
Works as expected.
Experiment file: experiments/gpl_2021/lac6_cart_pole_cost.yml
Works as expected.
All algorithms are able to train. For simplicity, let's first work with LAC4; we can make the other changes later. For this algorithm, we should compare its robustness against disturbances with that of the original LAC algorithm.
Seems to work fine.
Seems to give the same results as the original LAC.
As in Han et al. 2020, the robustness is lower than that of the LAC algorithm. Related to this, the algorithm also has a higher death rate.
Seems to give the same results as the original LAC.
The alpha3*R term can be dropped and a simple alpha3 term used instead. This results in a softer version of Lyapunov stability (the derivative is less negative), but this version can be used to make any cost function stable in the sense of Lyapunov (more practical).

@panweihit Let's evaluate the new LAC4 and compare it with SAC for multiple environments, but now let it train for 1e6 steps:
Good performance; it looks better than SAC but worse than LAC4.
Performance and robustness look better than LAC (this could still be a seeding effect). It also looks better than SAC.
Performance and robustness look worse than both LAC versions.
Training for 3e5 steps looks to be enough in the Oscillator-v1 environment.

@dds0117 I had a meeting with @panweihit yesterday to discuss the results of the tests above and the continuation of our research. Below you will find the notes from that meeting.
The Lyapunov constraint was changed from

l_delta = torch.mean(lya_l_ - l1.detach() + self._alpha3 * r)  # See Han eq. 11

to

l_delta = torch.mean(lya_l_.min() - l1.detach() + self._alpha3 * r)  # See Han eq. 11
In this new version, only the minimum Lyapunov value of the current best action is used to check if the constraint is violated.
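For clarity, here is a minimal, self-contained sketch of this constraint term as a helper function. The tensor names follow the snippet above; the function wrapper and the shape/meaning comments are assumptions on my part, not the repository's actual implementation.

```python
import torch


def lyapunov_constraint(lya_l_, l1, r, alpha3):
    """Sketch of the modified Lyapunov constraint term (cf. Han et al. 2020, eq. 11).

    Assumed inputs (names follow the snippet above):
      lya_l_ : Lyapunov critic values for the next state under the current policy.
      l1     : Lyapunov critic values for the current state-action pair.
      r      : cost received for the current transition.
      alpha3 : small positive constant weighting the cost term.
    """
    # `.min()` keeps only the smallest Lyapunov target, so the required decrease is
    # checked against the best candidate instead of the batch mean.
    # `l1` is detached so gradients only flow through the Lyapunov targets.
    return torch.mean(lya_l_.min() - l1.detach() + alpha3 * r)
```

In LAC4 this term would then enter the critic loss (presumably weighted by the Lagrange multiplier) rather than the actor loss, per the legend at the bottom of this issue.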
We replaced the alpha_3 * r term in the equation with a small alpha_3 term (0.0001). This was done to check if this term is vital. From the results above, we can see that the algorithm achieves the same performance even without this term.

The alpha_3*r term makes sure that the algorithm is stable in mean cost. It incorporates some extra information about the system, which we try to exploit by making our Lyapunov stability definition more strict. Since the algorithm is also robust without this information, dropping it increases the practical relevance of our algorithm: such information might not yet be available for all systems, or the problem might be too hard when using this stricter Lyapunov stability. As researchers, we can use any of the Lyapunov stability criteria for our algorithm: (strict) asymptotic stability, exponential stability, (strict) asymptotic stability in mean cost, etc.

@panweihit pointed me to a very insightful MIT course given by Dr. Russ Tedrake. This course explains that, as long as your reward is Lyapunov stable (i.e. has a decreasing derivative), the system also learns stable and robust behaviour. I haven't watched the full lecture yet, so I will update the explanation below later. But here is my current understanding:
This conclusion implies that we don't need to design very complicated stability measures for our robot tasks. A reward that makes sure the robot doesn't fall is good enough to ensure stability and robustness. Take Boston Dynamics' Spot robot as an example. In this case, we don't need a cost function that exploits complicated theoretical stability measures, like the zero-moment point or the COM lying vertically inside the convex hull of the contact points, to achieve stable behaviour. According to Dr. Russ Tedrake, using such knowledge is merely a bonus. A simpler cost function, like the perpendicular distance between the robot's COM and the reference path, already implicitly encodes the stability: if the robot cannot track this path, it has fallen, so it is learning stable behaviour when our Lyapunov values are always decreasing. This greatly increases how practical our algorithm is, since we can now use it to learn stable/robust behaviour even when theoretical knowledge about the system's stability is not available. For systems where we do have such knowledge, we can use it to get an additional bonus.
Currently, I'm finishing several experiments to:
I am further adding a value network to the LAC algorithm so that we can later replace it with a Gaussian process. Replacing it with a Gaussian process makes sense since this allows some stochasticity in the value function, making it easier for the agent to learn stable behaviour. The argument is similar in nature to why SAC uses a Gaussian actor instead of a deterministic one; here we use a stochastic value function instead of a deterministic one. We use a Gaussian process instead of a Gaussian network since the value function is convex in nature. Because of this, @panweihit and I agreed that a Gaussian process would be well able to capture this behaviour while keeping the algorithm simple. Your Gaussian process will replace the value network of the new LAC algorithm (I will create this algorithm based on the second version of SAC).
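To make this concrete, below is a minimal sketch of what adding such a value network could look like, loosely following the value-function formulation of the original SAC paper (value target V(s) = Q(s, a~π) − α·log π(a|s)). The class name, layer sizes, and helper function are my assumptions and not the actual implementation in the repository.

```python
import torch
import torch.nn as nn


class ValueNetwork(nn.Module):
    """Simple MLP state-value function V(s) (architecture is an assumption)."""

    def __init__(self, obs_dim, hidden=(256, 256)):
        super().__init__()
        layers, in_dim = [], obs_dim
        for h in hidden:
            layers += [nn.Linear(in_dim, h), nn.ReLU()]
            in_dim = h
        layers.append(nn.Linear(in_dim, 1))
        self.v = nn.Sequential(*layers)

    def forward(self, obs):
        return self.v(obs).squeeze(-1)


def value_loss(value_net, obs, q_vals, log_pi, alpha):
    """MSE loss towards the soft value target V_target = Q(s, a~pi) - alpha * log pi(a|s)."""
    v = value_net(obs)
    v_target = (q_vals - alpha * log_pi).detach()  # no gradient through the target
    return ((v - v_target) ** 2).mean()
```

The Gaussian process would then take the place of this ValueNetwork, with the added benefit of an uncertainty estimate.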
The next steps for creating the GPL algorithm, therefore, are as follows:
@panweihit slightly modified the Lyapunov constraint such that the minimum Lyapunov value is now used in the Lyapunov constraint:
l_delta = torch.mean(lya_l_.min() - l1.detach() + self._alpha3 * r) # See Han eq. 11
We removed the alpha_3*r term from the Lyapunov constraint:
self._alpha3 = 0.000001 # Small quadratic regulator term to ensure negative definiteness. Without it the derivative can be negative semi definite.
l_delta = torch.mean(lya_l_.min() - l1.detach() + self._alpha3) # See Han eq. 11
The LAC algorithm trains fine without this.
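For reference, here is a hedged side-by-side sketch of the three constraint variants discussed above (the original Han et al. 2020 constraint, the minimum-target variant, and the constant-alpha3 variant). The placeholder tensors and the alpha3 values are only there to make the comparison runnable and are not the values used in the experiments.

```python
import torch

# Placeholder tensors, purely to make the comparison below runnable.
lya_l_ = torch.rand(64)  # Lyapunov targets for the next state under the current policy
l1 = torch.rand(64)      # Lyapunov values for the current state-action pair
r = torch.rand(64)       # cost of the current transition
alpha3 = 0.1             # example weight for the cost term

# 1. Original constraint (Han et al. 2020, eq. 11): asymptotic stability in mean cost.
l_delta = torch.mean(lya_l_ - l1.detach() + alpha3 * r)

# 2. Minimum-target variant (cf. LAC7 in the legend): only the smallest
#    Lyapunov target is used to check for a constraint violation.
l_delta_min = torch.mean(lya_l_.min() - l1.detach() + alpha3 * r)

# 3. Constant variant (cf. LAC8 in the legend): the alpha3 * r term is dropped
#    and a small constant keeps the required decrease strictly negative,
#    giving a softer but more generally applicable stability notion.
alpha3_small = 1e-6
l_delta_const = torch.mean(lya_l_.min() - l1.detach() + alpha3_small)
```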
Yes, I agree with you. The Gaussian process value function is finished, but I ran into a problem with using the GP value function directly in place of the value network. Because the Gaussian process depends on the temporal sequence seen during training, it would need a Monte-Carlo update instead of a temporal-difference (TD) update. I am having trouble with this; we can talk about it tomorrow.
@dds0117 Good point. I wasn't aware that it was a Monte-Carlo method.
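To illustrate the Monte-Carlo point, here is a minimal sketch of fitting a GP value function on full-episode (Monte-Carlo) cost-to-go targets instead of bootstrapped TD targets. It uses scikit-learn's GaussianProcessRegressor purely for illustration; the data, kernel choice, and function names are assumptions and not part of the repository.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel


def monte_carlo_returns(costs, gamma=0.99):
    """Discounted cost-to-go for one finished episode (Monte-Carlo targets)."""
    returns, g = np.zeros(len(costs)), 0.0
    for t in reversed(range(len(costs))):
        g = costs[t] + gamma * g
        returns[t] = g
    return returns


# Hypothetical episode data: observations and per-step costs.
states = np.random.randn(200, 4)  # e.g. CartPoleCost-like observations
costs = np.random.rand(200)

# The GP is (re)fitted on complete trajectories, which is why a Monte-Carlo
# update is needed instead of a temporal-difference update.
gp_value = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), normalize_y=True)
gp_value.fit(states, monte_carlo_returns(costs))

# The GP provides both a value estimate and an uncertainty for new states.
v_mean, v_std = gp_value.predict(states[:5], return_std=True)
```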
@panweihit, @dds0117 Here is the new model that was trained for the robustness eval of the cart_pole.
See also https://rickstaa.github.io/bayesian-learning-control/control/eval_robustness.html.
pip install -e .
python -m bayesian_learning_control.run eval_robustness ~/Development/work/bayesian-learning-control/data/lac4_cart_pole_cost/lac4_cart_pole_cost_s1250 --disturbance_type=input
To change the disturbance, change the magnitude inside the DISTURBER_CFG variable in the https://github.com/rickstaa/simzoo/blob/c0f32230f68b7f0353412a848d8b8598cd82d21c/simzoo/common/disturber.py#L61 file.
@panweihit, @dds0117 For future reference, here is a small summary of what we found out in our experiments yesterday:
As we discussed, I think the main takeaway is that, when we implement the Gaussian version of the LAC algorithm, it should be able to work as long as the function approximator, a (deep) Gaussian process, is big enough to capture the complexity of the system.
Closed since there are more important things to do first.
User story
As discussed in the meeting, we want to implement the GPL agent. @panweihit @dds0117 In this report, I will track the progress of this new algorithm.
Steps
[x] 1. Check if the regular SAC and LAC algorithms train on the CartPole environments.
[x] 2. Remove the Lyapunov constraint and check if the agent can train on the CartPoleCost environment (lac2).
[x] 3. Move Lyapunov constraint from the actor loss to the critic loss function (lac4).
[ ] 4. Replace the Q values in the critic loss with the actual value function.
[ ] 5. Approximate the value function with a Gaussian Process.
LAC versions legend
LAC: Regular LAC.
LAC2: LAC without any Lyapunov constraint (similar to SAC but with a squared output activation).
LAC3: LAC but now with the double-Q trick added.
LAC4: LAC but now the Lyapunov constraint is added to the critic loss instead of the actor loss.
LAC5: LAC but now we also add the entropy regularization term to the critic (more theoretically correct).
LAC6: LAC but now the Lagrange multipliers are optimized before they are used to optimize the critic and actor.
LAC7: LAC but now we use the minimum Lyapunov target in the Lyapunov constraint.
LAC8: LAC but now we replace the strict asymptotic stability in mean cost with general asymptotic stability.
SAC: Regular SAC.
SAC2: SAC but without the double-Q trick.
SAC3: SAC but now it uses v1 of Haarnoja et al. 2019.
SAC4: SAC but now it uses v2 of Haarnoja et al. 2019.