nicklashansen / tdmpc2

Code for "TD-MPC2: Scalable, Robust World Models for Continuous Control"
https://www.tdmpc2.com
MIT License

Shaky actions with custom environment #26

Open Dobid opened 2 months ago

Dobid commented 2 months ago

Hello, thanks for open-sourcing your code!

I'm currently working on making TD-MPC2 control the orientation of a fixed-wing UAV. I've noticed that the resulting actions oscillate a lot, almost alternating very fast between -1 and 1. While this yields good overall tracking error on my orientation angles, such a policy isn't very energy efficient, which matters for a future port to a real UAV.

Did you encounter this kind of problem with your environments?

nicklashansen commented 2 months ago

This is a common challenge with RL policies unfortunately, and TD-MPC2 is no exception. A few things that I have found to help empirically when deploying our learned policies on real hardware or in other domains where smooth behavior is desirable:

  1. Demonstrations or offline data that bias the initial data distribution towards your desired behavior (example: MoDem-V2, https://arxiv.org/abs/2309.14236).
  2. A reward penalty on large actions, which reduces unnecessary oscillation but can be difficult to tune.
  3. A low-pass filter applied as an intermediate layer between policy and hardware; this should work as long as it is applied during both training and inference (see the sketch below).

I hope this helps!
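A minimal sketch of the filtering idea in item 3, assuming a Gymnasium-style environment; the wrapper name and the smoothing coefficient `alpha` are illustrative, not part of TD-MPC2:

```python
import gymnasium as gym
import numpy as np


class LowPassActionWrapper(gym.ActionWrapper):
    """Exponentially smooth actions before they reach the plant:
    a_filt_t = alpha * a_t + (1 - alpha) * a_filt_{t-1}.
    Apply the same wrapper during training and inference so the learned
    world model and policy see the filtered action space."""

    def __init__(self, env, alpha=0.3):
        super().__init__(env)
        self.alpha = alpha
        self._prev = None

    def reset(self, **kwargs):
        # Forget the filter state at the start of every episode.
        self._prev = None
        return self.env.reset(**kwargs)

    def action(self, action):
        action = np.asarray(action, dtype=np.float32)
        if self._prev is None:
            self._prev = action
        self._prev = self.alpha * action + (1.0 - self.alpha) * self._prev
        return np.clip(self._prev, self.action_space.low, self.action_space.high)
```

The reward penalty in item 2 can be as simple as subtracting a small multiple of the squared action norm from the task reward, with the coefficient tuned per task.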

Dobid commented 2 months ago

I see, thanks a lot for the references.

In the meantime, I noticed that when I train with cfg.mpc=false, so that only the policy prior network is used to select actions, I achieve satisfactory results in terms of both tracking error and action oscillation without modifying my reward function at all. So I have two questions:

  1. What is the "pure" RL algorithm in that case? The policy update resembles the SAC one; can I assume I'm using SAC when setting the cfg.mpc flag to false?
  2. At evaluation time, when I set cfg.mpc=true to use the MPPI planning layer on top of the networks learned by an "RL only" training, I get oscillatory actions again. Is MPPI planning known for injecting noise on top of a non-oscillating RL policy prior?

Have a good day!

nicklashansen commented 2 months ago
  1. Afaik there is no direct model-free equivalent at the moment. The policy update is similar but not equivalent to SAC, same goes for the value update. Whether you benefit from planning or not is highly problem dependent; I find that planning helps the most when action spaces are very high-dimensional!
  2. The learned policy is a Gaussian policy parameterized by a single neural network, so output actions tend to be highly correlated in time. MPPI uses sampling which may produce comparably more diverse (or "oscillating") actions every time step since we do not constrain the sampled actions at time t+1 to be close to those at time t. And we do not constrain planned actions to be "close" to the policy prior either, since we may not know a priori whether the policy is a good prior or not (the MoDem-V2 reference above is a good example of this).

I hope this helps!
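To make the sampling contrast in (2) concrete, here is an illustrative snippet (shapes and values are made up, and this is not the repository's planner code): MPPI-style candidates draw independent noise at every timestep, whereas the policy prior maps consecutive, similar latent states through the same network and therefore tends to output similar action distributions.

```python
import torch

horizon, num_samples, action_dim = 3, 512, 4

# MPPI-style candidates: the noise at step t+1 is independent of the noise at
# step t, so consecutive actions within one sampled sequence can jump anywhere
# inside the action bounds.
mean = torch.zeros(horizon, 1, action_dim)       # mean sequence carried over between iterations
std = 0.5 * torch.ones(horizon, 1, action_dim)   # per-step standard deviation
candidates = (mean + std * torch.randn(horizon, num_samples, action_dim)).clamp(-1, 1)

# Policy-prior rollout (schematic): a_t ~ N(mu_theta(z_t), sigma_theta(z_t)) with
# z_{t+1} = d_theta(z_t, a_t); since z_t usually changes gradually, the network
# tends to produce similar action distributions at consecutive steps.
```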

Dobid commented 2 months ago

Thanks for your insightful reply. However, I'd like to clarify the concept of time correlation and its impact on the output actions.

The learned policy is a Gaussian policy parameterized by a single neural network, so output actions tend to be highly correlated in time.

Just to make sure I got your point correctly: isn't time correlation of actions a property of MDPs and dynamical systems in general? The way you phrased this sentence makes me think you're saying that the policy being a network predicting the parameters of a Gaussian distribution is what leads to time-correlated actions. This logical link doesn't seem trivial to me; would you care to elaborate?

  1. How do time-correlated actions induce "smooth" action outputs? I've tried PPO on my problem and, despite PPO's policy also being a parameterized Gaussian, I keep running into this oscillatory policy. So I would conclude that the emergence of such oscillating behaviour is not only related to time correlation... I'm curious about what makes your learning algorithm output smooth actions. I would probably need to try vanilla SAC on my problem first, since it's the closest model-free algorithm to yours, and see if I can reproduce the same non-oscillatory behaviour.

  2. Do you think constraining the sampled actions used to create the imagined trajectories would be beneficial, or do you see any pitfalls? Is the time-uncorrelated nature of actions necessary for MPPI to work? It could probably be detrimental to exploration, I guess.

Thanks for your time!

nicklashansen commented 2 months ago

The way you phrased this sentence makes me think you're saying that the policy being a network predicting the parameters of a Gaussian distribution is what leads to time-correlated actions.

I think the argument here is that neural networks tend to produce similar outputs (in this case Gaussian distributions) for similar inputs, at least compared to a sampling approach (MPPI).

I've tried PPO on my problem and, despite PPO's policy also being a parameterized Gaussian, I keep running into this oscillatory policy. So I would conclude that the emergence of such oscillating behaviour is not only related to time correlation... I'm curious about what makes your learning algorithm output smooth actions. I would probably need to try vanilla SAC on my problem first, since it's the closest model-free algorithm to yours, and see if I can reproduce the same non-oscillatory behaviour.

Afaik all RL algorithms (TD-MPC, PPO, SAC, DDPG etc.) have this problem unless you explicitly mitigate it.

Do you think constraining the sampled actions used to create the imagined trajectories would be beneficial, or do you see any pitfalls?

You can try :-) Constraining the sampling might change the solution space / decrease reward but that's kind of intentional in that case!
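If you want to experiment with that, here is a minimal, hypothetical sketch of one way to constrain the sampled trajectories (not something TD-MPC2 does out of the box; the rate limit `max_delta` is illustrative): each sampled action is clipped to stay within a per-step bound of the previous action in the sequence.

```python
import torch


def rate_limited_samples(mean, std, max_delta=0.2, prev_action=None):
    """Sample one action sequence whose step-to-step changes are bounded.

    mean, std: (horizon, action_dim) tensors from the planner's current Gaussian;
    max_delta bounds |a_{t+1} - a_t| per action dimension."""
    horizon, action_dim = mean.shape
    actions = torch.empty(horizon, action_dim)
    prev = prev_action if prev_action is not None else mean[0]
    for t in range(horizon):
        a = (mean[t] + std[t] * torch.randn(action_dim)).clamp(-1, 1)
        # Keep the sample within max_delta of the previous action, then re-clip
        # to the action bounds.
        a = torch.clamp(a, prev - max_delta, prev + max_delta).clamp(-1, 1)
        actions[t] = a
        prev = a
    return actions
```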

Dobid commented 1 month ago

Thank you for your explanations! Two extra questions:

  1. I'm just trying to get a good grasp on the algorithm, and I'm wondering why having a neural net $R_\theta$ estimate the reward function is beneficial. The reward is designed by the user and therefore has a closed form, so I don't really see the idea behind it compared to using the ground-truth reward function.

  2. This is a bit off-topic, so let me know if you prefer that I open a new issue. In your implementation of the planner, I noticed that you select the action to perform by sampling one action sequence from elite_actions according to their respective scores, and then taking the first action of that sampled sequence. Whereas in the paper, the action is sampled from the Gaussian distribution whose mean and std come directly from the empirical estimate update rule. Intuitively, both methods appear similar and make sense. So why should action sampling differ depending on whether we're selecting the action to perform or sampling actions for the imagined rollouts?

https://github.com/nicklashansen/tdmpc2/blob/5f6fadec0fec78304b4b53e8171d348b58cac486/tdmpc2/tdmpc2.py#L164-L171
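For reference, a paraphrased sketch of the two variants being compared (simplified and hypothetical, not the repository's exact code): the implementation picks one elite sequence with probability proportional to its score and executes its first action, while the paper's description samples the first action from the refitted Gaussian.

```python
import torch


def select_action(elite_actions, score, mean, std, sample_elite=True):
    """elite_actions: (horizon, num_elites, action_dim); score: (num_elites,), non-negative;
    mean, std: (horizon, action_dim) from the empirical update rule."""
    if sample_elite:
        # Implementation-style selection (as described above): sample one elite
        # trajectory in proportion to its score and return its first action.
        idx = torch.multinomial(score, 1).item()
        return elite_actions[0, idx]
    # Paper-style selection: sample the first action directly from the refitted
    # Gaussian N(mean_0, std_0^2).
    return (mean[0] + std[0] * torch.randn_like(std[0])).clamp(-1, 1)
```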