hill-a / stable-baselines

A fork of OpenAI Baselines, implementations of reinforcement learning algorithms
http://stable-baselines.readthedocs.io/
MIT License

[question] RL algorithms for continuous action spaces (DDPG and TD3) unable to eliminate steady-state error in custom environment. #963

Open wilsonsamarques opened 4 years ago

wilsonsamarques commented 4 years ago

Hi, I'm using the Stable Baselines implementation of TD3 (previously I was using DDPG) to solve a control problem, the attitude control of a satellite. I created a custom environment following the OpenAI Gym standard and the simulation runs just fine. The algorithm converges and the trained model stabilizes the system very well (even when I change the inertia of the satellite); it even adapts better than a classical PD controller, which is exactly the result I'm looking for. However, the trained neural network is incapable of eliminating the steady-state error, as shown in the figure below.

[Figure: pointing_error plot, showing the remaining steady-state error]
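For reference, a stripped-down sketch of the kind of environment I'm describing (the dynamics, gains and reward below are simplified placeholders, not my actual satellite model):

```python
import gym
import numpy as np
from gym import spaces


class SatelliteAttitudeEnv(gym.Env):
    """Toy single-axis attitude environment (placeholder dynamics, not the real model)."""

    def __init__(self, dt=0.1, inertia=1.0):
        super(SatelliteAttitudeEnv, self).__init__()
        self.dt = dt
        self.inertia = inertia
        # Observation: [pointing error (rad), angular rate (rad/s)]
        self.observation_space = spaces.Box(low=-np.inf, high=np.inf, shape=(2,), dtype=np.float32)
        # Action: torque command, normalized to [-1, 1]
        self.action_space = spaces.Box(low=-1.0, high=1.0, shape=(1,), dtype=np.float32)
        self.state = np.zeros(2, dtype=np.float32)

    def reset(self):
        # Start from a random attitude error with zero angular rate
        self.state = np.array([np.random.uniform(-0.5, 0.5), 0.0], dtype=np.float32)
        return self.state.copy()

    def step(self, action):
        error, rate = self.state
        torque = float(np.clip(action[0], -1.0, 1.0))
        # Double-integrator dynamics: torque -> angular acceleration -> rate -> angle
        rate = rate + (torque / self.inertia) * self.dt
        error = error + rate * self.dt
        self.state = np.array([error, rate], dtype=np.float32)
        # Penalize pointing error and control effort (one of the cost variants I tried)
        reward = -(error ** 2 + 0.01 * torque ** 2)
        done = bool(abs(error) > np.pi)
        return self.state.copy(), reward, done, {}
```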

I would like to know if you have had a similar problem when applying these algorithms to control a dynamic system, and whether you could give me any tips on the proper use of the algorithms. I have tested different variations of the cost function and even run a hyperparameter search with Optuna, as you recommend in the docs, but I still could not eliminate this steady-state error.
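The hyperparameter search looked roughly like this (a simplified sketch; the search ranges, the budget, and the `SatelliteAttitudeEnv` from the sketch above are just examples):

```python
import numpy as np
import optuna
from stable_baselines import TD3


def evaluate(model, env, n_episodes=5, max_steps=500):
    # Average undiscounted return over a few evaluation episodes
    returns = []
    for _ in range(n_episodes):
        obs, total = env.reset(), 0.0
        for _ in range(max_steps):
            action, _ = model.predict(obs, deterministic=True)
            obs, reward, done, _ = env.step(action)
            total += reward
            if done:
                break
        returns.append(total)
    return float(np.mean(returns))


def objective(trial):
    env = SatelliteAttitudeEnv()  # the toy environment sketched above
    model = TD3(
        "MlpPolicy",
        env,
        gamma=trial.suggest_float("gamma", 0.95, 0.999),
        learning_rate=trial.suggest_float("learning_rate", 1e-5, 1e-3, log=True),
        batch_size=trial.suggest_categorical("batch_size", [64, 128, 256]),
        tau=trial.suggest_float("tau", 0.001, 0.02),
        verbose=0,
    )
    model.learn(total_timesteps=50000)
    return evaluate(model, env)


study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=20)
print(study.best_params)
```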

Any advice is welcome. :-)

araffin commented 4 years ago

Hello,

However, the trained neural network is incapable of eliminating the steady-state error, as shown in the figure below.

Correct me if I'm wrong, but a PD controller cannot remove the steady-state error (unless you have a PID), so I would include a history (or something similar) of observations/actions as input (we have a wrapper for that in the RL Zoo) to give the controller the same information a PID would have.

See https://github.com/DLR-RM/rl-baselines3-zoo/blob/master/utils/wrappers.py#L134 (from the SB3 zoo).

(don't look at the docstring, it was copy-pasted from the delayed reward wrapper)
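In simplified form, the idea of that wrapper is to stack the last few observations and actions into the observation (a sketch of the concept, not the zoo's exact code):

```python
import gym
import numpy as np
from gym import spaces


class HistoryWrapper(gym.Wrapper):
    """Stacks the last `horizon` observations and actions into one flat observation."""

    def __init__(self, env, horizon=2):
        super(HistoryWrapper, self).__init__(env)
        self.horizon = horizon
        obs_dim = env.observation_space.shape[0]
        act_dim = env.action_space.shape[0]
        size = horizon * (obs_dim + act_dim)
        self.observation_space = spaces.Box(
            low=-np.inf, high=np.inf, shape=(size,), dtype=np.float32
        )
        self.obs_history = np.zeros((horizon, obs_dim), dtype=np.float32)
        self.act_history = np.zeros((horizon, act_dim), dtype=np.float32)

    def _get_obs(self):
        return np.concatenate([self.obs_history.ravel(), self.act_history.ravel()])

    def reset(self, **kwargs):
        obs = self.env.reset(**kwargs)
        self.obs_history[:] = 0.0
        self.act_history[:] = 0.0
        self.obs_history[-1] = obs
        return self._get_obs()

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        # Shift the buffers and append the newest observation/action
        self.obs_history = np.roll(self.obs_history, shift=-1, axis=0)
        self.act_history = np.roll(self.act_history, shift=-1, axis=0)
        self.obs_history[-1] = obs
        self.act_history[-1] = action
        return self._get_obs(), reward, done, info
```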

wilsonsamarques commented 4 years ago

Normally yes, but the satellite already has double-integrator dynamics, so it is not necessary to add an integral term as in a classical PID; most satellites are controlled with just a PD controller. But your idea is correct: the solution should be to add some form of integral action in this case. Thanks for the suggestion! I had not seen this wrapper yet; I will try it out!
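Concretely, I am thinking of something like the following wrapper, which accumulates the pointing error and appends it to the observation (just a sketch; it assumes the error is the first observation component and a fixed timestep):

```python
import gym
import numpy as np
from gym import spaces


class IntegralErrorWrapper(gym.Wrapper):
    """Appends the time-integrated pointing error to the observation
    (assumes the error is obs[0]; adjust the index for a real environment)."""

    def __init__(self, env, dt=0.1, error_index=0):
        super(IntegralErrorWrapper, self).__init__(env)
        self.dt = dt
        self.error_index = error_index
        self.integral = 0.0
        obs_dim = env.observation_space.shape[0]
        self.observation_space = spaces.Box(
            low=-np.inf, high=np.inf, shape=(obs_dim + 1,), dtype=np.float32
        )

    def reset(self, **kwargs):
        self.integral = 0.0
        obs = self.env.reset(**kwargs)
        return np.append(obs, self.integral).astype(np.float32)

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        # Accumulate the error over time, like the I term of a PID
        self.integral += float(obs[self.error_index]) * self.dt
        return np.append(obs, self.integral).astype(np.float32), reward, done, info
```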

I have another question. I have read about others with a similar problem (for example, https://ai.stackexchange.com/questions/18567/continuous-control-with-ddpg-how-to-eliminate-steady-state-error/21435#21435), and their solution was to add an integral term to the cost function. But that does not make a lot of sense to me, because in RL we already have something like this integral term built into the cost, since we maximize the accumulated reward in each episode, right?
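For concreteness, this is how I understand their suggestion, compared with the per-step cost I use now (made-up weights, just to illustrate the difference):

```python
def reward_tracking(error, torque):
    # What I currently use: penalize the instantaneous error and control effort
    return -(error ** 2 + 0.01 * torque ** 2)


def reward_with_integral(error, torque, integral_error):
    # What that answer proposes: also penalize the accumulated (integrated) error,
    # which keeps growing unless the agent drives the error to exactly zero
    return -(error ** 2 + 0.01 * torque ** 2 + 0.1 * integral_error ** 2)
```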