simensov / ml4ca

Code base for "Dynamic Positioning using Deep Reinforcement Learning". Paper: https://www.sciencedirect.com/science/article/pii/S0029801821008398 - Thesis: https://ntnuopen.ntnu.no/ntnu-xmlui/handle/11250/2731248

about reward #18

Open lyly0309 opened 3 years ago

lyly0309 commented 3 years ago

Hi @simensov, I am very interested in your thesis, and I am also doing DRL training for an AUV. It has only two propellers installed at the stern, so it is an underactuated vessel. I want the vessel to follow a reference path pre-designed in a guidance system. The reward shape I use is the Gaussian reward shape, the same as in your paper. But after training, I found that the vessel's path is not consistent with the reference path. Do you have any suggestions? Do I need to revise my reward shape? The plot below shows the result (the yellow line is the reference path and the blue line is the trained path): Figure 2021-09-07 104458
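Roughly, the Gaussian reward shape I mean has this form (a simplified sketch only; the cross-track error variable and the width `sigma` are placeholders, not my exact implementation):

```python
import numpy as np

def gaussian_reward(cross_track_error, sigma=5.0):
    """Gaussian-shaped reward: 1.0 when the vessel is exactly on the path,
    decaying smoothly as the cross-track error (in metres) grows.
    sigma controls how quickly the reward falls off with distance."""
    return np.exp(-0.5 * (cross_track_error / sigma) ** 2)
```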

simensov commented 3 years ago

Hi! A very interesting problem indeed. I love underactuated robotics - one of the coolest subjects I ever studied. I recommend checking out the awesome lecture notes on the topic in general: https://underactuated.mit.edu/

I chose DRL for the dynamic positioning task because that problem is actually overactuated, meaning that at any given time there may be many thruster solutions that meet e.g. the force requirements. Instead of adding lots of logic to make an optimization algorithm pick a good solution, I wanted the system to "learn" it by itself - that is, by maximizing return / reward over time.

For underactuated systems, I believe that you don't really face the same issue of having to add additional constraints / criteria in order to choose between several solutions - why are you choosing DRL for this problem? Nonlinear dynamics?

One paper comes to mind that might be relevant: https://ntnuopen.ntnu.no/ntnu-xmlui/handle/11250/2596760. There, curved path following of a surface vessel was accomplished at constant speed by training a control law that steers the rudder to track a reference path. You could try a similar concept - combining the two propellers (which I assume are non-rotatable pods, since the problem is stated to be underactuated) to do differential steering, which "simulates" having a rudder. If you use constant speed, then you would only have one control output - the "rudder angle", defined by the thrust difference between the two propellers.
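A minimal sketch of what I mean by differential steering (the total-thrust value and the exact mapping below are illustrative assumptions, not code from this repo):

```python
def differential_steering(rudder_cmd, total_thrust=1.0):
    """Map a single 'rudder' action in [-1, 1] to port/starboard thrusts.
    The sum of the two thrusts stays constant (constant forward drive),
    while their difference steers the vessel like a rudder would."""
    rudder_cmd = max(-1.0, min(1.0, rudder_cmd))
    port = 0.5 * total_thrust * (1.0 - rudder_cmd)
    starboard = 0.5 * total_thrust * (1.0 + rudder_cmd)
    return port, starboard
```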

There are so many things that could affect your outcome. I would have to see a lot more information about the system setup, reward shape / implementation / timesteps taken during training etc. etc. before I can say anything - DRL is hard, and needs quite a lot of "engineering" to work properly.

Hope it helps.

simensov commented 3 years ago

Hi again,

I can see that you shared your repository with me. Unfortunately, I don't have the capacity to look through your code. Please try some of the ideas mentioned above and see if you experience improvements :)

lyly0309 commented 3 years ago

Hi @simensov, thanks for your prompt reply. It really helps me a lot.

Firstly, thanks very much for sharing the MIT lectures. I will take the time to read them.

I chose DRL simply because a conventional PID controller is not easy to tune here, and I am looking for a more advanced technique. This is also a fully nonlinear dynamic system with strong constraints from the boat's inertia, so it is very difficult to control a real boat/ship. I have read the paper you recommended (https://ntnuopen.ntnu.no/ntnu-xmlui/handle/11250/2596760) and it is very relevant. In my work, we do indeed use the two propellers for differential steering, which "simulates" having a rudder, but I don't restrict the forward speed. Maybe later I can try limiting the forward speed so that only the thrust difference between the two propellers is changed.

For your final comments:

• Do you have any plots of the development of reward over time during training? Does it seem to converge? I have had one, but since dumping the log file takes time, I normally prefer to just watch the reward during training. Truth be told, what I normally see is oscillation rather than convergence: the numbers keep oscillating WITHOUT any trend of converging.

• Do you know if the physical limitations of the thrusters might be the reason that the vessel does not reach the trajectory (is the path output by the guidance system too aggressive?) I cannot say no to this. Sometimes the guidance system gives me sharp turns that are almost impossible to follow, so for training I limit the phase angle to a certain range to avoid that kind of weird path. The most challenging thing is that the inertia seems to be a strong limitation here: once a deviation appears, the vessel never turns back towards the path.

• From your image, the actual trajectory has some similarity to the reference (indicating some result of the training?), however the error seems to become larger and larger. Have you tried to simply increase the number of timesteps taken during training (train for longer)? When I increased the number of timesteps today, I still had the same problem: the error grows larger and larger. It is so confusing!!

simensov commented 3 years ago

I believe it could be smart to limit the forward speed (or at least keep the sum of the thrust constant) so that you reduce the dimensionality of your action vector first. For comparison, in my first test for dynamic positioning I kept all the angles of the rotatable propellers constant and only had 3 action vector elements, being the thrust from each thruster in [-1, 1]. Once I saw that was working, I expanded to outputting the angles as well. Make sure you are able to solve easier tasks first, then go on to the more complex ones!
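As an illustration of what I mean by reducing the action dimensionality, here is a sketch using gym-style action spaces (the shapes and the use of `gym` here are assumptions for illustration, not code lifted from this repo):

```python
import numpy as np
from gym import spaces

# Stage 1: azimuth angles kept constant, only thrust magnitudes are learned.
# One normalised thrust command per thruster, each in [-1, 1].
reduced_action_space = spaces.Box(low=-1.0, high=1.0, shape=(3,), dtype=np.float32)

# Stage 2 (only once stage 1 works): also output the azimuth angles,
# which doubles the action dimension.
full_action_space = spaces.Box(low=-1.0, high=1.0, shape=(6,), dtype=np.float32)
```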

I would monitor the reward plots and use them to analyze what is working and what is not. Also write down a log of all your hyperparameters for each run, so that you can see a trend in what works.
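Something as simple as the sketch below is enough to spot a trend (or the lack of one) in the episode returns and to keep each run's hyperparameters next to its results (the file names and the exact set of logged values are just placeholders):

```python
import csv
import json
from collections import deque

window = deque(maxlen=100)  # moving average over the last 100 episodes

def log_episode(run_id, episode, episode_return, hyperparams):
    """Append the episode return and its moving average to a per-run CSV,
    and store the hyperparameters once so runs can be compared later."""
    window.append(episode_return)
    moving_avg = sum(window) / len(window)
    with open(f"run_{run_id}_rewards.csv", "a", newline="") as f:
        csv.writer(f).writerow([episode, episode_return, moving_avg])
    if episode == 0:  # write the settings once, at the start of the run
        with open(f"run_{run_id}_hyperparams.json", "w") as f:
            json.dump(hyperparams, f, indent=2)
```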

If the guidance system is not working as you want, then how do you expect your controller to solve the task? We need to give the controller a chance to succeed! I would increase the time constant (or something similar) in the guidance system to make sure that the controller's task is actually physically achievable.
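As a concrete example of what I mean by increasing the time constant: a first-order low-pass filter on the guidance output already smooths out sharp turns (a sketch; the time constant T and the step dt are values you would have to tune for your vessel):

```python
def filter_reference(ref_filtered, ref_raw, dt, T=5.0):
    """First-order low-pass filter on a reference signal.
    A larger time constant T gives a smoother, less aggressive reference
    that the vessel actually has a chance of tracking."""
    return ref_filtered + (dt / T) * (ref_raw - ref_filtered)
```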