simensov / ml4ca

Code base for "Dynamic Positioning using Deep Reinforcement Learning". Paper: https://www.sciencedirect.com/science/article/pii/S0029801821008398 - Thesis: https://ntnuopen.ntnu.no/ntnu-xmlui/handle/11250/2731248

training #19

Open waynezw0618 opened 3 years ago

waynezw0618 commented 3 years ago

Hi Simen,

I found your paper in Ocean Engineering, which is very interesting and exactly what I am looking for, so I set up something similar with a boat driven by two waterjet units in simulation. As mentioned in your paper, I select eta in the error frame, as well as the velocities, as the state, and define the reward as a sum of two Gaussian functions whose shape is close to yours. For the actions, I limit the jet propulsion forces and angles to the range [-1, 1], which is then scaled to the physical values in the simulation.

Before training I ran some tests of the boat environment; I can run turning and zigzag maneuvers. For training, I set the boat at random positions around 50 boat lengths away from the origin, but I never get converged results: for each episode the reward stays around the value at the boundary, which is far away from the peak. I would appreciate it if you could provide some tips and tricks for such a case. Also, could you let me know how you estimated the time scale of a run and the number of actions per episode?
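For concreteness, here is a minimal sketch (not code from this repository) of a reward built as a sum of two Gaussians together with the [-1, 1]-to-physical action scaling described above; all names, limits, and widths are illustrative assumptions:

```python
import numpy as np

def reward_sum_of_gaussians(pos_err, yaw_err,
                            sigma_pos=1.0, sigma_yaw=np.deg2rad(10.0)):
    """Sum of two Gaussians: one over the position error (error frame) and
    one over the heading error. Peaks at 2 when both errors are zero."""
    r_pos = np.exp(-0.5 * (pos_err / sigma_pos) ** 2)
    r_yaw = np.exp(-0.5 * (yaw_err / sigma_yaw) ** 2)
    return r_pos + r_yaw

def scale_action(a, low, high):
    """Map a policy output in [-1, 1] to a physical actuator range."""
    a = np.clip(a, -1.0, 1.0)
    return low + 0.5 * (a + 1.0) * (high - low)

# Example: two waterjets, each with a thrust limit and a nozzle-angle limit
thrust = scale_action(np.array([0.3, -0.7]), low=-5000.0, high=5000.0)            # [N]
angles = scale_action(np.array([0.1, 0.4]), low=np.deg2rad(-30), high=np.deg2rad(30))
```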

Best regards Wei

simensov commented 3 years ago

Hello!

I would recommend checking out Chapter 5.4 of my thesis to see how I argued for the selection of all training parameters and DRL hyperparameters: https://ntnuopen.ntnu.no/ntnu-xmlui/handle/11250/2731248

Some general comments though:

Hope that helps!

waynezw0618 commented 3 years ago

Hi Simen, thanks for replying.

simensov commented 2 years ago

My point was that I found it more effective to use a multivariate Gaussian instead of a sum of two Gaussians :) I got better results with the former, and I also found it easier to develop a reward shape that gave a more predictable outcome from the learning.
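A minimal sketch of such a multivariate Gaussian reward over the combined error vector; the covariance values and peak height below are placeholders, not the values used in the thesis:

```python
import numpy as np

# Single bell over the full error vector e = [x_err, y_err, yaw_err] in the
# error frame. The diagonal covariance entries control how sharply the
# reward decays in each direction (placeholder values).
SIGMA = np.diag([1.0, 1.0, np.deg2rad(10.0)]) ** 2

def reward_multivariate_gaussian(e, peak=2.0):
    """Multivariate Gaussian reward, peaking at `peak` when e = 0."""
    return peak * np.exp(-0.5 * e @ np.linalg.inv(SIGMA) @ e)
```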

135 degrees away from the setpoint is still a lot. Have you tried just resetting the vessel to within ±20 degrees of the setpoint and seeing if that works? If that doesn't work, it is hard to expect more difficult problems to be solved. Also, I believe that heading control during DP using only two waterjets (in the stern?) is not the easiest task (it makes the problem underactuated, I presume?). So try to make the learning task as simple as possible, and build from there.
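A sketch of what such a restricted reset could look like, assuming a simple pose state [x, y, yaw]; the bounds and names are illustrative assumptions only:

```python
import numpy as np

def sample_easy_initial_state(pos_setpoint, yaw_setpoint,
                              max_pos_err=1.0, max_yaw_err_deg=20.0):
    """Sample an initial pose within +-max_yaw_err_deg of the target heading
    and within a small box around the target position. Widen these bounds
    gradually once the easier task is learned (a simple curriculum)."""
    pos = pos_setpoint + np.random.uniform(-max_pos_err, max_pos_err, size=2)
    yaw = yaw_setpoint + np.deg2rad(
        np.random.uniform(-max_yaw_err_deg, max_yaw_err_deg))
    return np.array([pos[0], pos[1], yaw])
```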

waynezw0618 commented 2 years ago

hello @simensov

Best regards Wei

waynezw0618 commented 2 years ago

Hello @simensov

I did a training run with a bow thruster in the simulator and got a sudden increase of the reward from 100 to 200 per episode. In my case the maximum reward per step should be 2, and since I have 400 steps per episode I would expect something just below 800. Is 200 too small?
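As a quick sanity check of that return scale (using the numbers from the comment above):

```python
# Best possible episode return = max per-step reward * steps per episode.
max_step_reward = 2.0
steps_per_episode = 400
max_return = max_step_reward * steps_per_episode   # = 800
observed_return = 200.0
print(f"fraction of maximum achieved: {observed_return / max_return:.2f}")  # 0.25
```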

Would you please take a look at the plots from TensorBoard and let me know what I can do to improve? [TensorBoard screenshot: WechatIMG20478]

Best regards Wei

simensov commented 2 years ago

Hi

waynezw0618 commented 2 years ago

Hi
