waynezw0618 opened this issue 3 years ago
Hello!
I would recommend checking out Chapter 5.4 of my thesis to see how I argued for the selection of all training parameters and DRL hyperparameters: https://ntnuopen.ntnu.no/ntnu-xmlui/handle/11250/2731248
Some general comments though:
Hope that helps!
Hi Simen, thanks for replying. I use the `arctan2` function to get the angle.

My point was that I found it more effective to use a multivariate Gaussian instead of a sum of two Gaussians :) I got better results with the former, and I also found it easier to develop a reward shape that gave a more predictable outcome from the learning.
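To illustrate the difference I mean, here is a rough sketch (the scales and weights are made-up numbers, not the values from the thesis):

```python
import numpy as np

# err is the [x, y] position error in the error frame.
def reward_sum_of_gaussians(err, sigma=1.0):
    # Sum of two 1-D Gaussians: separable in x and y, so a large error in
    # one axis can still collect reward from the other axis.
    return np.exp(-err[0]**2 / (2 * sigma**2)) + np.exp(-err[1]**2 / (2 * sigma**2))

def reward_multivariate_gaussian(err, sigma=1.0):
    # Single 2-D Gaussian: reward decays with the joint distance to the
    # setpoint, giving one well-defined peak at the origin of the error frame.
    return 2.0 * np.exp(-(err[0]**2 + err[1]**2) / (2 * sigma**2))
```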
135 degrees away from the setpoint is still a lot. Have you tried just resetting the vessel to within ±20 degrees of the setpoint, to see if that works? If it doesn't, then it is hard to expect that more difficult problems will be solved. Also, I believe that heading control during DP using only two water jets (in the stern?) is not the easiest task (it makes the problem underactuated, I presume?). So try to make the learning task as simple as possible, and build from there.
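In code terms, the kind of reset I am suggesting is simply (illustrative only, not code from the repo):

```python
import numpy as np

# Start every episode with the heading already within +-20 degrees of the
# setpoint; the sampling scheme is just an illustration.
def sample_initial_heading(psi_setpoint):
    return psi_setpoint + np.deg2rad(np.random.uniform(-20.0, 20.0))
```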
Hello @simensov,

For the reward, what I am using is here:
In Section 5.4.8 you wrote: "one episode could be no longer than 400 time steps, ... the minibatch size ... should be at least 2000. This lead to that, if ..., the parameter updates were done with 5 trajectories of size 400 time steps." But in Table 5.4 you set Max time steps per epoch = 1600. Is there anything I missed here? I suppose batch_size = 2000 time steps, i.e. 5 episodes; is that different from Max time steps per epoch? I can only find https://github.com/simensov/ml4ca/blob/e2e75f3785455a29faa92881870605590e07425f/src/rl/windows_workspace/train.py#L42-L44
I suppose 400 should be the value for `max_ep_len`, and that `--epoch` corresponds to the Number of epochs = 1500 in Table 5.4. But I don't know whether `--steps` corresponds to the Max time steps per epoch in Table 5.4, or to the 2000 time steps of the 5 trajectories. Could you tell me which values go with max time steps, batch_size, and Max time steps per epoch?
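To make my confusion concrete, here is how I currently read the relation between the arguments (my interpretation of Table 5.4 and Section 5.4.8, not values I have confirmed, so please correct me if they are wrong):

```python
steps_per_epoch = 2000   # --steps: environment steps collected per policy update?
max_ep_len      = 400    # longest allowed episode before a forced reset
epochs          = 1500   # number of policy updates over the whole training run

# With these values, each update buffer would hold at least
# steps_per_epoch / max_ep_len full-length trajectories.
print(steps_per_epoch // max_ep_len)  # 5
```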
In Section 5.4.10 you wrote: "Since the largest instantaneous reward achievable from Rtot in Equation (5.18) was r∗ = 3.5, the maximum episodal return was 1400 over the possible 400 time steps within each episode." From Figure 5.8(a), this maximum episodal return is about 1200. But shouldn't we take the discount into consideration, which would make the value much smaller?
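For example, with gamma = 0.99 (an assumption on my side; I do not know the exact discount factor you used), the discounted maximum would be far below the undiscounted 400 × 3.5 = 1400:

```python
# Geometric series: sum_{t=0}^{T-1} gamma^t * r_star
gamma, r_star, T = 0.99, 3.5, 400
discounted_max = r_star * (1 - gamma**T) / (1 - gamma)
print(discounted_max)  # ~343.7
```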
How about the reset: should we reset the initial state during training every episode, or every epoch?
In the following line you update the reference `new_ref` used to calculate the state in the error frame: https://github.com/simensov/ml4ca/blob/e2e75f3785455a29faa92881870605590e07425f/src/rl/windows_workspace/specific/customEnv.py#L131. How do you get `new_ref`? I mean, is it based on time, i.e. a function of time? That would place a strong constraint on the velocity. Or is it based on location, i.e. getting y from x? Which do you think is the most suitable?
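To illustrate what I mean by the two options (both sketches are my guesses, not your actual customEnv.py logic):

```python
import numpy as np

def time_based_reference(ref_prev, setpoint, dt, tau=5.0):
    # Option 1: new_ref as a function of time. A first-order reference model
    # moves the commanded pose toward the setpoint with time constant tau,
    # which implicitly constrains how fast the vessel must move.
    return np.asarray(ref_prev) + (dt / tau) * (np.asarray(setpoint) - np.asarray(ref_prev))

def location_based_reference(x, path):
    # Option 2: new_ref as a function of location, e.g. looking up the
    # desired y for the current x along a predefined path
    # (path is an (N, 2) array of waypoints with increasing x).
    return np.interp(x, path[:, 0], path[:, 1])
```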
For the water jets, the problem is underactuated, as you said. But I can add an artificial bow thruster in the simulator as a starting case. We will see whether that is a good starting point for the simulator.
Best regards, Wei
Hello @simensov
I did a training run with the "bow thruster" in the simulator. I got a sudden increase in reward, from 100 to 200 per episode. In my case the max reward per step should be 2, and since I have 400 steps per episode, I would expect something like <800. Isn't 200 too small?
Would you please take a look at the plots from TensorBoard and see what I could do to improve?
Best regards, Wei
Hi, regarding `real_ss_bounds`: how did you determine these values? These values may be somehow correlated with the bounds on the actions and states, is that right? Besides, I saw that for my setpoint case, `real_ss_bounds` leads to many resets during training. I mean, `real_ss_bounds` somehow limits the training; could one benefit from that? And how about `self.real_action_bound`?
Remember that the state space is in the error frame, so I chose a distance away from the setpoint (which is always represented as [0, 0, 0] in the error frame) that I assumed was realistic for DP to stay within. Attention was also given to how fast the reference model had been tuned to be on the full system, as I had to make sure that the error between the reference model's commanded pose and the actual pose did not grow larger than the state space bounds that the DRL model had been trained on. But this was easily verified, and gave good results.
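In rough code terms, the idea is something like this (the numbers and names are illustrative, not the exact customEnv.py implementation):

```python
import numpy as np

# The setpoint is always the origin of the error frame, and an episode
# resets when the error leaves the chosen state-space bounds.
real_ss_bounds = np.array([5.0, 5.0, np.deg2rad(20.0)])  # assumed [m, m, rad] limits

def error_frame_state(eta, eta_ref):
    """Pose error relative to the commanded pose, so the goal is [0, 0, 0]."""
    err = np.asarray(eta, dtype=float) - np.asarray(eta_ref, dtype=float)
    err[2] = np.arctan2(np.sin(err[2]), np.cos(err[2]))  # wrap heading to [-pi, pi]
    return err

def out_of_bounds(err):
    return bool(np.any(np.abs(err) > real_ss_bounds))
```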
Hi Simen,

I found your paper in Ocean Engineering, which is very interesting and exactly what I am looking for, so I set up something similar with a two-waterjet boat in simulation. As in your paper, I selected eta in the error frame, together with the velocities, as the state, and defined the reward as a sum of two Gaussian functions whose shape is close to yours. For the actions, I limit the jet propulsion forces and angles to the range [-1, 1], then scale them to the physical values in the simulation.

Before training, I ran some tests of the boat environment; I can run turning and zigzag maneuvers. For training, I place the boat at a random position, up to 50 boat lengths from the origin in both directions. But I never get converged results: for each episode, the reward stays around the value of the reward at the boundary, far away from the peak.

I would appreciate it if you could provide some tips and tricks for such a case. Besides, would you please let me know how you estimated the running time scales and the number of actions per episode?
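For the action scaling, what I mean is roughly the following (the bounds below are placeholder numbers, not my actual simulator values):

```python
import numpy as np

# The policy outputs actions in [-1, 1], which I map linearly to the
# physical jet force and nozzle angle before passing them to the simulator.
F_MAX = 1000.0                # placeholder max jet force [N]
ANGLE_MAX = np.deg2rad(30.0)  # placeholder max nozzle angle [rad]

def scale_action(a):
    a = np.clip(a, -1.0, 1.0)
    return np.array([a[0] * F_MAX, a[1] * ANGLE_MAX])
```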
Best regards, Wei