waynezw0618 opened this issue 3 years ago
Hello!
I would recommend checking out Chapter 5.4 of my thesis to see how I argued for the selection of all training parameters and DRL hyperparameters: https://ntnuopen.ntnu.no/ntnu-xmlui/handle/11250/2731248
Some general comments though:
Hope that helps!
Hi Simen, thanks for replying. I use the `arctan2` function to get the angle.

My point was that I found it more effective to use a multivariate Gaussian instead of a sum of two Gaussians :) I got better results with the former, and I also found it easier to develop a reward shape that gave a more predictable outcome from the learning.
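To illustrate the difference I mean, here is a rough sketch (the scales and weights are made-up numbers, not the values from the thesis):

```python
import numpy as np

# err is the [x, y] position error in the error frame.
def reward_sum_of_gaussians(err, sigma=1.0):
    # Sum of two 1-D Gaussians: separable in x and y, so a large error in
    # one axis can still collect reward from the other axis.
    return np.exp(-err[0]**2 / (2 * sigma**2)) + np.exp(-err[1]**2 / (2 * sigma**2))

def reward_multivariate_gaussian(err, sigma=1.0):
    # Single 2-D Gaussian: reward decays with the joint distance to the
    # setpoint, giving one well-defined peak at the origin of the error frame.
    return 2.0 * np.exp(-(err[0]**2 + err[1]**2) / (2 * sigma**2))
```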
135 degrees away from the setpoint is still a lot. Have you tried just resetting the vessel to within ±20 degrees of the setpoint, to see if that works? If it doesn't, then it is hard to expect that more difficult problems will be solved. Also, I believe that heading control during DP using only two water jets (in the stern?) is not the easiest task (it makes the problem underactuated, I presume?). So try to make the learning task as simple as possible, and build from there.
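In code terms, the kind of reset I am suggesting is simply (illustrative only, not code from the repo):

```python
import numpy as np

# Start every episode with the heading already within +-20 degrees of the
# setpoint; the sampling scheme is just an illustration.
def sample_initial_heading(psi_setpoint):
    return psi_setpoint + np.deg2rad(np.random.uniform(-20.0, 20.0))
```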
Hello @simensov,

For the reward, what I am using is here:
In Section 5.4.8 you wrote: "one episode could be no longer than 400 time steps, ... the minibatch size ... should be at least 2000. This lead to that, if ..., the parameter updates were done with 5 trajectories of size 400 time steps." But in Table 5.4 you set Max time steps per epoch = 1600. Is there anything I missed here? I suppose batch_size = 2000 time steps, i.e. 5 episodes; is that different from Max time steps per epoch? I can only find https://github.com/simensov/ml4ca/blob/e2e75f3785455a29faa92881870605590e07425f/src/rl/windows_workspace/train.py#L42-L44
I suppose 400 should be the value for `max_ep_len`, and that `--epoch` corresponds to the Number of epochs = 1500 in Table 5.4. But I don't know whether `--steps` corresponds to the Max time steps per epoch in Table 5.4, or to the 2000 time steps of the 5 trajectories. Could you tell me which values go with max time steps, batch_size, and Max time steps per epoch?
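To make my confusion concrete, here is how I currently read the relation between the arguments (my interpretation of Table 5.4 and Section 5.4.8, not values I have confirmed, so please correct me if they are wrong):

```python
steps_per_epoch = 2000   # --steps: environment steps collected per policy update?
max_ep_len      = 400    # longest allowed episode before a forced reset
epochs          = 1500   # number of policy updates over the whole training run

# With these values, each update buffer would hold at least
# steps_per_epoch / max_ep_len full-length trajectories.
print(steps_per_epoch // max_ep_len)  # 5
```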
In Section 5.4.10 you wrote: "Since the largest instantaneous reward achievable from Rtot in Equation (5.18) was r∗ = 3.5, the maximum episodal return was 1400 over the possible 400 time steps within each episode." From Figure 5.8(a), this maximum episodal return is about 1200. But shouldn't we take the discount into consideration, which would make the value much smaller?
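For example, with gamma = 0.99 (an assumption on my side; I do not know the exact discount factor you used), the discounted maximum would be far below the undiscounted 400 × 3.5 = 1400:

```python
# Geometric series: sum_{t=0}^{T-1} gamma^t * r_star
gamma, r_star, T = 0.99, 3.5, 400
discounted_max = r_star * (1 - gamma**T) / (1 - gamma)
print(discounted_max)  # ~343.7
```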
How about the reset: should we reset the initial state during training every episode, or every epoch?
In the following line you update the reference `new_ref` used to calculate the state in the error frame: https://github.com/simensov/ml4ca/blob/e2e75f3785455a29faa92881870605590e07425f/src/rl/windows_workspace/specific/customEnv.py#L131. How do you get `new_ref`? I mean, is it based on time, i.e. a function of time? That would place a strong constraint on the velocity. Or is it based on location, i.e. getting y from x? Which do you think is the most suitable?
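To illustrate what I mean by the two options (both sketches are my guesses, not your actual customEnv.py logic):

```python
import numpy as np

def time_based_reference(ref_prev, setpoint, dt, tau=5.0):
    # Option 1: new_ref as a function of time. A first-order reference model
    # moves the commanded pose toward the setpoint with time constant tau,
    # which implicitly constrains how fast the vessel must move.
    return np.asarray(ref_prev) + (dt / tau) * (np.asarray(setpoint) - np.asarray(ref_prev))

def location_based_reference(x, path):
    # Option 2: new_ref as a function of location, e.g. looking up the
    # desired y for the current x along a predefined path
    # (path is an (N, 2) array of waypoints with increasing x).
    return np.interp(x, path[:, 0], path[:, 1])
```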
For the water jets, the problem is underactuated, as you said. But I can add an artificial bow thruster in the simulator as a starting case. We will see whether that is a good starting point for the simulator.
Best regards, Wei
Hello @simensov
I did a training run with the "bow thruster" in the simulator. I got a sudden increase in reward, from 100 to 200 per episode. In my case the max reward per step should be 2, and since I have 400 steps per episode, I would expect something like <800. Isn't 200 too small?
Would you please take a look at the plots from TensorBoard and see what I could do to improve?
Best regards, Wei
Hi, regarding `real_ss_bounds`: how did you determine these values? These values may be somehow correlated with the bounds on the actions and states, is that right? Besides, I saw that for my setpoint case, `real_ss_bounds` leads to many resets during training. I mean, `real_ss_bounds` somehow limits the training; could one benefit from that? And how about `self.real_action_bound`?
Remember that the state space is in the error frame, so I chose a distance away from the setpoint (which is always represented as [0, 0, 0] in the error frame) that I assumed was realistic for DP to stay within. Attention was also given to how fast the reference model had been tuned to be on the full system, as I had to make sure that the error between the reference model's commanded pose and the actual pose did not grow larger than the state space bounds that the DRL model had been trained on. But this was easily verified, and gave good results.
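In rough code terms, the idea is something like this (the numbers and names are illustrative, not the exact customEnv.py implementation):

```python
import numpy as np

# The setpoint is always the origin of the error frame, and an episode
# resets when the error leaves the chosen state-space bounds.
real_ss_bounds = np.array([5.0, 5.0, np.deg2rad(20.0)])  # assumed [m, m, rad] limits

def error_frame_state(eta, eta_ref):
    """Pose error relative to the commanded pose, so the goal is [0, 0, 0]."""
    err = np.asarray(eta, dtype=float) - np.asarray(eta_ref, dtype=float)
    err[2] = np.arctan2(np.sin(err[2]), np.cos(err[2]))  # wrap heading to [-pi, pi]
    return err

def out_of_bounds(err):
    return bool(np.any(np.abs(err) > real_ss_bounds))
```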
Hi Simen,

I found your paper in Ocean Engineering, which is very interesting and exactly what I am looking for, so I set up something similar with a two-waterjet boat in simulation. As in your paper, I selected eta in the error frame, together with the velocities, as the state, and defined the reward as a sum of two Gaussian functions whose shape is close to yours. For the actions, I limit the jet propulsion forces and angles to the range [-1, 1], then scale them to the physical values in the simulation.

Before training, I ran some tests of the boat environment; I can run turning and zigzag maneuvers. For training, I place the boat at a random position, up to 50 boat lengths from the origin in both directions. But I never get converged results: for each episode, the reward stays around the value of the reward at the boundary, far away from the peak.

I would appreciate it if you could provide some tips and tricks for such a case. Besides, would you please let me know how you estimated the running time scales and the number of actions per episode?
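For the action scaling, what I mean is roughly the following (the bounds below are placeholder numbers, not my actual simulator values):

```python
import numpy as np

# The policy outputs actions in [-1, 1], which I map linearly to the
# physical jet force and nozzle angle before passing them to the simulator.
F_MAX = 1000.0                # placeholder max jet force [N]
ANGLE_MAX = np.deg2rad(30.0)  # placeholder max nozzle angle [rad]

def scale_action(a):
    a = np.clip(a, -1.0, 1.0)
    return np.array([a[0] * F_MAX, a[1] * ANGLE_MAX])
```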
Best regards, Wei