reiniscimurs / DRL-robot-navigation

Deep Reinforcement Learning for mobile robot navigation in the ROS Gazebo simulator. Using a Twin Delayed Deep Deterministic Policy Gradient (TD3) neural network, a robot learns to navigate to a random goal point in a simulated environment while avoiding obstacles.
MIT License

When the robot is approaching the target point, it will stop there and keep rotating left and right #80

Closed: reiniscimurs closed this 3 months ago

reiniscimurs commented 8 months ago
Hello, during evaluation, when the robot approaches the target point it stops there and keeps rotating left and right without actually reaching the goal. Outside of the evaluation runs the problem does not seem to occur. What could be causing this?

Originally posted by @pzhabc in https://github.com/reiniscimurs/DRL-robot-navigation/issues/79#issuecomment-1790372961

reiniscimurs commented 8 months ago

@pzhabc

Please open new issues for problems you are facing if there is no open related issue.

Please provide full information about the problem you are facing and what steps you have taken to install the repo, what changes you have made (if any) and what have you tried to solve the issue.

pzhabc commented 8 months ago

Ok, thank you. I have re-debugged it and the problem no longer appears. However, I changed the maximum detection range of the lidar to 3.5 m, which I think is more in line with a real-life sensor, so the state becomes partially observable. So far I have trained for 1700 timesteps, but the average Q and maximum Q values have been decreasing. Can you give me some ideas?

reiniscimurs commented 8 months ago

Hi,

Are you sure it is only 1700 timesteps? That would mean only about 3 epochs of training, which is not enough for the model to converge.

What most likely happens is that the model is in some local optimum around the goal position at that stage in training. The reason the issue does not appear during training steps is the random exploration noise. The noise adds a value to the calculated linear velocity, which pushes the robot closer to the goal and over the goal-reaching threshold. In evaluation steps there is no added noise, so the linear velocity is exactly what the policy calculates. Most likely the policy outputs a linear velocity of 0 at this stage, so the goal is never reached. If you really are training for only 1700 timesteps, you most likely just need to train longer.

You can always just try a different random seed as well.
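
To make this concrete, here is a minimal sketch of how the exploration noise changes the executed action during training but not during evaluation. It is only an illustration: the names expl_noise, max_action and action_dim mirror the repo's TD3 training script, but the values and the policy output below are made up.

```python
import numpy as np

# Illustrative values; the real ones live in TD3/train_velodyne_td3.py,
# where expl_noise is also decayed over the course of training.
expl_noise = 1.0
max_action = 1.0
action_dim = 2

def training_action(policy_action):
    # Gaussian exploration noise is added to the raw actor output; it can
    # push a near-zero linear velocity over the goal-reaching threshold.
    noise = np.random.normal(0, expl_noise, size=action_dim)
    return (policy_action + noise).clip(-max_action, max_action)

def evaluation_action(policy_action):
    # No noise: the robot drives with exactly what the policy outputs.
    # A policy stuck in a local optimum with ~0 linear velocity near the
    # goal will just rotate in place.
    return policy_action.clip(-max_action, max_action)

policy_action = np.array([0.02, 0.6])   # [linear, angular] from the actor
print(training_action(policy_action))   # jittered, may cross the threshold
print(evaluation_action(policy_action)) # deterministic, linear stays ~0
```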

pzhabc commented 8 months ago

Ok, thank you, and I'm terribly sorry for my mistake: "iter_count" is 1700. I have been training for a while now, and the average and maximum Q values are starting to rise again. Can this kind of algorithm navigate to the target point in a partially observable state? Most of the articles I have seen seem to give little consideration to the partially observable case. Is it considered more in real-world applications?

reiniscimurs commented 8 months ago

With 1700 iterations that would still be in the region of 30 to 40 epochs, which is probably still a bit low. But in any case, you most likely just need a new training cycle where the policy does not fall into a local minimum.

If you want to solve a POMDP, I do not think you can achieve that just by limiting the laser sensor to 3.5 meters. There is no major difference between using a laser with any max range. You still get full observability in that range and there is no uncertainty. Also, the TD3 learning method assumes sample independence, which means that any state-action pair knows nothing about other states during the policy update, so there is no way to update a belief over the actual state.

Most of the articles I have seen seem to give little consideration to the partially observable case. Is it considered more in real-world applications?

I do not understand the question. In any case, you can deploy the trained policy to a real-world setting, and it can navigate from a limited view quite easily as a local motion planner.

pzhabc commented 8 months ago

There is no major difference between using a laser with any max range. You still get full observability in that range and there is no uncertainty.

Hello, thank you for your answer, but I still have a question. If the lidar range is limited to 3.5 m, shouldn't the state observed by the robot be only a part of the environment? The entire environment is then only partially observable to the robot. Doesn't this count as a POMDP?

I do not understand the question. In any case, you can deploy the trained policy to a real-world setting, and it can navigate from a limited view quite easily as a local motion planner.

Can it be deployed directly to a real scene without matching the maximum measuring distance of the lidar used in simulation? For example, in the same 10 m x 10 m room, the maximum measurement distance of the lidar in simulation is 100 m, while the real lidar's range is 3.5 m. In this case, can the trained policy be deployed directly?

reiniscimurs commented 8 months ago

Hello, thank you for your answer, but I still have a question. If the lidar range is limited to 3.5 m, shouldn't the state observed by the robot be only a part of the environment? The entire environment is then only partially observable to the robot. Doesn't this count as a POMDP?

Because of state independence you have full observability of the environment for that specific timestep. Your environment is effectively just the 3.5 meter range, and within it everything is fully observable without uncertainty; it is not the complete world the robot is located in. As far as I am aware, you would require some sort of memory construct to have any knowledge of the full working environment the robot operates in. You can read more on that, and one possible solution, here: https://arxiv.org/pdf/2102.12344.pdf

Can it be deployed directly to a real scene without matching the maximum measuring distance of the lidar used in simulation? For example, in the same 10 m x 10 m room, the maximum measurement distance of the lidar in simulation is 100 m, while the real lidar's range is 3.5 m. In this case, can the trained policy be deployed directly?

Yes it can (as can be seen in the article and in the videos: https://youtu.be/MhuhsSdzZFk?si=l0q25I_s2P3SOViy). Even if the laser settings state a 100 meter max range, in the training part we actually cap it to 10 meters in: https://github.com/reiniscimurs/DRL-robot-navigation/blob/main/TD3/velodyne_env.py#L137 So the maximum considered range in this training setting is actually 10 meters. Even if a value is returned as inf, it is replaced with the value 10. By the way, if you changed your laser's maximum reading value to 3.5 meters, you would also need to update this capping value.
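
As an illustration, the capping boils down to something like the sketch below. This is a standalone example, not the exact code at the linked line; the helper name and the inf handling are assumptions.

```python
import numpy as np

MAX_LASER_RANGE = 10.0  # cap used during training; change it too if the
                        # real sensor only reads up to 3.5 m

def cap_laser_state(raw_ranges, max_range=MAX_LASER_RANGE):
    # Replace inf / out-of-range readings with the cap so the state the
    # policy sees is always bounded by the same maximum value.
    ranges = np.asarray(raw_ranges, dtype=float)
    ranges[np.isinf(ranges)] = max_range
    return np.minimum(ranges, max_range)

# A simulated reading of 100 m and a real inf reading both enter the state
# as 10 m, which is why the two sensors look the same to the policy.
print(cap_laser_state([0.8, 3.2, 100.0, np.inf]))  # -> [ 0.8  3.2 10.  10. ]
```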

pzhabc commented 8 months ago

Ok, thanks for your help. Would it be better to start training only once the replay buffer has reached a certain size?

reiniscimurs commented 8 months ago

There are a couple of bootstrapping methods available in this repo; feel free to add more. Filling the replay buffer before starting training is a viable option, but I have never needed it to train this method from scratch. It would certainly make sense to have more varied samples at the beginning of training, but it would most likely only have a real impact if they were meaningful samples, meaning episodes that actually reach the goal. You might want to use some rule-based method (or manually guide the robot) to collect such samples. You can read a bit about such an implementation here: https://www.mdpi.com/1424-8220/22/20/7750
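
If you do want to try it, a rough sketch of pre-filling could look like this. env, replay_buffer and scripted_policy are placeholders here, and the argument order of replay_buffer.add should be checked against TD3/replay_buffer.py in your checkout.

```python
def prefill_buffer(env, replay_buffer, scripted_policy, min_samples=10_000):
    # Collect transitions with a rule-based controller (or manual
    # teleoperation) so that the stored episodes actually reach the goal
    # and are meaningful for the first policy updates.
    collected = 0
    state = env.reset()
    while collected < min_samples:
        action = scripted_policy(state)
        next_state, reward, done, target = env.step(action)
        replay_buffer.add(state, action, reward, float(done), next_state)
        collected += 1
        state = env.reset() if done else next_state
    return collected
```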

pzhabc commented 8 months ago

Okay, thank you.