reiniscimurs / DRL-robot-navigation

Deep Reinforcement Learning for mobile robot navigation in ROS Gazebo simulator. Using Twin Delayed Deep Deterministic Policy Gradient (TD3) neural network, a robot learns to navigate to a random goal point in a simulated environment while avoiding obstacles.
MIT License

The final trained policy model #91

Closed am-amani closed 2 months ago

am-amani commented 6 months ago

Thank you for your well-written documentation. Is the final trained model policy available to download?

reiniscimurs commented 6 months ago

Hi,

No, I have not uploaded trained model weights. If the repo is set up correctly, it should be able to train the model in a day or less though.

am-amani commented 6 months ago

Hi Reinis, I have an RTX 2080 GPU on my system, and I calculated it would take 188 hours to complete. Would it be possible for you to upload the weights to Google Drive and share them via email, or share the link on your GitHub page?

reiniscimurs commented 6 months ago

I am training on a GeForce GTX 1660 Ti on a laptop and it trains in about a day for me. I assume the 188-hour estimate is a reach. In any case, I do not have trained weights for this repo right now and I do not plan to train it any time soon, so I do not think I could help you with an already pre-trained model.

am-amani commented 6 months ago

Ok, thanks, one last question: did you modify the training code, specifically the number of iterations? Or is it as it is in your repository?

reiniscimurs commented 6 months ago

You do not have to run the whole iteration count. You can stop the training when you feel the model has learned the behavior you want. I would usually train for about 100 epochs.

am-amani commented 6 months ago

Thank you for letting me know :)

JavierMtz5 commented 3 months ago

Hi Reinis, congrats on your work and thanks for making it public, it is super interesting and useful!

I would like to know whether, when you train for about 100 epochs and the training converges, you change the exploration_decay_steps parameter. I am asking because the training is originally intended to run for 1M timesteps, but 100 epochs will cover fewer timesteps. As expl_decay_steps is set to 500000, for 100-epoch trainings the exploration noise might still be very high by the time the training is done (100 epochs).

I would like to know how this problem affects the training process.

Thank you so much for your help!

JavierMtz5 commented 3 months ago

Hi again Reinis,

I would also like to ask if you remember approximately how many epochs it took before the robot started reaching the goal and avoiding obstacles. I trained for 65 epochs, but the robot did not show any intelligent behavior: it just spun in place or drove straight into a wall, nothing close to reaching the goal or avoiding walls. I would like to know if this is due to a bad weight initialization (I think it is not, as I trained with 3 different seeds and all 3 trainings showed the same behavior), or if I just need to wait a bit longer to see the expected behavior. I am a bit worried because reaching 65 epochs takes so long, and when I get to that point I don't know whether I should already expect good behavior, or whether it is normal for the robot to still be lost and colliding. When the robot reaches 65 epochs without intelligent behavior, do you think I should wait until 100 epochs, or just restart the training with a different seed? (Any other ideas are welcome!)

Thank you in advance for your help!

reiniscimurs commented 3 months ago

Hi,

I am not quite sure what the question is. We apply random noise to force the robot to explore actions that it otherwise wouldn't take if it simply followed the policy. This decays over 500,000 steps. There are at least 5000 steps in one epoch, so over 100 epochs it would decay to the minimal value. The value we are decaying is the standard deviation of the Gaussian distribution, so in most cases we will still get values of the policy or very close to it. By 100 epochs it will have decayed to the expl_min value.
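For context, a minimal sketch of what such a linear decay looks like, using the parameter names from this thread (expl_noise, expl_min, expl_decay_steps); the exact values and update rule in the repo may differ:

```python
import numpy as np

# Assumed values following the discussion above; the repo's defaults may differ.
expl_noise = 1.0           # initial std of the Gaussian noise added to actions
expl_min = 0.1             # floor that the noise std decays to
expl_decay_steps = 500000  # steps over which the std decays linearly

def noisy_action(policy_action, noise_std, max_action=1.0):
    """Add zero-mean Gaussian noise to the policy action and clip to the valid range."""
    action = policy_action + np.random.normal(0, noise_std, size=policy_action.shape)
    return np.clip(action, -max_action, max_action)

# Inside the training loop: shrink the std by a fixed amount every step,
# so it reaches the minimum after expl_decay_steps steps.
if expl_noise > expl_min:
    expl_noise = max(expl_min, expl_noise - (1.0 - expl_min) / expl_decay_steps)
```

With at least 5000 steps per epoch, the std therefore reaches expl_min by roughly 100 epochs, which is the point made above.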

You should see reasonable behavior in around 20 to 40 epochs, at least some obstacle avoidance. If that is not the case, restart the training or debug to see if there are any issues.

JavierMtz5 commented 3 months ago

Thank you so much for your reply and help!

Sorry if I was not clear enough. I just wanted to know whether, when running a training for half the maximum timesteps (500K, i.e. 100 epochs), expl_decay_steps should also be halved, so that the noise reaches its minimum value halfway through the training (that way we would explore and exploit in a balanced way), just as it happens with the default max_timesteps and expl_decay_steps values (1M and 500K). Either way, your answer helped me understand the issue, so I have no more doubts regarding this topic.
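If one did want the noise to bottom out midway through a shorter run, a simple option is to keep the default ratio (500K out of 1M) and scale expl_decay_steps to the planned budget. A hypothetical sketch, where steps_per_epoch and planned_epochs are assumptions for illustration and not repo parameters:

```python
steps_per_epoch = 5000   # lower bound mentioned above; real epochs can be longer
planned_epochs = 100
planned_timesteps = planned_epochs * steps_per_epoch        # 500,000 steps

decay_fraction = 0.5     # same ratio as the defaults: 500K decay steps out of 1M
expl_decay_steps = int(planned_timesteps * decay_fraction)  # 250,000 steps
```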

Regarding the number of epochs at which the robot starts to show intelligent behavior, I will try different seeds, as the training and the metrics look good, which makes me think it is a matter of weight initialization.

Again, thank you so much for your help!

reiniscimurs commented 3 months ago

The maximum number of steps is simply a large number of steps during which the training should run. It is not indicative of how long the training should be run and is just a placeholder. Since the initialization is random and training is somewhat unstable, there is no set number of steps that guarantees convergence. So we simply use a very large step count here to ensure that training does not run forever. But if it has converged earlier, the training should be stopped at that point.
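In other words, max_timesteps acts only as a hard cap, and the run is cut short once the behavior has converged. A rough sketch of that stopping logic; evaluate_policy() and the success threshold are hypothetical placeholders, not repo code:

```python
def evaluate_policy() -> float:
    # Hypothetical placeholder: run a few evaluation episodes without
    # exploration noise and return the fraction that reached the goal.
    return 0.0

max_timesteps = 1_000_000   # hard cap so training never runs forever
eval_every = 5000           # roughly one epoch
success_threshold = 0.9     # stop once 90% of eval episodes reach the goal

timestep = 0
while timestep < max_timesteps:
    # ... collect one environment step and update the TD3 networks here ...
    timestep += 1

    if timestep % eval_every == 0:
        if evaluate_policy() >= success_threshold:
            print(f"Converged at step {timestep}, stopping early.")
            break
```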