reiniscimurs / DRL-robot-navigation

Deep Reinforcement Learning for mobile robot navigation in ROS Gazebo simulator. Using Twin Delayed Deep Deterministic Policy Gradient (TD3) neural network, a robot learns to navigate to a random goal point in a simulated environment while avoiding obstacles.
MIT License

About changing models #113

Closed oycool closed 2 months ago

oycool commented 4 months ago

Hi Reinis Cimurs, first of all, thank you very much for sharing. I made some modifications to the model. After about 14 training episodes (each ending in a collision or in reaching the goal), the action becomes NaN. This is difficult for a newbie to debug; can you give some suggestions based on your experience? Looking forward to your reply.

reiniscimurs commented 4 months ago

Hi,

Please provide details. I won't know anything without any information.

oycool commented 4 months ago

Thank you very much for your reply. I have solved this problem. I have one more question: I trained for 1000 episodes, which generated files in 'pytorch_models'. If I restart training, will those files be overwritten, or will training continue to write into them?

reiniscimurs commented 4 months ago

Yes, they will be overwritten.
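For context, a minimal sketch of what that overwriting looks like in plain PyTorch and how one could keep or reuse the 1000-episode weights. The file names follow the `<name>_actor.pth` pattern this repo appears to use, but they and the stand-in actor network are assumptions, not the repo's exact code:

```python
import os
import torch

# Stand-in for the actual actor network saved by the TD3 class.
actor = torch.nn.Linear(24, 2)
os.makedirs("./pytorch_models", exist_ok=True)

# Saving again under the same name overwrites the previous checkpoint:
torch.save(actor.state_dict(), "./pytorch_models/TD3_velodyne_actor.pth")

# To keep the 1000-episode result, rename it or save it under a new name first:
torch.save(actor.state_dict(), "./pytorch_models/TD3_velodyne_actor_ep1000.pth")

# To continue training from the stored weights instead of from scratch,
# load them into the networks before the training loop starts:
actor.load_state_dict(torch.load("./pytorch_models/TD3_velodyne_actor.pth"))
```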

oycool commented 4 months ago

Thank you for your answer.

oycool commented 4 months ago

Hi Reinis Cimurs, I've run into a problem again. I changed the robot model to an Ackermann-steering car and used a single-line lidar. After I modified the reward function, the car would at first drive for a long time in each episode before colliding, and training was slow. However, after 663 episodes of training, the car started rushing straight into obstacles, and Max_Q and Loss also fluctuated greatly. Is this a normal phenomenon? (Reward and TensorBoard plots were attached.)

oycool commented 4 months ago

I added the number of steps the car takes in an episode and its proximity to the target to the reward function: when the step count exceeds 350, a penalty is given, and when the distance to the goal is less than 1, a reward is given for approaching the target.
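A rough sketch of the shaping described in that comment (step penalty past 350 steps, bonus within 1 m of the goal). The base term loosely follows the shape of the repo's default reward, but the penalty/bonus magnitudes and the extra argument names here are hypothetical:

```python
def get_reward(target, collision, action, min_laser, distance, episode_timesteps):
    """Sketch of the modified reward: base shaping plus step penalty and proximity bonus."""
    if target:
        return 100.0
    if collision:
        return -100.0

    # Base shaping similar in spirit to the repo's default: favor forward speed,
    # discourage turning and getting close to obstacles.
    obstacle_term = (1 - min_laser) / 2 if min_laser < 1 else 0.0
    reward = action[0] / 2 - abs(action[1]) / 2 - obstacle_term

    if episode_timesteps > 350:   # penalize overly long episodes
        reward -= 1.0
    if distance < 1.0:            # bonus once within 1 m of the goal
        reward += 1.0

    return reward
```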

reiniscimurs commented 4 months ago

Are you giving the timestep information to the model as part of the state?

If not, think about how the model would be able to learn the value of a state-action pair. Consider two exactly identical state-action pairs, one occurring at step 100 and the other at step 400. The model needs to estimate the value of this state-action pair, but in the two cases the reward you return is different. It has no information about what changed, so it cannot apply a meaningful gradient; the reward is just noisy at that point. So consider whether you really need this kind of reward formulation.

oycool commented 4 months ago

If steps are taken into account, do I need to add 'episode_timesteps' to the 'state' array and then pass it to the model to get 'action'?

reiniscimurs commented 4 months ago

I would expect so. I guess you are trying to force the policy to reach the goal as fast as possible, but I would think a bit about whether this is a good approach.
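A minimal sketch of what adding the timestep to the state could look like, assuming the usual laser-readings-plus-robot-state layout of this repo's environment; the array sizes, placeholder values, and the `max_ep` normalization constant are assumptions:

```python
import numpy as np

max_ep = 500  # assumed maximum episode length, used to normalize the timestep

laser_state = np.random.uniform(0.3, 10.0, 20)   # placeholder laser readings
robot_state = np.array([2.5, 0.7, 0.3, 0.1])     # distance, angle, lin. vel., ang. vel.
episode_timesteps = 120

# Append the normalized timestep so the policy can observe episode progress.
state = np.concatenate([laser_state, robot_state, [episode_timesteps / max_ep]])

# Remember to also increase state_dim by 1 when constructing the TD3 networks,
# otherwise the actor/critic input size will not match the new state.
```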

oycool commented 4 months ago

Yes, because I trained for 40 epochs and 3000 episodes before, but the robot still couldn't reach the goal.

reiniscimurs commented 4 months ago

Generally, it should work right out of the box. If not, you could try to change the seed value to get different initialization weights.
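If you do try a different seed, a minimal sketch of seeding the relevant random generators; the repo's training script sets a seed near the top of the file, but the exact value and which generators it covers may differ from this:

```python
import random

import numpy as np
import torch

seed = 42  # arbitrary value; change it to get different initialization weights
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
```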

oycool commented 4 months ago

OK, thanks for your reply.