reiniscimurs / DRL-robot-navigation

Deep Reinforcement Learning for mobile robot navigation in ROS Gazebo simulator. Using Twin Delayed Deep Deterministic Policy Gradient (TD3) neural network, a robot learns to navigate to a random goal point in a simulated environment while avoiding obstacles.
MIT License

compare between 'done_bool' and 'bool' #34

Closed mincheulkim closed 1 year ago

mincheulkim commented 1 year ago

Hello, thank you for your fantastic work.

When the TD3 network learns the policy from the replay memory, 'done_bool' is stored in the replay memory instead of 'done'. As a result, TD3 treats a success or a collision as a terminal ('done') state, but not a timeout. Some other mobile robot navigation work treats the timeout as done and gives a negative reward. Is there a reason to use 'done_bool' and ignore the timeout case?
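
For concreteness, here is a rough sketch of the pattern I mean (variable names such as `max_episode_steps` are illustrative, not copied verbatim from the repository):

```python
target_reached, collision = False, False          # dummy episode outcomes
episode_timesteps, max_episode_steps = 499, 500   # last allowed step

done = target_reached or collision or (episode_timesteps + 1 == max_episode_steps)

# Timeouts are masked out: done_bool stays 0 so the critic still bootstraps
# from the next state, while success/collision terminate the value estimate.
done_bool = 0.0 if episode_timesteps + 1 == max_episode_steps else float(done)

print(done, done_bool)  # True 0.0 -> done_bool is what goes into the replay buffer
```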

reiniscimurs commented 1 year ago

Hi,

I have seen other implementations that do use the timeout as a done state, but for some of them I am rather skeptical that it actually works as intended, mostly because they do not include any time measure in the input state.

Consider a state = [1, 1, 1, 1, 1] that returns 5 laser values (here I just use 1 as a dummy value). The value of such a state would consist of reward + discounted future rewards (as per Bellman equation).

Now consider the same state = [1, 1, 1, 1, 1], but the timesteps have expired. That would set the done bool to true, and the state value becomes reward + 0 * discounted future rewards (again, per the Bellman equation).

As you can see, the state is the same, but its value has changed. It is difficult to learn a state value that is not at least somewhat consistent. Therefore, I think in such cases the current step number needs to be added to the state representation.
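
A minimal numerical sketch of that target (standard TD3/Bellman form, not code from this repository) shows the inconsistency:

```python
def td_target(reward: float, done: float, discount: float, next_q: float) -> float:
    """Critic target: reward plus discounted future value; done=1 removes the
    bootstrapped term entirely."""
    return reward + (1.0 - done) * discount * next_q

# Same observation, same estimated future value, but a timeout flips done to 1
# and the target the critic must regress to changes completely.
print(td_target(reward=1.0, done=0.0, discount=0.99, next_q=10.0))  # 10.9
print(td_target(reward=1.0, done=1.0, discount=0.99, next_q=10.0))  # 1.0
```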

However, I still (personally) do not see a reason to put a timer on the robot arriving at the goal. The optimal policy is the one that arrives at the goal fastest, so that is what it will be optimized for, whether or not there is a timer. I am not sure what contribution the timer would even make to the policy, since the optimal output for each individual state would not change (as the reward is the same). I guess you could argue that the discounted future reward will be lower when you are close to the timeout, but again, I am not sure how that contributes to arriving at the goal.

In any case, these are just my observations and thoughts, so they are not necessarily true. The best way to go about it would be to include the timeout in the state, add the timeout done bool to the replay buffer, and see whether there is any benefit for the policy.
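
A hypothetical sketch of that experiment, assuming names like `laser_state` and `max_ep` (none of this is taken from the repository):

```python
import numpy as np

def build_state(laser_state, dist_to_goal, angle_to_goal, last_action,
                episode_timesteps, max_ep):
    """Append the normalized remaining time so that the same sensor readings
    close to a timeout can legitimately have a different value."""
    time_left = 1.0 - episode_timesteps / max_ep
    return np.concatenate(
        [laser_state, [dist_to_goal, angle_to_goal], last_action, [time_left]]
    )

state = build_state(np.ones(20), 2.5, 0.3, [0.4, -0.1], 450, 500)

# With the timer in the state, the timeout could also be stored as terminal:
# done_bool = float(done or episode_timesteps + 1 == max_ep)
```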

mincheulkim commented 1 year ago

Thanks for your kind answer. I have another question, about the actor and critic network architectures in TD3.

In the original TD3 code, the actor takes the state as input and the critic concatenates the state and the action, so the critic can evaluate the value of the action. But in your code, both the actor and the critic use the same input, which concatenates the state and the action, so it feels like the action is duplicated in the critic network. Have you tried implementing the code in the form of the original TD3, which does not include an action in the state?

reiniscimurs commented 1 year ago

The action in the state is the previously taken action and represents the current dynamics of the robot. Due to inertia, performing the same action can give a different outcome depending on whether the execution starts from a static position or the robot is already in motion. This is a common approach in the literature. I have trained networks without this information, and including it works better. In any case, if it did not, the neural network would simply learn to ignore it.
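
As a rough illustration (the dimensions and names here are placeholders, not the exact state used in this repository):

```python
import numpy as np

laser_state = np.ones(20)               # downsampled range readings (dummy values)
dist_to_goal, angle_to_goal = 2.5, 0.3  # polar coordinates of the goal
prev_action = [0.4, -0.1]               # last linear and angular velocity commands

# The previous action is appended so the network can account for the robot's
# current motion (inertia) when judging the outcome of the next action.
state = np.concatenate([laser_state, [dist_to_goal, angle_to_goal], prev_action])
```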

It is not quite true that the actor and critic have the same inputs. The actor takes only the state as input, whereas the critic takes the state and the newly calculated action as separate inputs. The action input skips the first dense layer and is introduced into the network later. Also, the action information in the state comes from the replay buffer, whereas the action used as the critic's input comes from the actor network.
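
Schematically, it looks something like this (layer sizes and names are placeholders, not the exact architecture in this repository):

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Maps the state (which already contains the previous action) to a new action."""
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 256), nn.ReLU(),
            nn.Linear(256, action_dim), nn.Tanh(),
        )

    def forward(self, state):
        return self.net(state)

class Critic(nn.Module):
    """Takes the state and the freshly computed action as separate inputs; the
    action skips the first dense layer and is merged into the network later."""
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.state_layer = nn.Linear(state_dim, 256)
        self.merge_layer = nn.Linear(256 + action_dim, 256)
        self.q_out = nn.Linear(256, 1)

    def forward(self, state, action):
        s = torch.relu(self.state_layer(state))                    # state only
        x = torch.relu(self.merge_layer(torch.cat([s, action], dim=1)))
        return self.q_out(x)
```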