reiniscimurs / DRL-robot-navigation

Deep Reinforcement Learning for mobile robot navigation in ROS Gazebo simulator. Using Twin Delayed Deep Deterministic Policy Gradient (TD3) neural network, a robot learns to navigate to a random goal point in a simulated environment while avoiding obstacles.
MIT License

Which reward function is used? #2

Closed AgentEXPL closed 2 years ago

AgentEXPL commented 2 years ago

Different ways of generating the reward are provided in velodyne_td3.py. The following one is used in the end and the others are commented out. I noticed that this way of generating the reward is different from the reward function given in the paper.


    # reward = act[0]*0.7-abs(act[1])
    # r1 = 1 - 2 * math.sqrt(abs(beta2 / np.pi))
    # r2 = self.distOld - Dist
    r3 = lambda x: 1 - x if x < 1 else 0.0
    # rl = 0
    # for r in range(len(laser_state[0])):
    #    rl += r3(laser_state[0][r])
    # reward = 0.8 * r1 + 30 * r2 + act[0]/2 - abs(act[1])/2 - r3(min(laser_state[0]))/2
    reward = act[0] / 2 - abs(act[1]) / 2 - r3(min(laser_state[0])) / 2
    # reward = 30 * r2 + act[0] / 2 - abs(act[1]) / 2  # - r3(min(laser_state[0]))/2
    # reward = 0.8 * r1 + 30 * r2

I guess all these ways have been tested by you. Could you tell me which way of generating reward achieves the best performance?

reiniscimurs commented 2 years ago

That depends on how you define performance. For the implementation in the paper, act[0] - abs(act[1]) was used, as it is the simplest one to work with and does not allow the robot to get stuck rotating on the spot.

The other ones are there more so that people can test them out and see how they perform. Some I found in other papers and some are just experiments. The r3 term is introduced just to force the robot further away from obstacles for safer navigation, especially in dynamic environments, as people tend not to like the robot taking evasive action close to them.

AgentEXPL commented 2 years ago

Thanks a lot for your explanation. The performance I want is for the agent to reach the goal along the shortest smooth path. It would be greatly appreciated if you could give some suggestions on designing a reward with this aim.

By the way, is the equation "r_{t−i} = r(s_{t−i}, a_{t−i}) + r_g/i" from the paper implemented in the code?

reiniscimurs commented 2 years ago

> By the way, is the equation "r_{t−i} = r(s_{t−i}, a_{t−i}) + r_g/i" from the paper implemented in the code?

Not in this version. I took it out to simplify the implementation, but it works by storing the tuple (s, a, r, s', t) in a deque of fixed length. After each step, we check whether the episode has terminated. If not, we store the oldest tuple from the deque in the replay buffer and add the current tuple to the deque. If it has terminated, we update all the r values in the deque and push them all to the replay buffer.
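
A minimal sketch of that deque mechanism, assuming a simple replay_buffer.add(s, a, r, s', done) interface and a fixed deque length of 10; neither of these is taken from this repository, and how i is counted back from the terminal step is my own assumption:

    from collections import deque

    DEQUE_LEN = 10  # assumed length, not a value from the repo
    recent = deque(maxlen=DEQUE_LEN)

    def store_transition(replay_buffer, state, action, reward, next_state, done, r_g=0.0):
        if not done:
            # Episode continues: if the deque is full, commit its oldest
            # transition unchanged, then put the current one on hold.
            if len(recent) == recent.maxlen:
                replay_buffer.add(*recent.popleft())
            recent.append((state, action, reward, next_state, done))
        else:
            # Episode terminated with terminal reward r_g: update the held
            # rewards as r_{t-i} = r(s_{t-i}, a_{t-i}) + r_g / i, where i
            # counts steps back from termination, then flush everything.
            for i, (s, a, r, s2, d) in enumerate(reversed(recent), start=1):
                replay_buffer.add(s, a, r + r_g / i, s2, d)
            # Store the terminal transition itself with the full terminal reward.
            replay_buffer.add(state, action, reward + r_g, next_state, done)
            recent.clear()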

Originally this repository was not meant to be a companion for the paper, but rather a simple way to start working with DRL in the ROS simulator, so there might be some discrepancies.

reiniscimurs commented 2 years ago

> The performance I want is for the agent to reach the goal along the shortest smooth path. It would be greatly appreciated if you could give some suggestions on designing a reward with this aim.

So this is a tough question to answer, and one I would like an answer to myself. Designing a proper immediate reward function that fulfills the task we actually want is really difficult, as it is hard to predict what kind of unwanted behaviors we are encouraging with it. That is why a lot of researchers choose to use Inverse Reinforcement Learning instead. But I can give some observations.

Reward only terminal states: By this I mean setting the immediate reward r = 0. Theoretically, if the actions the robot takes have led it to the goal, the Bellman equation should "backpropagate" the positive values of these states the next time we encounter the same state. The issue is that there is a much larger chance of crashing into an obstacle than of arriving at the goal, so most of the time states will be evaluated negatively. What happens is that the robot chooses to just rotate on the spot instead of moving around, as getting a total sum of 0 over the episode is better than any negative value it thinks it will get by moving around. So the robot needs to be forced to explore, or some sort of bootstrapping or experience weighting is required. Theoretically though, if that can be solved, the robot should take the shortest possible path to the goal, as those states would have the highest backpropagated values. In practice, however, this did not work for me.
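
A minimal sketch of what such a terminal-only reward could look like; the magnitudes here are illustrative assumptions of mine, not values from this repository:

    def terminal_only_reward(reached_goal, collided,
                             goal_reward=100.0, collision_penalty=-100.0):
        # Only terminal states carry a reward; every intermediate step is 0,
        # so all learning signal has to be backpropagated from the episode end.
        if reached_goal:
            return goal_reward
        if collided:
            return collision_penalty
        return 0.0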

Reward reducing distance: r = d - d', where d is the distance to the goal in the previous time step and d' is the distance to the goal in the current time step. This rewards the robot for moving closer to the goal at every step: the closer you get, the higher the reward. However, if you are familiar with the "carrot planner" from the ROS navigation package, you might see that there is a similar problem here. Mainly, it may guide the robot into a local optimum that it cannot get out of. If there is a "pocket" in front of the robot, it will navigate into it, as that is closer to the goal and the immediate reward for it is higher. Once it hits the dead end, it cannot move backward, as that gives a negative immediate reward, so it never learns how to get out of these situations. To solve that, the robot needs to realize early enough that it is heading into a dead end, but that can be very tricky with a limited laser range and without global information. Maybe it could be solved with some LSTM layers, but in this simple form the problem is tricky to solve.
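
A minimal sketch of this distance-based term; the scale factor mirrors the 30 * r2 term in the snippet quoted above (with r2 = self.distOld - Dist), but wrapping it as a standalone function is my own framing:

    def distance_reward(dist_old, dist_new, scale=30.0):
        # r = d - d': positive when the robot moved closer to the goal
        # since the previous step, negative when it moved away.
        return scale * (dist_old - dist_new)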

Reward moving forward: r = v - w, where v is the linear velocity and w is the rotational velocity. This is my favorite method and the one that works best for me. The idea behind it is that the robot needs to realize that it should be moving around and not just sitting at a single spot. By setting a positive reward for linear motion, the robot first learns that moving forward is good and rotating is not. Even though it crashes a lot in the beginning, the episode reward is still higher than just sitting in one spot and rotating. Soon it learns that taking a turn near obstacles is more beneficial than crashing; turning might still be a negative action, but it is less negative than crashing. With this, the robot soon learns to avoid obstacles with smooth motion, as the smaller the rotation, the smaller the penalty for turning. While running around the environment the robot will randomly end up reaching goals and in time realize that the benefit of reaching the goal outweighs the penalty for turning. Even if it ends up in a pocket, the robot will still know that forward motion gives it a positive reward and will thus unintentionally look for a way out of the pocket. However, this method does not give you the shortest possible path to the goal, as the robot will prefer to keep some linear motion while turning, since that gives a better reward. That means the robot will take a larger turn just to turn around, if possible, which increases the total path length. But it generally gives very smooth motion. The size of the turn can be reduced by giving a smaller penalty for turning.
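
This corresponds to the active reward line in the snippet quoted above; written out as a standalone function (with variable names of my own choosing) it might look like this:

    def motion_reward(lin_vel, ang_vel, min_laser, obstacle_margin=1.0):
        # r = v - w: reward forward motion, penalize rotation.
        # The proximity term (r3 in the quoted snippet) additionally pushes
        # the robot away from obstacles closer than obstacle_margin meters.
        proximity_penalty = obstacle_margin - min_laser if min_laser < obstacle_margin else 0.0
        return lin_vel / 2 - abs(ang_vel) / 2 - proximity_penalty / 2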

Most likely, the best reward is a combination of a bunch of different functions, and more research needs to be done in this field. I have always thought it would be a fun research paper to write, directly comparing different immediate reward methods.

AgentEXPL commented 2 years ago

This is a really impressive answer, and the reasoning behind it is of great help. It gives real guidance on how to design a good reward function for a mobile robot. Many thanks!