Add Image information to TD3 training

reiniscimurs / DRL-robot-navigation

Deep Reinforcement Learning for mobile robot navigation in ROS Gazebo simulator. Using Twin Delayed Deep Deterministic Policy Gradient (TD3) neural network, a robot learns to navigate to a random goal point in a simulated environment while avoiding obstacles.

MIT License

488 stars 98 forks source link

Add Image information to TD3 training #14

Closed Barry2333 closed 1 year ago

Barry2333 commented 2 years ago

Hi, Reinis Cimurs:

Thank you for supporting this excellent project.

Have you ever tried to add image(front camera) information as one of the states in your TD3 network?

Do you think adding more information about the surrounding environment will improve the navigation performance? Could you give me some hints on how to improve the performance of your project?

I would appreciate your reply. Thank you.

reiniscimurs commented 2 years ago

Hi.

As a quick note, I have done previous work with Depth camera information as a state representation in a very similar setting and described some issues related to it in my paper here: https://www.mdpi.com/2079-9292/9/3/411

Of course, the answer is going to depend on what exactly you want to accomplish with your network and it can be a viable solution to include camera data in state representation, but there are some issues that come along with it.

Generalizability. If you want to use an RGB image as input, you will require some sort of CNN architecture to parse this data. Then a motion decision would be made based on the latent features from the CNN output. The problem with this is that it requires a large amount of realistic simulation scenarios to consider in order to not just learn the environment, but actually generalized responses to generic situations. If learned in simulation, there will be a huge sim2real gap for images if you plan to use it on a real robot afterwards. This is not that difficult for laser based data, as lasers just output intensity, that can be simulated fairly easily. But for RGB data, such things as texture, lighting conditions, shapes of objects, intrinsic camera parameters etc. all are going to affect the performance of the CNN. For such training, Gazebo simulator is not viable and you would need to move on to Unity or other simulators with more realistic modeling. One solution would be to train your CNN on a dataset (NYU or others) for classification, then use this pretrained network in simulation to learn the motion policy. However, you will still need a lot more varied simulation in order to cover that sim2real gap. This current pipeline is simply not made for that, but it would be interesting to see the approach in action so I do not want to discourage you.
Resources. If you look at the used model here, you will notice that the amount of parameters in the network is relatively small. And for the actual deployment of the network it is even smaller than that (since you only require the actor network to make motion decision). That is because we only use a single vector representing the laser data. Introducing RGB images (or even just depth images for that matter) significantly increases the dimensionality of your input and the size of your network. More parameters = more resources. This was developed as more of a cheap/light weight solution for motion policies. We were able to deploy this on I5 CPU mini PCs without any graphics card support. RGBs of course would require a lot more resources than that.
FOV. As mentioned in the aforementioned article, field of view is a bit of a problem for cameras as the usual 90 degree viewing angle alone is just not wide enough for robot motion. You would require some sort of sensor fusion/calibration to achieve reliable tasks. All solvable, but requires time, engineering, effort already in simulation, not to mention when dealing with sim2real transfer.

So while it might bring some benefit and some sort of solution could probably be found, RGBs might be too difficult to work with to get any meaningful result in this pipeline. Instead, perhaps depth images would be a better solution as they do not care how realistic the simulation is, rather how varied the environment and shapes in it are.

As for improvements, there are a couple of "low hanging fruits" that I can think of.

Trying to find a better reward function
More complex/advance architecture (more hidden layers/dropout layers/skips etc)
More time-steps in the input (currently uses only single time step/ the bagging of laser data might also be improved)
Using LSTM or GRU layers in the network (currently at each step the network does not care about what it has seen previously and it "forgets" its previous motions. Some sort of history should prove to be beneficial)

Personally, I would be interested in seeing what could be achieved with introducing some sort of history either through LSTMs or previous states in the input. In this case, dynamic obstacles could also be introduced in the simulation.

Let me know if you have any questions, I will try to give more detailed answers if I can.

Barry2333 commented 2 years ago

@reiniscimurs Thanks for your detailed information and suggestions.

I would like to first train the network by integrating a pre-trained CNN model and then trying LSTM to see what will happen.

Have a nice day! Cheers!

Barry2333 commented 2 years ago

Hi, Reinis Cimurs:

Have you ever met this problem before? It looks like receiving '/r1/odom' and '/r1/front_laser/scan' have something wrong.

There was fine before, but when I tried to run your code today. This problem happened.

Please give me some suggestions. Thanks! Looking forward to your reply! :)

BTW, I have already tried to use none conda environment for running. But this issue still exists.

screenshot

reiniscimurs commented 2 years ago

Hi,

Sorry for delayed reply. Did you manage to fix this?

In this case, the error appears because an object is expected but in reality None is returned. This might be because some process has already closed the connection.

What does your rqt graph look like when this error pops up?
Are you closing the previous run properly? (Stopping the previous run by executing Ctr+c and then in a new terminal window executing killall -9 rosout roslaunch rosmaster gzserver nodelet robot_state_publisher gzclient python python3

Barry2333 commented 2 years ago

Hi, Reinis Cimurs:

I have already fixed this problem. For clarity, I used Ubuntu 18.04 system with ROS melodic. In velodyne_env.py:

For handling laser range and Odom data. The way you have done was:

<Melodic branch/TD3/veloddyne_env.py>: In step() function: you used rospy.wait_for_message function data = rospy.wait_for_message('/r1/front_laser/scan', LaserScan, timeout=0.5) dataOdom = rospy.wait_for_message('/r1/odom', Odometry, timeout=0.5)

When I look at the <Noetic branch/TD3/veloddyne_env.py>: [1] In the env initiation step: {create subscribers}: self.laser = rospy.Subscriber('/r1/front_laser/scan', LaserScan, self.laser_callback, queue_size=1) self.odom = rospy.Subscriber('/r1/odom', Odometry, self.odom_callback, queue_size=1) {define data variables}: self.last_laser = None self.last_odom = None [2] In step(): Directly obtain data from self.last_laser and self.last_odom, because there exists a callback function in those two subscribers self.laser_callback and self.odom_callback, when the simulation runs, those two callbacks will update self.last_laser and self.last_odom data in a certain frequency.

When I used the second method, which is to define two subscribers but not use the rospy.wait_for_message function. The problem was just solved. From the error report on the terminal, I think the error comes from the python threading problem. when I searched on StackOverflow and based on my understanding this problem was because the laser data and Odom data receiving were not synchronized. rospy.wait_for_message will wait until the message comes. When using rospy.wait_for_message in sequence will make data received in a sequential way. But if we crease a subscriber, data will be updated in a pre-defined frequency so as to avoid the asynchronous problem, I am not sure whether this is correct.

BTW Reinis, I got some questions and hope you can give me some information.

Q[1] When I looked at your actor structure. You considered the actions as one of the states. I was curious why actions can also be considered as states. Because in my understanding, actions are not considered a state.

Q[2] Have you ever considered the distance between the current position and target position as a penalty, also the timestep per episode as a penalty. If you have done so, was it better than your current reward setting?

Q[3] Have you tried different sizes of layers? I noticed that the training time is quite long and I was wondering if shrinking the layer size will be better?

Thank you for your reading and have a nice day! Cheers!

reiniscimurs commented 2 years ago

Hi,

Yes, the subscriber method is preferable over the wait_for_message type. If you get the chance, feel free to push a PR to change it in melodic branch.'

Essentially, state is whatever information you deem necessary to describe the scene that the agent is in at the current time step. It does not need to be only the physical states of the surrounding environment and can combine all sorts of information about the external, internal, previous and future states of the system and so on. In our implementation, we wanted to give the network not only the state itself but also the information how it arrived to it. Also, you can see that the reward function in our case is highly dependent on the linear and angular velocities, so we speculate that giving the action taken as a state information allows the network to differentiate why a reward is different in different instances even though the laser and goal information appears exactly the same.
I touch upon this a bit in my answer here: https://github.com/reiniscimurs/DRL-robot-navigation/issues/2#issuecomment-962382003 If you use distance to the goal as penalty or as a reward, you run into issues with navigating out of pockets of local optimum. It probably could be used as a metric for a reward function along with others, but does not seem to be sufficient on its own. The timestep per episode will require you to add this time information into your state representation, or else the network will not be able to learn. The network training is done by using a batch of independent samples and calculating loss over them. There is no sequentiality in this information, nor time information. Essentially the time information becomes somewhat irrelevant by using the Bellman equation. This will ensure that by selecting an action that gets you to the goal quicker will be evaluated higher, thus allowing to reach the goal faster. I touch upon this a bit in the answer that I linked above.
Trying different sizes and combinations of layers is something that could bring benefit for the performance of the policy, but it will not significantly speed up the current training pipeline. The largest amount of time is spent on collecting the trajectories in the simulation and the actual training happens almost instantly (at least on my PC). The way that the training works is as follows:
- Run simulation with the current policy and save each (s,a,r,t,s') tuple in the replay buffer until the robot crashes, reaches a goal or has taken maximum number of steps (500 steps in this repo)
- Randomly sample a batch of (s,a,r,t,s') tuples from the whole replay buffer
- Evaluate loss on the batch and calculate gradients
- Update the network

This is done sequentially and training of the network (steps 2-4) does not happen while an episode is running. This happens right after crash/goal/running out of steps and before resetting the simulation. The longest time is spent performing the part in bold. This is because the simulation runs in real time. One way to speed up the training would be to speed up the simulation, but back when I was looking into it, I was not able to do it in a manner that does not break it eventually in one way or another. Perhaps, some updates have been made and it is possible now. I do not know. Perhaps speedup could be achieved by augmenting the data in a way that makes sense and running more update iterations. Lowering the number of parameters in layers (600 and 400 parameters works quite well as well) can help a bit, but not as much as one might wish.

Barry2333 commented 2 years ago

Wow! Thank you for your impressive response.

I will push a PR. Thank you again for your time 👍 !

reiniscimurs commented 2 years ago

Btw, here is a method of how to run the simulation faster: https://answers.gazebosim.org//question/6970/can-i-run-gazebo-faster-than-real-time/

You would just need to increase the rate in https://github.com/reiniscimurs/DRL-robot-navigation/blob/be811a4050dfcb5c800a0a0e4000be81d48cfbc5/catkin_ws/src/multi_robot_scenario/launch/TD3.world#L69 in the .world file that you will use (TD3.world in this case). Use it at your own discretion, because it used to break my simulation. I will test it on my setup in Noetic and see if any issues come up, but this is not guaranteed to work.

Barry2333 commented 2 years ago

Good, that is worth a try! I will increase the update_rate a little bit to see what will happen. Thanks! :)

reiniscimurs commented 1 year ago

Closing due to inactivity