run on a headless server

rosanom commented 1 year ago

Hi, thank you for releasing this cool project! My goal is to run the simulation and training on a headless server (an OS without a GUI). Could I face issues with the Gazebo simulator, or should I set a flag to make it work properly? Follow-up question: what do you think about running everything inside a Docker container? Do you have any tips on which image to use?

Thank you. Regards!

reiniscimurs commented 1 year ago

Hi, sure, you should be able to run it without a GUI as long as you have access to a terminal. The Gazebo GUI is already turned off by default and you are not required to run Rviz either. TensorBoard should show you the course of the training, and terminal outputs are also available. However, the KPIs that I have set up here might not always be representative of how well the model performs. I usually also try to judge the progress of the training visually. But if you add some additional KPIs, visual confirmation is not strictly necessary.
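As a rough illustration of what "additional KPIs" could look like on a headless machine, here is a minimal sketch that keeps rolling goal, collision and timeout rates over recent episodes. The class name, the outcome labels and the window size are all assumptions made for this example; none of them come from the repository.

```python
from collections import deque

OUTCOMES = ("goal", "collision", "timeout")  # assumed labels, not from the repo


class EpisodeStats:
    """Rolling outcome rates over the last `window` episodes (hypothetical helper)."""

    def __init__(self, window=100):
        # deque with maxlen silently drops the oldest episode once full
        self.outcomes = deque(maxlen=window)

    def record(self, outcome):
        if outcome not in OUTCOMES:
            raise ValueError(f"unknown outcome: {outcome}")
        self.outcomes.append(outcome)

    def rates(self):
        n = len(self.outcomes)
        if n == 0:
            return {name: 0.0 for name in OUTCOMES}
        return {name: sum(o == name for o in self.outcomes) / n for name in OUTCOMES}


stats = EpisodeStats(window=4)
for outcome in ["collision", "goal", "goal", "timeout"]:
    stats.record(outcome)
print(stats.rates())
```

If TensorBoard logging is already in place, each rate could be written with `SummaryWriter.add_scalar(tag, value, step)` so that it appears next to the existing curves.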

Sorry, I have never worked with ROS in a Docker setup, so I cannot help you with that aspect.

rosanom commented 1 year ago

I see, thank you for your answer. I'm trying to run the system in a Docker container with Ubuntu 20.04, ROS Noetic and CUDA. I will update the thread as soon as I get everything up.

On a first run, I get this output:

process[gazebo-1]: started with pid [48299]
process[urdf_spawner-2]: started with pid [48302]
process[robot_state_publisher-3]: started with pid [48305]
process[joint_state_publisher-4]: started with pid [48306]
ERROR: cannot launch node of type [rviz/rviz]: rviz
ROS path [0]=/opt/ros/noetic/share/ros
ROS path [1]=/home/mrosano/DRL-robot-navigation/catkin_ws/src/velodyne_simulator/velodyne_simulator
ROS path [2]=/home/mrosano/DRL-robot-navigation/catkin_ws/src/velodyne_simulator/velodyne_gazebo_plugins
ROS path [3]=/home/mrosano/DRL-robot-navigation/catkin_ws/src/velodyne_simulator/velodyne_description
ROS path [4]=/home/mrosano/DRL-robot-navigation/catkin_ws/src/multi_robot_scenario
ROS path [5]=/opt/ros/noetic/share
[INFO] [1672928601.180439, 0.000000]: Loading model XML from ros parameter robot_description
[INFO] [1672928601.186518, 0.000000]: Waiting for service /gazebo/spawn_urdf_model
[ INFO] [1672928601.328079499]: Finished loading Gazebo ROS API Plugin.
[ INFO] [1672928601.329107483]: waitForService: Service [/gazebo/set_physics_properties] has not been advertised, waiting...
[ INFO] [1672928601.554408391]: waitForService: Service [/gazebo/set_physics_properties] is now available.
[ INFO] [1672928601.568448960, 0.207000000]: Physics dynamic reconfigure ready.
[INFO] [1672928601.791623, 0.425000]: Calling service /gazebo/spawn_urdf_model
[ INFO] [1672928602.173754939, 0.544000000]: Laser Plugin: The 'robotNamespace' param was empty
[ INFO] [1672928602.173841720, 0.544000000]: Starting Laser Plugin (ns = r1)
[ INFO] [1672928602.175132927, 0.544000000]: Laser Plugin (ns = r1)  <tf_prefix_>, set to ""
[INFO] [1672928603.477350, 0.544000]: Spawn status: SpawnModel: Successfully spawned entity
[ INFO] [1672928603.478113623, 0.544000000]: Velodyne laser plugin missing <min_intensity>, defaults to no clipping
[ INFO] [1672928603.480406014, 0.544000000]: Velodyne laser plugin ready, 16 lasers
[ INFO] [1672928603.583988398, 0.544000000]: Starting plugin DiffDrive(ns = r1/)
[ INFO] [1672928603.584088024, 0.544000000]: DiffDrive(ns = r1/): <rosDebugLevel> = Debug
[ INFO] [1672928603.585072794, 0.544000000]: DiffDrive(ns = r1/): <tf_prefix> = 
...
...
[ WARN] [1672928603.585679791, 0.544000000]: GazeboRosDiffDrive Plugin (ns = ) missing <publishTf>, defaults to 1
[ INFO] [1672928603.586876196, 0.544000000]: DiffDrive(ns = r1/): Advertise joint_states
[ INFO] [1672928603.587973843, 0.544000000]: DiffDrive(ns = r1/): Try to subscribe to cmd_vel
[ INFO] [1672928603.592225340, 0.544000000]: DiffDrive(ns = r1/): Subscribe to cmd_vel
[ INFO] [1672928603.593221874, 0.544000000]: DiffDrive(ns = r1/): Advertise odom on odom 
[ INFO] [1672928603.598418598, 0.544000000]: GazeboRosJointStatePublisher is going to publish joint: chassis_swivel_joint
[ INFO] [1672928603.598452058, 0.544000000]: GazeboRosJointStatePublisher is going to publish joint: swivel_wheel_joint
[ INFO] [1672928603.598476248, 0.544000000]: GazeboRosJointStatePublisher is going to publish joint: left_hub_joint
[ INFO] [1672928603.598498880, 0.544000000]: GazeboRosJointStatePublisher is going to publish joint: right_hub_joint
[ INFO] [1672928603.598519501, 0.544000000]: Starting GazeboRosJointStatePublisher Plugin (ns = r1/)!, parent name: r1
[DEBUG] [1672928603.636296702, 0.565000000]: Trying to publish message of type [nav_msgs/Odometry/cd5e73d190d741a2f92e81eda573aca7] on a publisher with type [nav_msgs/Odometry/cd5e73d190d741a2f92e81eda573aca7]
[DEBUG] [1672928603.636364164, 0.565000000]: Trying to publish message of type [sensor_msgs/JointState/3066dcd76a6cfaef579bd0f34173e9fd] on a publisher with type [sensor_msgs/JointState/3066dcd76a6cfaef579bd0f34173e9fd]

[urdf_spawner-2] process has finished cleanly
log file: /root/.ros/log/809b0176-8d04-11ed-a8db-0242ac110003/urdf_spawner-2*.log
Validating
..............................................
Average Reward over 10 Evaluation Episodes, Epoch 1: -72.518953, 0.800000
..............................................

Is this the output I should expect? Is the "[urdf_spawner-2] process has finished cleanly" message normal behavior? I am monitoring the training in TensorBoard. For now, "Av.Q" keeps decreasing, "Max. Q" is increasing, and "Loss" has been stable at around 200 for the last 150 steps.

Best, Marco

reiniscimurs commented 1 year ago

Yes, all of that looks like it is working well and is what I would expect. The urdf_spawner message is related to spawning the robot in the Gazebo simulator, and that is the expected behavior.

If it has gone through a validation and no errors popped up in the terminal, that should be a good indicator that everything is up and running as it should.
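On a headless machine, one way to keep an eye on validation results is to pull the evaluation lines out of a saved terminal log. The sketch below is only an illustration: the regular expression is modeled on the single "Average Reward over ..." line quoted in this thread, the function name is made up, and the meaning of the second number on that line is not stated here, so it is kept as an unnamed value.

```python
import re

# Pattern modeled on the evaluation line quoted in this thread (an assumption
# about the format; adjust it if the real output differs).
EVAL_LINE = re.compile(
    r"Average Reward over (\d+) Evaluation Episodes, "
    r"Epoch (\d+): (-?\d+\.\d+), (-?\d+\.\d+)"
)


def parse_eval_lines(text):
    """Return a list of (epoch, avg_reward, second_value) tuples found in text."""
    results = []
    for match in EVAL_LINE.finditer(text):
        _episodes, epoch, reward, second = match.groups()
        results.append((int(epoch), float(reward), float(second)))
    return results


sample = "Average Reward over 10 Evaluation Episodes, Epoch 1: -72.518953, 0.800000"
print(parse_eval_lines(sample))
```

Feeding the whole log file through `parse_eval_lines` gives one tuple per validation run, which is enough to see whether the average reward is trending upward across epochs.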

rosanom commented 1 year ago

Update: I successfully launched the training process inside a Docker container with ROS Noetic, CUDA and all the required libraries, but unfortunately it diverges instead of converging to the optimal solution. I launched the script more than 5 times and trained the model for more than 150 epochs. The behavior is always the same: in TensorBoard the average Q value steadily decreases (-700 after 14k steps), the max Q value is really noisy, and the loss reaches very high values (10k after 14k steps). It seems something is not working properly. Do you have any idea what the issue could be?

reiniscimurs commented 1 year ago

It is good to hear that it worked in a headless state.

I guess this is where the issue would arise, as you cannot visually evaluate if everything is working alright.

Some questions that might point out some issues: Did you try changing the seed? Do you get any kind of errors or warnings in the terminal? What does the terminal say? Are the sensors working properly?
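For the seed question, a minimal sketch of seeding the usual Python-side random number generators is shown below. The helper name `set_seed` is hypothetical, and where the seed is actually defined in the training script is not shown in this thread. The simulator may still introduce its own nondeterminism, so two runs with the same seed are not guaranteed to be identical.

```python
import random


def set_seed(seed):
    """Seed every random number generator that is available (hypothetical helper)."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy is not installed; nothing to seed
    try:
        import torch
        torch.manual_seed(seed)
    except ImportError:
        pass  # PyTorch is not installed; nothing to seed


set_seed(1)
first = [random.random() for _ in range(3)]
set_seed(1)
second = [random.random() for _ in range(3)]
print(first == second)  # same seed, same draws from Python's `random`
```

Trying a handful of different seed values and comparing the resulting TensorBoard curves is a cheap way to tell an unlucky initialization from a setup problem.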

And can you visually see the tensorboard?

rosanom commented 1 year ago

Did you try changing the seed?

No, I did not. I'm trying now with a different seed.

Do you get any kind of errors or warnings in the terminal?

I attach the output of the terminal. I see there is an error related to the fact that rviz is not installed, but I guess it should not negatively affect the training process. I don't know if the following few lines represent an issue.

Roscore launched!                                                                                    
Unable to register with master node [http://localhost:11311]: master may not be running yet. Will keep trying.                                                                                            
... logging to /root/.ros/log/fcf857dc-9583-11ed-8eb4-0242ac110002/roslaunch-a40273904425-4796.log   
Checking log directory for disk usage. This may take a while.                                        
Press Ctrl-C to interrupt
Done checking log file disk usage. Usage is <1GB.                                                    

started roslaunch server http://localhost:46343/                                                     
ros_comm version 1.15.14                                                                             

SUMMARY                                                                                              
========                                                                                                                                                                                                  
PARAMETERS                                                                                           
 * /rosdistro: noetic                                                                                
 * /rosversion: 1.15.14                                                                              

NODES                                                                                                                                                                                                     
auto-starting new master                                                                             
process[master]: started with pid [4816]                                                             
ROS_MASTER_URI=http://localhost:11311/                                                                                                                                                                    
setting /run_id to fcf857dc-9583-11ed-8eb4-0242ac110002                                              
process[rosout-1]: started with pid [4826]                                                           
started core service [/rosout]                                                                       
Gazebo launched!                                                                                     
... logging to /root/.ros/log/fcf857dc-9583-11ed-8eb4-0242ac110002/roslaunch-a40273904425-4835.log   
Checking log directory for disk usage. This may take a while.                                        
Press Ctrl-C to interrupt                                                                            
Done checking log file disk usage. Usage is <1GB.
started roslaunch server http://localhost:44729/

SUMMARY                                                                                              
======== 
PARAMETERS                                                                                            
* /joint_state_publisher/publish_frequency: 30.0                                                    
 * /robot_description: <?xml version="1....                                                          
 * /robot_state_publisher/publish_frequency: 30.0                                                    
 * /rosdistro: noetic                                                                                
 * /rosversion: 1.15.14                                                                              
 * /use_sim_time: True                                                                               

NODES                                                                                                
  /                                                                                                  
    gazebo (gazebo_ros/gzserver)                                                                     
    joint_state_publisher (joint_state_publisher/joint_state_publisher)                              
    robot_state_publisher (robot_state_publisher/robot_state_publisher)                              
    rviz (rviz/rviz)                                                                                     
urdf_spawner (gazebo_ros/spawn_model)                                                            

ROS_MASTER_URI=http://localhost:11311/                                                                                                                                                                    
process[gazebo-1]: started with pid [4866]                                                          
 process[urdf_spawner-2]: started with pid [4869]                                                     
process[robot_state_publisher-3]: started with pid [4872]                                            
process[joint_state_publisher-4]: started with pid [4873]                                            

ERROR: cannot launch node of type [rviz/rviz]: rviz                                                  
ROS path [0]=/opt/ros/noetic/share/ros                                                               
ROS path [1]=/home/mrosano/DRL-robot-navigation/catkin_ws/src/velodyne_simulator/velodyne_simulator  
ROS path [2]=/home/mrosano/DRL-robot-navigation/catkin_ws/src/velodyne_simulator/velodyne_gazebo_plugins
ROS path [3]=/home/mrosano/DRL-robot-navigation/catkin_ws/src/velodyne_simulator/velodyne_description
ROS path [4]=/home/mrosano/DRL-robot-navigation/catkin_ws/src/multi_robot_scenario                   
ROS path [5]=/opt/ros/noetic/share      

[INFO] [1673862965.250758, 0.000000]: Loading model XML from ros parameter robot_description         
[INFO] [1673862965.259042, 0.000000]: Waiting for service /gazebo/spawn_urdf_model                   
[ INFO] [1673862965.383440500]: Finished loading Gazebo ROS API Plugin.                              
[ INFO] [1673862965.384553555]: waitForService: Service [/gazebo/set_physics_properties] has not been advertised, waiting...
[ INFO] [1673862965.592135178]: waitForService: Service [/gazebo/set_physics_properties] is now available.
[ INFO] [1673862965.613084633, 0.209000000]: Physics dynamic reconfigure ready.                      
[INFO] [1673862965.863569, 0.454000]: Calling service /gazebo/spawn_urdf_model                       
[ INFO] [1673862966.210820555, 0.545000000]: Laser Plugin: The 'robotNamespace' param was empty      
[ INFO] [1673862966.210888445, 0.545000000]: Starting Laser Plugin (ns = r1)                         
[ INFO] [1673862966.211839451, 0.545000000]: Laser Plugin (ns = r1)  <tf_prefix_>, set to ""         
[INFO] [1673862967.565143, 0.545000]: Spawn status: SpawnModel: Successfully spawned entity          
[ INFO] [1673862967.579997761, 0.545000000]: Velodyne laser plugin missing <min_intensity>, defaults to no clipping                                                                                       
[ INFO] [1673862967.583245729, 0.545000000]: Velodyne laser plugin ready, 16 lasers                  
[ INFO] [1673862967.671517444, 0.545000000]: Starting plugin DiffDrive(ns = r1/)                     
[ INFO] [1673862967.671604088, 0.545000000]: DiffDrive(ns = r1/): <rosDebugLevel> = Debug            
[ INFO] [1673862967.672485176, 0.545000000]: DiffDrive(ns = r1/): <tf_prefix> =                      
[DEBUG] [1673862967.672544354, 0.545000000]: DiffDrive(ns = r1/): <commandTopic> = cmd_vel           
[DEBUG] [1673862967.672557302, 0.545000000]: DiffDrive(ns = r1/): <odometryTopic> = odom
[DEBUG] [1673863146.737053553, 0.344000000]: DiffDrive(ns = r1/): <odometryFrame> = odom
[DEBUG] [1673863146.737063680, 0.344000000]: DiffDrive(ns = r1/): <robotBaseFrame> = base_link
[DEBUG] [1673863146.737101240, 0.344000000]: DiffDrive(ns = r1/): <publishWheelTF> = false
[ WARN] [1673863146.737124243, 0.344000000]: DiffDrive(ns = r1/): missing <publishOdomTF> default is true
[DEBUG] [1673863146.737138810, 0.344000000]: DiffDrive(ns = r1/): <publishWheelJointState> = true
[DEBUG] [1673863146.737182678, 0.344000000]: DiffDrive(ns = r1/): <wheelSeparation> = 0.29999999999999999
[DEBUG] [1673863146.737197167, 0.344000000]: DiffDrive(ns = r1/): <wheelDiameter> = 0.17999999999999999
[DEBUG] [1673863146.737207398, 0.344000000]: DiffDrive(ns = r1/): <wheelAcceleration> = 1.8
[DEBUG] [1673863146.737218340, 0.344000000]: DiffDrive(ns = r1/): <wheelTorque> = 20
[DEBUG] [1673863146.737228880, 0.344000000]: DiffDrive(ns = r1/): <updateRate> = 50
[DEBUG] [1673863146.737276799, 0.344000000]: DiffDrive(ns = r1/): <odometrySource> = world := 1
[DEBUG] [1673863146.737318045, 0.344000000]: DiffDrive(ns = r1/): <leftJoint> = left_hub_joint
[DEBUG] [1673863146.737333769, 0.344000000]: DiffDrive(ns = r1/): <rightJoint> = right_hub_joint
[ WARN] [1673863146.737352065, 0.344000000]: GazeboRosDiffDrive Plugin (ns = ) missing <publishTf>, defaults to 1
[ INFO] [1673863146.738562359, 0.344000000]: DiffDrive(ns = r1/): Advertise joint_states
[ INFO] [1673863146.739513416, 0.344000000]: DiffDrive(ns = r1/): Try to subscribe to cmd_vel
[ INFO] [1673863146.743004514, 0.344000000]: DiffDrive(ns = r1/): Subscribe to cmd_vel
[ INFO] [1673863146.743855019, 0.344000000]: DiffDrive(ns = r1/): Advertise odom on odom 
[ INFO] [1673863146.748275301, 0.344000000]: GazeboRosJointStatePublisher is going to publish joint: chassis_swivel_joint
[ INFO] [1673863146.748296304, 0.344000000]: GazeboRosJointStatePublisher is going to publish joint: swivel_wheel_joint
[ INFO] [1673863146.748305153, 0.344000000]: GazeboRosJointStatePublisher is going to publish joint: left_hub_joint
[ INFO] [1673863146.748313570, 0.344000000]: GazeboRosJointStatePublisher is going to publish joint: right_hub_joint
[ INFO] [1673863146.748323532, 0.344000000]: Starting GazeboRosJointStatePublisher Plugin (ns = r1/)!, parent name: r1
[DEBUG] [1673863146.786297382, 0.365000000]: Trying to publish message of type [nav_msgs/Odometry/cd5e73d190d741a2f92e81eda573aca7] on a publisher with type [nav_msgs/Odometry/cd5e73d190d741a2f92e81eda573aca7]
[DEBUG] [1673863146.786354082, 0.365000000]: Trying to publish message of type [sensor_msgs/JointState/3066dcd76a6cfaef579bd0f34173e9fd] on a publisher with type [sensor_msgs/JointState/3066dcd76a6cfaef579bd0f34173e9fd]
[urdf_spawner-2] process has finished cleanly
log file: /root/.ros/log/67fc554c-9584-11ed-aaf8-0242ac110002/urdf_spawner-2*.log

Are the sensors working properly?

I'm not sure if they are working properly. Do you see any issue in the attached log?

And can you visually see the tensorboard?

Yes, I can. I launched tensorboard and I can browse the exposed url from another pc in the same network.

reiniscimurs commented 1 year ago

The rviz issue shouldn't pose a problem but you can turn off the call to the node by deleting or commenting out the following line: https://github.com/reiniscimurs/DRL-robot-navigation/blob/943186fb7f1890700ce215951e92d5cb92031d14/TD3/assets/multi_robot_scenario.launch#L14
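If editing the launch file by hand is inconvenient, for example when building a container image, the same change can be scripted. The sketch below removes every `<node>` element whose `pkg` attribute is `rviz` from a launch XML string. The example launch snippet is hypothetical, since the contents of the real file are not quoted in this thread, and note that `xml.etree.ElementTree` drops XML comments when it re-serializes the document.

```python
import xml.etree.ElementTree as ET


def strip_rviz_nodes(launch_xml):
    """Return launch XML text with every <node pkg="rviz" ...> element removed."""
    root = ET.fromstring(launch_xml)
    # Materialize the element list first so removals do not disturb iteration.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag == "node" and child.get("pkg") == "rviz":
                parent.remove(child)
    return ET.tostring(root, encoding="unicode")


# Hypothetical launch snippet, used only to exercise the function.
example = """<launch>
  <node pkg="rviz" type="rviz" name="rviz" args="-d example.rviz"/>
  <node pkg="robot_state_publisher" type="robot_state_publisher" name="rsp"/>
</launch>"""

print(strip_rviz_nodes(example))
```

Commenting the line out by hand, as suggested above, is simpler for a one-off change; a script like this is mainly useful when the setup has to be reproducible.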

I will try to look into this when I have a computer at hand.

adeldennaoui commented 1 year ago

Hi, all. I would like to ask if there have been any updates on the divergence issue Marco was facing. I am facing a similar issue. The robot does not converge to the optimal solution: the average Q decreases while the loss oscillates. I ran the training for around 500-600 steps over multiple hours. I am using a CPU (not CUDA) in a VM; I don't know if that's an issue. Everything else seems to be fine. RViz launches when the training starts. When I run 'gzclient', I get the Gazebo simulation with no errors. The only warnings I get are 'tf_repeated_data ignoring data with redundant timestamp for frame ...', but I don't think they are related to the training behavior.

EDIT: It is definitely an issue related to the CPU. Whenever I run the training script, the CPU becomes completely saturated. I think the only practical solution would be to make a Linux partition on my computer.

reiniscimurs commented 1 year ago

Hi,

The usual solution is to try to use different seeds and see what works best for you. Another way is to manipulate the bootstrapping distances in the simulation. See if that helps.

tf_repeated_data usually means that something is being published twice. This is not the expected behavior. I suggest checking the rqt tree to see if there are any anomalies. Also, please write down your setup, the installed branch of the repo and any other information that you could provide.

rosanom commented 1 year ago

I still have not solved the issue. Is there any more information I can attach, apart from the log, to help understand where the issue could be?

adeldennaoui commented 1 year ago

My setup:

ROS Noetic
Ubuntu 20.04
CPU: the VM uses around 12GB RAM and 4 cores (I'm not using Nvidia CUDA)
I installed the main branch (ROS Noetic + Pioneer robot, the one you talked about in the Medium articles)
PyTorch 1.9.1+cpu
TensorBoard 2.11.2

I don't know if that's the information you're asking about. Thank you for pointing out the tf_repeated_data issue; it could be a reason for the bad behavior. Any ideas how to solve it?

I tried changing the seed number and changing different things in the code (the exploration, the max_action to slow the robot down when it's heading towards an obstacle, increasing the batch size to the default 100, allowing the robot to go backward and letting the sensor see 360 degrees, increasing the laser_state_dim to 40, etc.), but nothing is actually working. The behavior improved in some trials, but most of the time the robot just heads straight for the obstacle.

What do you mean by manipulating 'the bootstrapping distances in the simulation'?

adeldennaoui commented 1 year ago

Screenshot (395)

That's the rqt_graph

reiniscimurs commented 1 year ago

I installed from scratch and trained the model over the weekend on a local system, and it seems to work just fine in my setup. So I will rule out a bug in the repository itself.

@adeldennaoui The ROS topic graph looks fine here. What I would be more interested to see is the rqt tf tree. You can obtain it in rqt with plugins->visualization->tf tree. The log output to terminal is also something that could show some issues, so please post that as well.

@rosanom Your terminal log output looks just fine. The only thing I notice is that the urdf_spawner (gazebo_ros/spawn_model) is without an indent (unlike other nodes). I assume this is just a formatting error when pasting it into the comment? Are you also not using CUDA as the training device?

So far the common things seem to be the use of a "remote" system and perhaps using cpu for training. I will try running the training without cuda and see what that shows.

adeldennaoui commented 1 year ago

Here is the tf tree. I don't know if anything's off about it (I took this screenshot before the occurrence of the tf_repeated_data error; I'll update it after it happens in case the tf tree changes):

Screenshot (397)

I can confirm that, besides the use of a remote system (a VM for me), I am training on the CPU, so it would be interesting to know if the training works for you on a CPU. In any case, I'll install the Nvidia drivers and the CUDA toolkit on the VM, see if that solves the problem, and keep you updated. Thank you!

adeldennaoui commented 1 year ago

An update after the occurrence of the tf_repeated_data warning:

Screenshot (398)

reiniscimurs commented 1 year ago

I tested the training on the CPU over multiple runs and it seems to be more unstable, at least with the default seed.

[Image: train]

However, it failed only in one of those runs, and 2 of them showed obstacle avoidance behavior from early on. I only ran the training for about 3 hours in each case, just to see the trend, and not till convergence.

I also noticed that @adeldennaoui mentioned training for only 600 steps and a couple of hours. This is not enough to evaluate the convergence of the model. Often the model will start showing the desired behavior somewhere between 20 and 40 epochs and converge at around 100 epochs. So if the robot is not rotating on the spot (which indicates that no learning is happening whatsoever), it might just not have had enough time to learn. I would suggest training the model for at least 20 epochs, then checking whether some obstacle avoidance and navigation behavior can be seen, and only then making a decision. If the robot just spins around and does nothing, simply restart the training. If at least some behavior is seen, just continue the training.

@adeldennaoui The TF issue might be caused by a synchronization problem. Are you using any method to speed up the training? If not, is your simulator capable of running in real time in your VM?

adeldennaoui commented 1 year ago

Screenshot (400)

An update: I started the training process at around 2pm and it's now circa 9:30am where I live (more than 19 hours since the training started). The robot has completed 34 epochs, and the picture shows that the training seems to be successful: the average Q is starting to increase, the reward during validation is starting to increase and become positive (it was only negative up until around the 25th epoch), and the behavior of the robot has improved drastically, despite the tf warnings. I am very happy about that, but it is unfortunate that it took my VM that much time (I used the default seed too).

reiniscimurs commented 1 year ago

@adeldennaoui It is good to see that it actually works after the number of epochs I would expect. However, 19 hours for 34 epochs is a really long time. I suspect that the VM you are using simply does not have enough resources to run the Gazebo simulation in real time, so there is a significant slowdown. If you look at your wall time, it is around 19 hours, whereas the actual ROS execution time is below 6 hours. This is also suggested by the TF errors, as the publishing frequencies do not align. You might want to check how well ROS runs on your VM and try to improve the situation there.
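The slowdown described above can be expressed as a real-time factor, meaning simulated time divided by wall-clock time. The sketch below just does that arithmetic with the figures from this thread; the function name is made up for the example. On a live system with `/use_sim_time` enabled, the two inputs could be obtained by sampling `rospy.get_time()` and `time.time()` at the start and end of an interval, but that part is not shown here.

```python
def real_time_factor(sim_seconds, wall_seconds):
    """Ratio of simulated time to wall-clock time; 1.0 means real time."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return sim_seconds / wall_seconds


# Figures from the thread: just under 6 h of ROS time in about 19 h of wall time.
rtf = real_time_factor(6 * 3600, 19 * 3600)
print(f"real-time factor is at most about {rtf:.2f}")
```

A value this far below 1.0 means the simulation runs at roughly a third of real-time speed, which is consistent with 34 epochs taking 19 hours.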

reiniscimurs / DRL-robot-navigation

run on a headless server #42