open-rmf / free_fleet

A free fleet management system.
Apache License 2.0

Server crashes when client is launched #131

Closed siddux closed 1 year ago

siddux commented 1 year ago

I am trying the turtlebot example, but the server crashes every time I launch a client instance, with the following error:

[free_fleet_server_ros2-1] terminate called after throwing an instance of 'std::logic_error'
[free_fleet_server_ros2-1] what(): basic_string::_M_construct null not valid
[ERROR] [free_fleet_server_ros2-1]: process has died [pid 15554, exit code -6, cmd '/home/erius/ff_ws/install/free_fleet_server_ros2/lib/free_fleet_server_ros2/free_fleet_server_ros2 --ros-args -r __node:=turtlebot3_fleet_server_node --params-file /tmp/launch_params_l716rfh6 --params-file /tmp/launch_params_smzkzt_i --params-file /tmp/launch_params_dl3mtr3s --params-file /tmp/launch_params__oqa7rql --params-file /tmp/launch_params_11ipip43 --params-file /tmp/launch_params__g4f4107 --params-file /tmp/launch_params_g29kzvkv --params-file /tmp/launch_params_ddlx904y --params-file /tmp/launch_params_fi4cm0nm --params-file /tmp/launch_params_kjeirlq2 --params-file /tmp/launch_params_obxmqcmx --params-file /tmp/launch_params__am9z1ge --params-file /tmp/launch_params_etzm3_3x --params-file /tmp/launch_params_sy7g1cxg --params-file /tmp/launch_params_6afvothn --params-file /tmp/launch_params_0lg_tq4j'].

I am running both the server and the client on a computer with Ubuntu 20.04. The server is running under ROS2 Galactic, while the client is running on ROS Noetic. I've also tried to run the ROS2 version of the client, but it doesn't start the simulation due to an error related to TFs.

I've also tried running the server and the client on different computers connected through a VPN. In that case, the server (running on Ubuntu 22.04 and ROS2 Humble) detects the client and registers it, but it does not publish the /fleet_states topic, and I can't send any commands using the provided Python scripts.

siddux commented 1 year ago

After spending some time debugging the code, I've found the error. The update callback of the ff server calls bool Server::ServerImpl::read_robot_states(std::vector<messages::RobotState>& _new_robot_states). This function iterates over the robot states received from the subscriber. In my case, I had only 1 client, but the vector had 3 elements. As a result, the 2nd and 3rd elements were null pointers, and the crash occurs when the subsequent lines of code try to access them.

As a quick fix, I've added a few lines to avoid accessing null pointers at this point, but it's not an elegant solution, and I'm not sure whether it could cause problems in other parts of the package. It would probably be better to find out why the vector has more elements than there are clients.
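A minimal sketch of the kind of null guard described above, in C++. The types RawRobotState and RobotState and the function collect_robot_states are hypothetical stand-ins invented for illustration; they are not the actual free_fleet or DDS types, and this is not the code that was added to the server.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a raw sample handed over by the subscriber.
struct RawRobotState {
  const char* name;  // may be null if the sample was never filled in
};

// Hypothetical stand-in for messages::RobotState.
struct RobotState {
  std::string name;
};

// Copies only the usable samples into `out` and returns how many were skipped.
std::size_t collect_robot_states(
    const std::vector<const RawRobotState*>& incoming,
    std::vector<RobotState>& out)
{
  out.clear();
  std::size_t skipped = 0;
  for (const RawRobotState* raw : incoming) {
    // Skip null samples and samples whose string field is null,
    // instead of dereferencing them.
    if (raw == nullptr || raw->name == nullptr) {
      ++skipped;
      continue;
    }
    out.push_back(RobotState{raw->name});
  }
  // Every incoming sample was either copied or skipped.
  assert(out.size() + skipped == incoming.size());
  return skipped;
}
```

With one real client and three slots in the incoming vector, a guard like this would copy one state and report two skipped entries. It hides the symptom only; it does not explain why the extra entries exist.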

aaronchongth commented 1 year ago

Hello @siddux! Looking at

[free_fleet_server_ros2-1]   what():  basic_string::_M_construct null not valid
[ERROR] [free_fleet_server_ros2-1]: process has died [pid 15554, exit code -6

I have a hunch that a string is not initialized or filled in the robot state message. Could you run the server with gdb and provide a stack trace of where this happens?

siddux commented 1 year ago

I'll try during the day. Since it happens when running only one client, do you know what information should be stored in the other elements of the vector?

aaronchongth commented 1 year ago

Since it is a vector of statically typed message structures, I am quite confident we are still dealing with messages::RobotState, possibly just older messages in the DDS reader. (edit: which would go away once the server starts handling messages)

Since there is no way for me to repeat your exact experiment, it might be best if you could investigate the contents of those elements. Does this happen consistently? I have a hunch the error occurs when we try to populate a std::string from an empty (null) char* coming from DDS land, which results in a failed construction. It would be best to check that all the fields in the configs are filled in properly on the client side, so that the messages are created properly for publishing.
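To illustrate the hunch above: with libstdc++, constructing a std::string from a null char* throws std::logic_error with the message seen in the log. A minimal C++ sketch of a defensive conversion follows. The helper name to_string_or_empty is hypothetical and not part of free_fleet.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: converts a C string that may be null into a
// std::string, falling back to an empty string instead of throwing.
std::string to_string_or_empty(const char* raw)
{
  return raw != nullptr ? std::string(raw) : std::string();
}

// Small usage example covering both branches.
void example_usage()
{
  const char* missing = nullptr;
  assert(to_string_or_empty(missing).empty());
  assert(to_string_or_empty("tb3_0") == "tb3_0");
}
```

Whether an empty string is an acceptable fallback depends on how the server uses the field. Rejecting the message may be the better choice for identifiers such as a robot name.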

siddux commented 1 year ago

After debugging the whole server implementation I've found where the error comes from. In the DDSSubscribeHandler.hpp file every message received by the subscriber is checked in order to only store and pass to the server the valid messages.

When it receives a valid message, there is no problem. The problem arises with non-valid messages: sometimes, even though the message is not valid, the valid_data variable holds an integer value (even though it is declared as a boolean). Below is a debugger screenshot:

[Screenshot: debug3]

I've opened an issue on the cyclonedds repo, as I don't know whether it is a bug in the DDS or whether I'm misunderstanding how it is implemented. For now I've found a solution, though it's not the most elegant one. It just consists of changing:

if (infos[i].valid_data)

to:

if (infos[i].valid_data == true)

I can create a PR with that solution if you consider it's an appropriate form of solving it.
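A small C++ sketch of why the two conditions can behave differently. It assumes the flag ends up holding a byte other than 0 or 1, and models it as a std::uint8_t, because reading a real bool with such a value is undefined behaviour in C++. SampleInfo here is a hypothetical stand-in, not the actual CycloneDDS type.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a sample-info record. The flag is modelled as a
// raw byte so that values other than 0 and 1 can be represented safely.
struct SampleInfo {
  std::uint8_t valid_data;
};

// Truthiness check, as in `if (infos[i].valid_data)`: any non-zero byte passes.
bool passes_truthiness_check(const SampleInfo& info)
{
  return info.valid_data != 0;
}

// Strict check, as in `if (infos[i].valid_data == true)`: both sides are
// promoted to int, so only the value 1 passes.
bool passes_strict_check(const SampleInfo& info)
{
  return info.valid_data == true;
}

// Small usage example with an arbitrary non-boolean byte value.
void example_usage()
{
  const SampleInfo odd{42};
  assert(passes_truthiness_check(odd));
  assert(!passes_strict_check(odd));
}
```

Under this assumption, the strict comparison filters out samples whose flag is neither 0 nor 1, which would be consistent with the crash disappearing after the change.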

aaronchongth commented 1 year ago

Hmm, this is an interesting finding, thanks for the debug information. It's been a long time since we've made any changes to the implementation, and it doesn't look like this was caused by any broken APIs either.

> I can create a PR with that solution if you consider it's an appropriate form of solving it.

That would be greatly appreciated!

aaronchongth commented 1 year ago

Closing via https://github.com/open-rmf/free_fleet/pull/132