And just to show it: example hex dump of a datagram containing the serialised version of a control_msgs/action/FollowJointTrajectory::Feedback
message. Below it is the dissected version Wireshark shows.
Note: this is just the XRCE-DDS WRITE_DATA
serialized payload data, which should be what RMW FastRTPS also "sees" IIUC. It does not include the XRCE-DDS header(s).
All fields related to multi_dof are either 0, [empty] or "" as expected.
Some questions:
- Could you show the same as in the previous message (hex dump and dissection) of the message that produces the error?
I can't be sure it's the exact message that causes the crash. I do know it's the only FollowJointTrajectory::Feedback message in the third Wireshark capture which contains __new_size in hex (ie: 0xEA260045).
As I write later (in response to your question): the complete msg consists of two fragments.
These are the complete XRCE-DDS protocol payloads, so not just the serialized_data
I showed in https://github.com/ros2/rmw_fastrtps/issues/644#issuecomment-1313954347.
First fragment (appears first in capture, 550 bytes on-the-wire, 508 bytes XRCE-DDS protocol data):
Note EA 26 00 45 at 0x000C.
Second (and last) fragment (74 bytes on-the-wire, 32 bytes XRCE-DDS protocol data):
This is the complete message, so reassembled by the dissector (including XRCE-DDS fields):
Note EA26 in the request_id field, and 0045 in the object_id field.
- What are the sizes of the serialized message
The complete message consists of two fragments (probably why you ask about the MTU below).
Wireshark tells me:

- submessageLength for the first fragment
- submessageLength for the second fragment
- submessageLength for the complete message (ie: reassembled)

See also the dissection tree I include above.
- your XRCE-DDS buffers (MTU)

Set to UCLIENT_UDP_TRANSPORT_MTU=512 bytes.

- the number of XRCE buffers?

Set to RMW_UXRCE_STREAM_HISTORY=160.
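(As an aside: given that MTU, the two-fragment split of the ~540 bytes of XRCE-DDS protocol data is expected. A trivial sketch of the arithmetic, ignoring exact per-fragment header overheads, which could only push the count up:)

```c++
#include <cstdint>
#include <iostream>

int main()
{
  const uint32_t mtu = 512;         // UCLIENT_UDP_TRANSPORT_MTU
  const uint32_t total = 508 + 32;  // XRCE-DDS protocol data of both fragments (from the capture)

  // Ceiling division: the minimum number of MTU-sized chunks needed.
  // Real fragmentation also reserves room for per-fragment headers,
  // which can only increase this number.
  const uint32_t min_fragments = (total + mtu - 1) / mtu;
  std::cout << "minimum fragments: " << min_fragments << "\n";  // prints 2

  return 0;
}
```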
- Which transport are you using at the micro-ROS level?
UDPv4.
- Are you using a multithreaded micro-ROS app?
Multi-tasked (same thing on this platform), but: yes.
Oh and perhaps important/interesting to mention: the client doesn't have a feedback callback registered.
Given the fact the backtraces seem to point towards deserialisation failures, I'm not sure that's relevant/affects anything, but I thought I'd mention it in any case.
RMW_UXRCE_STREAM_HISTORY must be a power of two, so 160 is not a valid value. Maybe this is not very explicit at the RMW level (we will fix it), but it is in the XRCE-DDS documentation: look for uxr_create_output_reliable_stream here.

Probably this is why your XRCE headers are being mixed with the payload.
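(Illustration only, not taken from the Micro XRCE-DDS sources: one common reason a history depth must be a power of two is that buffer slots are indexed by masking the sequence number; with a non-power-of-two value like 160 the mask no longer behaves like a modulo, so slots get reused out of order and buffer contents can be interleaved:)

```c++
#include <cstdint>
#include <iostream>

// Hypothetical ring-buffer indexing via bitmask: only valid when
// `history` is a power of two.
uint32_t slot_index(uint32_t seq_num, uint32_t history)
{
  return seq_num & (history - 1);
}

int main()
{
  // With history = 128 the mask is equivalent to seq % 128.
  // With history = 160 it is NOT equivalent to seq % 160: seq 160 lands in
  // slot 128 instead of slot 0, so slots are overwritten out of order --
  // one way headers could end up mixed in with payload data.
  for (uint32_t seq = 158; seq <= 162; ++seq) {
    std::cout << "seq " << seq
              << " -> slot(history=128): " << slot_index(seq, 128)
              << ", slot(history=160): " << slot_index(seq, 160) << "\n";
  }
  return 0;
}
```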
Also, I guess that you have micro-ROS multithread flags enabled, right?
- RMW_UXRCE_STREAM_HISTORY must be a power of two
Ok. That's something I'd missed apparently.
- Maybe this is not very explicit at the RMW level (we will fix it)
yeah, I just checked and couldn't find it.
- it is in the XRCE-DDS documentation: look for uxr_create_output_reliable_stream here.
True.
However, as a user of micro-ROS, I don't call that function anywhere, nor do I use the Client API directly myself. I only interact with the rmw_microxrcedds
API. So I hadn't seen it.
Would it perhaps be good to both check this (and error out) in the RMW, and to update the docs for these options in the CMakeLists.txt of rmw_microxrcedds_c?
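(A sketch of what such a guard could look like; whether it lives in the CMake logic or the sources is up to the maintainers, and only the RMW_UXRCE_STREAM_HISTORY name is taken from the actual build options:)

```c++
// Normally provided by the build system; defined here only so the
// sketch is self-contained.
#ifndef RMW_UXRCE_STREAM_HISTORY
#define RMW_UXRCE_STREAM_HISTORY 128
#endif

// Reject non-power-of-two history depths at compile time instead of
// silently misbehaving at runtime.
static_assert(
  RMW_UXRCE_STREAM_HISTORY > 0 &&
  (RMW_UXRCE_STREAM_HISTORY & (RMW_UXRCE_STREAM_HISTORY - 1)) == 0,
  "RMW_UXRCE_STREAM_HISTORY must be a power of two");

int main() { return 0; }
```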
- I guess that you have micro-ROS multithread flags enabled, right?
Yes. UCLIENT_PROFILE_MULTITHREAD=ON.
Edit: we've also been using this value for about 6 months, and had not encountered any problems.
I'll test a build with a power-of-two value for RMW_UXRCE_STREAM_HISTORY
and report back.
And some bounds checking on deserialised values might also be nice, but I guess that's hard to do: in some cases a vector with 1157681529
elements will probably be perfectly OK.
(but this is not something specific to this RMW, nor micro-ROS)
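(For what such a bounds check could look like, a rough sketch: compare the declared element count against the bytes actually left in the buffer before calling resize(..). The remaining-bytes figure is hard-coded here purely so the example is self-contained; a real deserialiser would take it from the CDR buffer state:)

```c++
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <stdexcept>
#include <vector>

// Sketch: validate a declared sequence length against the remaining payload
// before resizing. `min_element_size` is the smallest possible serialized
// size of one element.
template <typename T>
void checked_resize(std::vector<T> & out, uint32_t declared_size,
                    size_t remaining_bytes, size_t min_element_size)
{
  if (min_element_size == 0 ||
      declared_size > remaining_bytes / min_element_size)
  {
    // A count like 1157637866 can never fit in a few hundred bytes of
    // payload, so fail the deserialisation instead of attempting the resize.
    throw std::runtime_error("declared sequence length exceeds remaining payload");
  }
  out.resize(declared_size);
}

int main()
{
  std::vector<double> transforms;
  try {
    // 540 bytes left, each element at least 8 bytes: 1157637866 is rejected.
    checked_resize(transforms, 1157637866u, 540, 8);
  } catch (const std::exception & e) {
    std::cout << "rejected: " << e.what() << "\n";
  }
  checked_resize(transforms, 0u, 540, 8);  // an empty sequence is fine
  std::cout << "size after valid resize: " << transforms.size() << "\n";
  return 0;
}
```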
Yes, I agree that this should be documented somewhere, but it is not common for micro-ROS users to modify it and we have not had the bandwidth to take care of it.
I can PR the CMake options docs. The Client would take me longer.
- I can PR the CMake options docs. The Client would take me longer.
Appreciate it
Closing as the problem has not come back after changing STREAM_HISTORY
to be a power-of-two.
Thanks for the fast response @pablogs9, even if this in the end was not an issue with rmw_fastrtps_cpp.
In any case, IMO this behavior of rmw_fastrtps_cpp (or the type support) should be fixed.
Bug report
tl;dr:
rmw_fastrtps_cpp::TypeSupport::deserializeROSmessage(..) appears to attempt to resize(..) the MultiDOFJointTrajectoryPoint::transforms field to an unexpectedly large value, suggesting either serialisation errors on the rmw_microxrcedds side, or state/memory corruption or alignment issues on the rmw_fastrtps side.

'Evidence' gathered from Wireshark captures suggests serialisation errors are unlikely, as hex dumps/dissections of (XRCE-)DDS traffic appear to be as-they-should-be.
Sometimes, the crash occurs after a few messages. Other times, it takes 50 minutes or more before the crash happens.
Original: let me start by saying I was unsure as to where to post this: either this issue tracker (ie: rmw_fastrtps) or the rmw_microxrcedds one. In the end, as I only observe the crashes in rmw_fastrtps, I decided to post here.

I realise the fact rmw_microxrcedds is involved means the problem could also be on that side, but my debugging thus far appears to suggest that's not the case (see discussion below).

Required Info:
- apt (ie: binary)
- rmw_microxrcedds on action server side
- rclcpp (client side)

Steps to reproduce issue
I'd prefer sharing an MRE as well, but don't have one at the moment. In addition, the crashes have been hard to reproduce.
The best description I can provide:
- a FollowJointTrajectory action server
- a FollowJointTrajectory action client (registering a feedback callback is not required)

Expected behavior
Client can submit an "infinite" number of goals.
Server processes those goals while publishing feedback to the client.
Actual behavior
Client crashes after a "random time", completely shutting down the node containing the action client:
The server does not appear to be affected. After restart of the client, everything continues as normal.
Additional information
I have several backtraces of the crashes (some with debug symbols). I've included them at the end of this post.
Some observations from these traces:
- the crash always happens in rmw_fastrtps_cpp::TypeSupport::deserializeROSmessage(..) and eventually cdr_deserialize(..) et al.
- the crashing line is rosidl_typesupport_fastrtps_cpp/trajectory_msgs/msg/detail/dds_fastrtps/multi_dof_joint_trajectory_point__type_support.cpp:179. This is generated code. I've included the code around that line below, after the GDB traces.
- the generated code attempts to resize(..) the MultiDOFJointTrajectoryPoint::transforms field to a "ridiculously" large value (1157637866 for instance in the third crash, frame #14)

Some more information about the server: the action server does not populate any of the multi_dof_* fields (these). It does initialise them to either [], "" or 0. The transforms field/list is completely empty, as are all the others.

I have Wireshark captures of all of the messages exchanged between the server and the client, and -- for the feedback messages I've checked -- none of those appear to contain any "garbage" or unexpected data.
I've used jseldent/wireshark-dds-xrce to dissect the data coming from the server, combined with gavanderhoorn/micro_xrce_dds_ros_idl_dissectors to dissect the actual ROS 2 message data.
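(For reference, and one reason the captures look clean: a CDR sequence is serialized as a uint32 element count followed by the elements, which matches the `cdr >> cdrSize` read in the generated code quoted at the end of this report. An empty transforms field should therefore contribute exactly four zero bytes. A small standalone illustration; the payload bytes below are made up, not copied from the capture:)

```c++
#include <cstdint>
#include <cstring>
#include <iostream>

int main()
{
  // Made-up payload excerpt, positioned at the start of the `transforms`
  // sequence: an empty sequence is just a zero length prefix.
  const uint8_t payload[] = {0x00, 0x00, 0x00, 0x00 /*, ...next member... */};

  // This mirrors the `cdr >> cdrSize` read (assuming the encapsulation
  // endianness matches the host).
  uint32_t transforms_length = 0;
  std::memcpy(&transforms_length, payload, sizeof(transforms_length));

  std::cout << "transforms length on the wire: " << transforms_length << "\n";  // 0
  // A value like 1157637866 at this position would mean the length is being
  // read from the wrong offset (or from corrupted data), not that the server
  // serialized a huge sequence.
  return 0;
}
```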
As it starts to look like the RMW deserialiser is somehow deserialising corrupt/incorrect data, I tried to look for the values for __new_size mentioned in the backtraces in the Wireshark captures as binary data. I noticed the following:

| # | __new_size | in hex | where? |
|---|------------|--------|--------|
| 1 | 1157681529 | 0x79D10045 | FollowJointTrajectory_Feedback message, request_id + object_id fields |
| 2 | 1157646959 | 0x6F4A0045 | FollowJointTrajectory_Feedback message, request_id + object_id fields |
| 3 | 1157637866 | 0xEA260045 | FollowJointTrajectory_Feedback message, request_id + object_id fields |
| 4 | 1668008540 | 0x5CCA6B63 | |
All of the values could be found in the network traffic, but in 'unexpected' places. Sometimes at the start of a payload, sometimes even before the actual ROS message payload, sometimes at a 'random' location in the serialised ROS message payload.
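(To make one of those matches concrete: interpreting the four request_id + object_id bytes of entry 3 as a single little-endian uint32 reproduces the __new_size from the third crash exactly. Only the byte values come from the capture; the rest is just a standalone check:)

```c++
#include <cstdint>
#include <cstring>
#include <iostream>

int main()
{
  // Bytes as they appear on the wire for entry 3 above:
  // request_id (0xEA, 0x26) followed by object_id (0x00, 0x45).
  const uint8_t wire_bytes[4] = {0xEA, 0x26, 0x00, 0x45};

  // Read them as one 32-bit integer the way a little-endian (x86-64) client
  // would if it mistook them for a sequence length.
  uint32_t as_length = 0;
  std::memcpy(&as_length, wire_bytes, sizeof(as_length));

  std::cout << as_length << "\n";  // prints 1157637866 -- the __new_size of crash 3
  return 0;
}
```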
The request_id + object_id fields mentioned in the where? column are micro-ROS specific (actually: XRCE-DDS specific) packet fields, which always occur before the serialised ROS message payload.

The last value appears to be a Unix Epoch and converts to Wednesday, November 9, 2022 4:42:20 PM GMT+01:00. This actually corresponds to the day the Wireshark capture was recorded.

I'm hoping this is sufficient information to at least trigger some people here and start a discussion. As the crashes are not easily reproduced, debugging has been difficult. Individually, all steps investigated (ie: network traffic, (de)serialisation, etc) appear to work correctly.
Perhaps accumulated internal state of the RMW causes a desync/memory corruption somewhere?
GDB backtraces:
Crash 1
```gdb
Thread 1 "my_node" received signal SIGABRT, Aborted.
__pthread_kill_implementation (no_tid=0, signo=6, threadid=140737339898176) at ./nptl/pthread_kill.c:44
44      ./nptl/pthread_kill.c: No such file or directory.
(gdb) where
#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=140737339898176) at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (signo=6, threadid=140737339898176) at ./nptl/pthread_kill.c:78
#2  __GI___pthread_kill (threadid=140737339898176, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3  0x00007ffff7642476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4  0x00007ffff76287f3 in __GI_abort () at ./stdlib/abort.c:79
#5  0x00007ffff7aa2bbe in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x00007ffff7aae24c in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#7  0x00007ffff7aae2b7 in std::terminate() () from /lib/x86_64-linux-gnu/libstdc++.so.6
#8  0x00007ffff7aae518 in __cxa_throw () from /lib/x86_64-linux-gnu/libstdc++.so.6
#9  0x00007ffff7aa27cc in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#10 0x00007ffff664b6ae in __gnu_cxx::new_allocator
```

Crash 2
```gdb
[Switching to Thread 0x7fffd2ffd640 (LWP 831209)]
__pthread_kill_implementation (no_tid=0, signo=6, threadid=140736733369920) at ./nptl/pthread_kill.c:44
44      ./nptl/pthread_kill.c: No such file or directory.
(gdb) where
#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=140736733369920) at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (signo=6, threadid=140736733369920) at ./nptl/pthread_kill.c:78
#2  __GI___pthread_kill (threadid=140736733369920, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3  0x00007ffff7642476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4  0x00007ffff76287f3 in __GI_abort () at ./stdlib/abort.c:79
#5  0x00007ffff7aa2bbe in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x00007ffff7aae24c in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#7  0x00007ffff7aae2b7 in std::terminate() () from /lib/x86_64-linux-gnu/libstdc++.so.6
#8  0x00007ffff7aae518 in __cxa_throw () from /lib/x86_64-linux-gnu/libstdc++.so.6
#9  0x00007ffff7aa27cc in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#10 0x00007ffff664b6ae in __gnu_cxx::new_allocator
```

Crash 3
```gdb
Thread 14 "my_node" received signal SIGABRT, Aborted.
[Switching to Thread 0x7fffd3fff640 (LWP 855301)]
__pthread_kill_implementation (no_tid=0, signo=6, threadid=140736750155328) at ./nptl/pthread_kill.c:44
44      ./nptl/pthread_kill.c: No such file or directory.
(gdb) where
#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=140736750155328) at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (signo=6, threadid=140736750155328) at ./nptl/pthread_kill.c:78
#2  __GI___pthread_kill (threadid=140736750155328, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3  0x00007ffff7642476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4  0x00007ffff76287f3 in __GI_abort () at ./stdlib/abort.c:79
#5  0x00007ffff7aa2bbe in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x00007ffff7aae24c in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#7  0x00007ffff7aae2b7 in std::terminate() () from /lib/x86_64-linux-gnu/libstdc++.so.6
#8  0x00007ffff7aae518 in __cxa_throw () from /lib/x86_64-linux-gnu/libstdc++.so.6
#9  0x00007ffff7aa27cc in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#10 0x00007ffff664b6ae in __gnu_cxx::new_allocator
```

Crash 4
```gdb
Thread 15 "my_node" received signal SIGABRT, Aborted.
[Switching to Thread 0x7fffa3ffe640 (LWP 858961)]
__pthread_kill_implementation (no_tid=0, signo=6, threadid=140735944844864) at ./nptl/pthread_kill.c:44
44      ./nptl/pthread_kill.c: No such file or directory.
(gdb) where
#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=140735944844864) at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (signo=6, threadid=140735944844864) at ./nptl/pthread_kill.c:78
#2  __GI___pthread_kill (threadid=140735944844864, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3  0x00007ffff7642476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4  0x00007ffff76287f3 in __GI_abort () at ./stdlib/abort.c:79
#5  0x00007ffff7aa2bbe in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x00007ffff7aae24c in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#7  0x00007ffff7aae2b7 in std::terminate() () from /lib/x86_64-linux-gnu/libstdc++.so.6
#8  0x00007ffff7aae518 in __cxa_throw () from /lib/x86_64-linux-gnu/libstdc++.so.6
#9  0x00007ffff7aa27cc in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#10 0x00007ffff664b6ae in __gnu_cxx::new_allocator
```

Crashing code
```c++
bool
ROSIDL_TYPESUPPORT_FASTRTPS_CPP_PUBLIC_trajectory_msgs
cdr_deserialize(
  eprosima::fastcdr::Cdr & cdr,
  trajectory_msgs::msg::MultiDOFJointTrajectoryPoint & ros_message)
{
  // Member: transforms
  {
    uint32_t cdrSize;
    cdr >> cdrSize;
    size_t size = static_cast
```