And just to show it: example hex dump of a datagram containing the serialised version of a control_msgs/action/FollowJointTrajectory::Feedback
message. Below it is the dissected version Wireshark shows.
Note: this is just the XRCE-DDS WRITE_DATA
serialized payload data, which should be what RMW FastRTPS also "sees" IIUC. It does not include the XRCE-DDS header(s).
All fields related to multi_dof are either 0, [empty] or "" as expected.
Some questions:
- Could you show the same as in the previous message (hex dump and dissection) of the message that produces the error?
I can't be sure it's the exact message that causes the crash. I do know it's the only FollowJointTrajectory::Feedback message in the third Wireshark capture which contains __new_size in hex (ie: 0xEA260045).
As I write later (in response to your question): the complete msg consists of two fragments.
These are the complete XRCE-DDS protocol payloads, so not just the serialized_data
I showed in https://github.com/ros2/rmw_fastrtps/issues/644#issuecomment-1313954347.
First fragment (appears first in capture, 550 bytes on-the-wire, 508 bytes XRCE-DDS protocol data):
Note EA 26 00 45 at 0x000C.
Second (and last) fragment (74 bytes on-the-wire, 32 bytes XRCE-DDS protocol data):
This is the complete message, so reassembled by the dissector (including XRCE-DDS fields):
Note EA26 in the request_id field, and 0045 in the object_id field.
- What are the sizes of the serialized message
The complete message consists of two fragments (probably why you ask about the MTU below).
Wireshark tells me:

- submessageLength for the first fragment
- submessageLength for the second fragment
- submessageLength for the complete message (ie: reassembled)

See also the dissection tree I include above.
- your XRCE-DDS buffers (MTU)

Set to UCLIENT_UDP_TRANSPORT_MTU=512 bytes.

- the number of XRCE buffers?

Set to RMW_UXRCE_STREAM_HISTORY=160.
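(As an aside: given that MTU, the two-fragment split of the ~540 bytes of XRCE-DDS protocol data is expected. A trivial sketch of the arithmetic, ignoring exact per-fragment header overheads, which could only push the count up:)

```c++
#include <cstdint>
#include <iostream>

int main()
{
  const uint32_t mtu = 512;         // UCLIENT_UDP_TRANSPORT_MTU
  const uint32_t total = 508 + 32;  // XRCE-DDS protocol data of both fragments (from the capture)

  // Ceiling division: the minimum number of MTU-sized chunks needed.
  // Real fragmentation also reserves room for per-fragment headers,
  // which can only increase this number.
  const uint32_t min_fragments = (total + mtu - 1) / mtu;
  std::cout << "minimum fragments: " << min_fragments << "\n";  // prints 2

  return 0;
}
```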
- Which transport are you using at the micro-ROS level?
UDPv4.
- Are you using a multithreaded micro-ROS app?
Multi-tasked (same thing on this platform), but: yes.
Oh and perhaps important/interesting to mention: the client doesn't have a feedback callback registered.
Given the fact the backtraces seem to point towards deserialisation failures, I'm not sure that's relevant/affects anything, but I thought I'd mention it in any case.
RMW_UXRCE_STREAM_HISTORY must be a power of two, so 160 is not a valid value. Maybe this is not very explicit at the RMW level (we will fix it), but it is in the XRCE-DDS documentation: look for uxr_create_output_reliable_stream here.

Probably this is why your XRCE headers are being mixed with the payload.
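(Illustration only, not taken from the Micro XRCE-DDS sources: one common reason a history depth must be a power of two is that buffer slots are indexed by masking the sequence number; with a non-power-of-two value like 160 the mask no longer behaves like a modulo, so slots get reused out of order and buffer contents can be interleaved:)

```c++
#include <cstdint>
#include <iostream>

// Hypothetical ring-buffer indexing via bitmask: only valid when
// `history` is a power of two.
uint32_t slot_index(uint32_t seq_num, uint32_t history)
{
  return seq_num & (history - 1);
}

int main()
{
  // With history = 128 the mask is equivalent to seq % 128.
  // With history = 160 it is NOT equivalent to seq % 160: seq 160 lands in
  // slot 128 instead of slot 0, so slots are overwritten out of order --
  // one way headers could end up mixed in with payload data.
  for (uint32_t seq = 158; seq <= 162; ++seq) {
    std::cout << "seq " << seq
              << " -> slot(history=128): " << slot_index(seq, 128)
              << ", slot(history=160): " << slot_index(seq, 160) << "\n";
  }
  return 0;
}
```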
Also, I guess that you have micro-ROS multithread flags enabled, right?
- RMW_UXRCE_STREAM_HISTORY must be a power of two
Ok. That's something I'd missed apparently.
- Maybe this is not very explicit at the RMW level (we will fix it)
yeah, I just checked and couldn't find it.
- it is in the XRCE-DDS documentation: look for uxr_create_output_reliable_stream here.
True.
However, as a user of micro-ROS, I don't call that function anywhere, nor do I use the Client API directly myself. I only interact with the rmw_microxrcedds
API. So I hadn't seen it.
Would it perhaps be good to both check this (and error out) in the RMW, and to update the docs for these options in the CMakeLists.txt of rmw_microxrcedds_c?
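(A sketch of what such a guard could look like; whether it lives in the CMake logic or the sources is up to the maintainers, and only the RMW_UXRCE_STREAM_HISTORY name is taken from the actual build options:)

```c++
// Normally provided by the build system; defined here only so the
// sketch is self-contained.
#ifndef RMW_UXRCE_STREAM_HISTORY
#define RMW_UXRCE_STREAM_HISTORY 128
#endif

// Reject non-power-of-two history depths at compile time instead of
// silently misbehaving at runtime.
static_assert(
  RMW_UXRCE_STREAM_HISTORY > 0 &&
  (RMW_UXRCE_STREAM_HISTORY & (RMW_UXRCE_STREAM_HISTORY - 1)) == 0,
  "RMW_UXRCE_STREAM_HISTORY must be a power of two");

int main() { return 0; }
```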
- I guess that you have micro-ROS multithread flags enabled, right?
Yes. UCLIENT_PROFILE_MULTITHREAD=ON.
Edit: we've also been using this value for about 6 months, and had not encountered any problems.
I'll test a build with a power-of-two value for RMW_UXRCE_STREAM_HISTORY
and report back.
And some bounds checking on deserialised values might also be nice, but I guess that's hard to do: in some cases a vector with 1157681529
elements will probably be perfectly OK.
(but this is not something specific to this RMW, nor micro-ROS)
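(For what such a bounds check could look like, a rough sketch: compare the declared element count against the bytes actually left in the buffer before calling resize(..). The remaining-bytes figure is hard-coded here purely so the example is self-contained; a real deserialiser would take it from the CDR buffer state:)

```c++
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <stdexcept>
#include <vector>

// Sketch: validate a declared sequence length against the remaining payload
// before resizing. `min_element_size` is the smallest possible serialized
// size of one element.
template <typename T>
void checked_resize(std::vector<T> & out, uint32_t declared_size,
                    size_t remaining_bytes, size_t min_element_size)
{
  if (min_element_size == 0 ||
      declared_size > remaining_bytes / min_element_size)
  {
    // A count like 1157637866 can never fit in a few hundred bytes of
    // payload, so fail the deserialisation instead of attempting the resize.
    throw std::runtime_error("declared sequence length exceeds remaining payload");
  }
  out.resize(declared_size);
}

int main()
{
  std::vector<double> transforms;
  try {
    // 540 bytes left, each element at least 8 bytes: 1157637866 is rejected.
    checked_resize(transforms, 1157637866u, 540, 8);
  } catch (const std::exception & e) {
    std::cout << "rejected: " << e.what() << "\n";
  }
  checked_resize(transforms, 0u, 540, 8);  // an empty sequence is fine
  std::cout << "size after valid resize: " << transforms.size() << "\n";
  return 0;
}
```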
Yes, I agree that this should be documented somewhere, but it is not common for micro-ROS users to modify it and we have not had the bandwidth to take care of it.
I can PR the CMake options docs. The Client would take me longer.
- I can PR the CMake options docs. The Client would take me longer.
Appreciate it
Closing as the problem has not come back after changing STREAM_HISTORY
to be a power-of-two.
Thanks for the fast response @pablogs9, even if this in the end was not an issue with rmw_fastrtps_cpp.
In any case, IMO this behavior of rmw_fastrtps_cpp (or the type support) should be fixed.
Bug report
tl;dr:
rmw_fastrtps_cpp::TypeSupport::deserializeROSmessage(..) appears to attempt to resize(..) the MultiDOFJointTrajectoryPoint::transforms field to an unexpectedly large value, suggesting either serialisation errors on the rmw_microxrcedds side, or state/memory corruption or alignment issues on the rmw_fastrtps side.

'Evidence' gathered from Wireshark captures suggests serialisation errors are unlikely, as hex dumps/dissections of (XRCE-)DDS traffic appear to be as-they-should-be.
Sometimes, the crash occurs after a few messages. Other times, it takes 50 minutes or more before the crash happens.
Original: let me start by saying I was unsure as to where to post this: either this issue tracker (ie: rmw_fastrtps) or the rmw_microxrcedds one. In the end, as I only observe the crashes in rmw_fastrtps, I decided to post here.

I realise the fact rmw_microxrcedds is involved means the problem could also be on that side, but my debugging thus far appears to suggest that's not the case (see discussion below).

Required Info:
- apt (ie: binary)
- rmw_microxrcedds on action server side
- rclcpp (client side)

Steps to reproduce issue
I'd prefer sharing an MRE as well, but don't have one at the moment. In addition, the crashes have been hard to reproduce.
The best description I can provide:
- a FollowJointTrajectory action server
- a FollowJointTrajectory action client (registering a feedback callback is not required)

Expected behavior
Client can submit an "infinite" number of goals.
Server processes those goals while publishing feedback to the client.
Actual behavior
Client crashes after a "random time", completely shutting down the node containing the action client:
The server does not appear to be affected. After restart of the client, everything continues as normal.
Additional information
I have several backtraces of the crashes (some with debug symbols). I've included them at the end of this post.
Some observations from these traces:
- the crash always happens in rmw_fastrtps_cpp::TypeSupport::deserializeROSmessage(..) and eventually cdr_deserialize(..) et al.
- the crashing line is rosidl_typesupport_fastrtps_cpp/trajectory_msgs/msg/detail/dds_fastrtps/multi_dof_joint_trajectory_point__type_support.cpp:179. This is generated code. I've included the code around that line below, after the GDB traces.
- the generated code attempts to resize(..) the MultiDOFJointTrajectoryPoint::transforms field to a "ridiculously" large value (1157637866 for instance in the third crash, frame #14)

Some more information about the server: the action server does not populate any of the multi_dof_* fields (these). It does initialise them to either [], "" or 0. The transforms field/list is completely empty, as are all the others.

I have Wireshark captures of all of the messages exchanged between the server and the client, and -- for the feedback messages I've checked -- none of those appear to contain any "garbage" or unexpected data.
I've used jseldent/wireshark-dds-xrce to dissect the data coming from the server, combined with gavanderhoorn/micro_xrce_dds_ros_idl_dissectors to dissect the actual ROS 2 message data.
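(For reference, and one reason the captures look clean: a CDR sequence is serialized as a uint32 element count followed by the elements, which matches the `cdr >> cdrSize` read in the generated code quoted at the end of this report. An empty transforms field should therefore contribute exactly four zero bytes. A small standalone illustration; the payload bytes below are made up, not copied from the capture:)

```c++
#include <cstdint>
#include <cstring>
#include <iostream>

int main()
{
  // Made-up payload excerpt, positioned at the start of the `transforms`
  // sequence: an empty sequence is just a zero length prefix.
  const uint8_t payload[] = {0x00, 0x00, 0x00, 0x00 /*, ...next member... */};

  // This mirrors the `cdr >> cdrSize` read (assuming the encapsulation
  // endianness matches the host).
  uint32_t transforms_length = 0;
  std::memcpy(&transforms_length, payload, sizeof(transforms_length));

  std::cout << "transforms length on the wire: " << transforms_length << "\n";  // 0
  // A value like 1157637866 at this position would mean the length is being
  // read from the wrong offset (or from corrupted data), not that the server
  // serialized a huge sequence.
  return 0;
}
```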
As it starts to look like the RMW deserialiser is somehow deserialising corrupt/incorrect data, I tried to look for the values for __new_size mentioned in the backtraces in the Wireshark captures as binary data. I noticed the following:

| # | __new_size | in hex | where? |
|---|------------|--------|--------|
| 1 | 1157681529 | 0x79D10045 | FollowJointTrajectory_Feedback message, request_id + object_id fields |
| 2 | 1157646959 | 0x6F4A0045 | FollowJointTrajectory_Feedback message, request_id + object_id fields |
| 3 | 1157637866 | 0xEA260045 | FollowJointTrajectory_Feedback message, request_id + object_id fields |
| 4 | 1668008540 | 0x5CCA6B63 | |
All of the values could be found in the network traffic, but in 'unexpected' places. Sometimes at the start of a payload, sometimes even before the actual ROS message payload, sometimes at a 'random' location in the serialised ROS message payload.
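(To make one of those matches concrete: interpreting the four request_id + object_id bytes of entry 3 as a single little-endian uint32 reproduces the __new_size from the third crash exactly. Only the byte values come from the capture; the rest is just a standalone check:)

```c++
#include <cstdint>
#include <cstring>
#include <iostream>

int main()
{
  // Bytes as they appear on the wire for entry 3 above:
  // request_id (0xEA, 0x26) followed by object_id (0x00, 0x45).
  const uint8_t wire_bytes[4] = {0xEA, 0x26, 0x00, 0x45};

  // Read them as one 32-bit integer the way a little-endian (x86-64) client
  // would if it mistook them for a sequence length.
  uint32_t as_length = 0;
  std::memcpy(&as_length, wire_bytes, sizeof(as_length));

  std::cout << as_length << "\n";  // prints 1157637866 -- the __new_size of crash 3
  return 0;
}
```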
The request_id + object_id fields mentioned in the where? column are micro-ROS specific (actually: XRCE-DDS specific) packet fields, which always occur before the serialised ROS message payload.

The last value appears to be a Unix Epoch and converts to Wednesday, November 9, 2022 4:42:20 PM GMT+01:00. This actually corresponds to the day the Wireshark capture was recorded.

I'm hoping this is sufficient information to at least trigger some people here and start a discussion. As the crashes are not easily reproduced, debugging has been difficult. Individually, all steps investigated (ie: network traffic, (de)serialisation, etc) appear to work correctly.
Perhaps accumulated internal state of the RMW causes a desync/memory corruption somewhere?
GDB backtraces:
Crash 1
```gdb
Thread 1 "my_node" received signal SIGABRT, Aborted.
__pthread_kill_implementation (no_tid=0, signo=6, threadid=140737339898176) at ./nptl/pthread_kill.c:44
44      ./nptl/pthread_kill.c: No such file or directory.
(gdb) where
#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=140737339898176) at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (signo=6, threadid=140737339898176) at ./nptl/pthread_kill.c:78
#2  __GI___pthread_kill (threadid=140737339898176, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3  0x00007ffff7642476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4  0x00007ffff76287f3 in __GI_abort () at ./stdlib/abort.c:79
#5  0x00007ffff7aa2bbe in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x00007ffff7aae24c in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#7  0x00007ffff7aae2b7 in std::terminate() () from /lib/x86_64-linux-gnu/libstdc++.so.6
#8  0x00007ffff7aae518 in __cxa_throw () from /lib/x86_64-linux-gnu/libstdc++.so.6
#9  0x00007ffff7aa27cc in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#10 0x00007ffff664b6ae in __gnu_cxx::new_allocator
```

Crash 2
```gdb
[Switching to Thread 0x7fffd2ffd640 (LWP 831209)]
__pthread_kill_implementation (no_tid=0, signo=6, threadid=140736733369920) at ./nptl/pthread_kill.c:44
44      ./nptl/pthread_kill.c: No such file or directory.
(gdb) where
#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=140736733369920) at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (signo=6, threadid=140736733369920) at ./nptl/pthread_kill.c:78
#2  __GI___pthread_kill (threadid=140736733369920, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3  0x00007ffff7642476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4  0x00007ffff76287f3 in __GI_abort () at ./stdlib/abort.c:79
#5  0x00007ffff7aa2bbe in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x00007ffff7aae24c in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#7  0x00007ffff7aae2b7 in std::terminate() () from /lib/x86_64-linux-gnu/libstdc++.so.6
#8  0x00007ffff7aae518 in __cxa_throw () from /lib/x86_64-linux-gnu/libstdc++.so.6
#9  0x00007ffff7aa27cc in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#10 0x00007ffff664b6ae in __gnu_cxx::new_allocator
```

Crash 3
```gdb
Thread 14 "my_node" received signal SIGABRT, Aborted.
[Switching to Thread 0x7fffd3fff640 (LWP 855301)]
__pthread_kill_implementation (no_tid=0, signo=6, threadid=140736750155328) at ./nptl/pthread_kill.c:44
44      ./nptl/pthread_kill.c: No such file or directory.
(gdb) where
#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=140736750155328) at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (signo=6, threadid=140736750155328) at ./nptl/pthread_kill.c:78
#2  __GI___pthread_kill (threadid=140736750155328, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3  0x00007ffff7642476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4  0x00007ffff76287f3 in __GI_abort () at ./stdlib/abort.c:79
#5  0x00007ffff7aa2bbe in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x00007ffff7aae24c in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#7  0x00007ffff7aae2b7 in std::terminate() () from /lib/x86_64-linux-gnu/libstdc++.so.6
#8  0x00007ffff7aae518 in __cxa_throw () from /lib/x86_64-linux-gnu/libstdc++.so.6
#9  0x00007ffff7aa27cc in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#10 0x00007ffff664b6ae in __gnu_cxx::new_allocator
```

Crash 4
```gdb
Thread 15 "my_node" received signal SIGABRT, Aborted.
[Switching to Thread 0x7fffa3ffe640 (LWP 858961)]
__pthread_kill_implementation (no_tid=0, signo=6, threadid=140735944844864) at ./nptl/pthread_kill.c:44
44      ./nptl/pthread_kill.c: No such file or directory.
(gdb) where
#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=140735944844864) at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (signo=6, threadid=140735944844864) at ./nptl/pthread_kill.c:78
#2  __GI___pthread_kill (threadid=140735944844864, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3  0x00007ffff7642476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4  0x00007ffff76287f3 in __GI_abort () at ./stdlib/abort.c:79
#5  0x00007ffff7aa2bbe in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x00007ffff7aae24c in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#7  0x00007ffff7aae2b7 in std::terminate() () from /lib/x86_64-linux-gnu/libstdc++.so.6
#8  0x00007ffff7aae518 in __cxa_throw () from /lib/x86_64-linux-gnu/libstdc++.so.6
#9  0x00007ffff7aa27cc in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#10 0x00007ffff664b6ae in __gnu_cxx::new_allocator
```

Crashing code
```c++
bool
ROSIDL_TYPESUPPORT_FASTRTPS_CPP_PUBLIC_trajectory_msgs
cdr_deserialize(
  eprosima::fastcdr::Cdr & cdr,
  trajectory_msgs::msg::MultiDOFJointTrajectoryPoint & ros_message)
{
  // Member: transforms
  {
    uint32_t cdrSize;
    cdr >> cdrSize;
    size_t size = static_cast
```