Open JWhitleyWork opened 3 years ago
I'd say this:
[state_estimation_node_exe-1] 1608317348.993236 [1] state_esti: ddsi_udp_create_conn: set IP_MULTICAST_IF failed: Bad Parameter
is the weird bit: then the DDS domain doesn't start, the participant can't be created, nor can the node be created, &c. So one would think everything else is collateral damage. That error really means:
return ddsrt_setsockopt (sock, IPPROTO_IP, IP_MULTICAST_IF, gv->ownloc.address + 12, 4);
failed, or really
if (setsockopt(sock, level, optname, optval, optlen) == -1) {
goto err_setsockopt;
}
Unfortunately, many errors get mapped to BAD_PARAMETER, so we don't know exactly what it is. However, sock must be ok or it wouldn't have made it to this point. So it is most likely that the kernel is unhappy with the address.
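To see which errno is hiding behind Bad Parameter, the same call can be made by hand. This is a minimal sketch, not Cyclone's code: it uses Python's socket module only because it exposes the same setsockopt call, and ipv4_from_locator, try_multicast_if and describe are hypothetical helper names. It assumes, based on the "+ 12, 4" in the call above, that the own address is a 16-byte field with the IPv4 address in its last four bytes.

```python
import errno
import os
import socket


def ipv4_from_locator(address16):
    """Return the last 4 bytes of a 16-byte address (assumed layout)."""
    if len(address16) != 16:
        raise ValueError("expected a 16-byte address")
    return address16[12:16]


def try_multicast_if(packed_ipv4):
    """Set IP_MULTICAST_IF on a fresh UDP socket.

    Returns None on success, or the errno of the first failing call
    (socket creation or setsockopt).
    """
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_IF, packed_ipv4)
    except OSError as exc:
        return exc.errno
    return None


def describe(result):
    """Turn the result of try_multicast_if() into a readable string."""
    if result is None:
        return "ok"
    name = errno.errorcode.get(result, "unknown errno")
    return f"{name} ({result}): {os.strerror(result)}"


if __name__ == "__main__":
    # 127.0.0.1 is only a placeholder; substitute the address of the
    # interface Cyclone selected on the failing machine.
    locator = bytes(12) + socket.inet_aton("127.0.0.1")
    print(describe(try_multicast_if(ipv4_from_locator(locator))))
```

Run on the failing machine, this names the underlying errno rather than the generic Bad Parameter, which should narrow down whether the kernel objects to the address or to the option itself.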
There are some things that should be fairly simple to do and that ought to provide a bit more information:
strace to trace the inputs and outputs of the networking-related system calls; I suspect that will show it always tries to set the multicast interface to use for outgoing packets, also when the socket isn't used for transmitting and even when the interface doesn't support multicast. Perhaps that is what the kernel doesn't like ... The good news is that the fix will be pretty straightforward if this is the case.
You might want to give this:
https://github.com/eboasson/cyclonedds/tree/mc-options-only-when-needed
a try. It is based on the possibility that you are not actually using multicast and that there is therefore no need to set this particular option. Otherwise, I really need to know a bit more about what it actually passes into the kernel and which network interfaces exist (like I mentioned). I have no reason to believe the interface numbers Cyclone uses are incorrect, but I've been caught by stupid bugs before.
As another data point, it doesn't happen on Ubuntu 20.04 on qemu on macOS on an M1 ... that's not emulation, of course, but still emulated devices.
@eboasson Do you know how to easily replace the installed CycloneDDS installation with this one, built from source? Just adding the repository from my workspace, colcon doesn't seem to recognize that it's a deep-down dependency when building with --packages-up-to.
@JWhitleyWork I used to do --packages-up-to rmw_cyclonedds_cpp but for rolling/galactic that's no longer needed. The easiest should be to just do colcon build --packages-select cyclonedds; all you need is the new libddsc.so.0.7.0 (or whatever the number is).
What version of rmw_cyclonedds_cpp are you using, by the way? Because the RMW layer uses some things that aren't part of the stable API to paper over differences between ROS2 and DDS and some changes happened that are binary compatible but not quite source compatible. The patch I did should cherry-pick cleanly to just about any version of Cyclone if needed.
I'm seeing a similar error message (ddsi_udp_create_conn: set IP_MULTICAST_IF failed: Bad Parameter) when trying to use a publisher in an arm64 Ubuntu container that's running on a 64-bit Debian 11 host using qemu-user-static for the emulation.
I was able to replicate the error with a DDS publisher without ROS 2 (using cyclonedds-cxx). I tried applying the commit @eboasson linked to in his branch above to cyclonedds, but the commit didn't apply cleanly to the latest version of cyclonedds.
@jpace121, I rebased https://github.com/eboasson/cyclonedds/tree/mc-options-only-when-needed and would appreciate it very much if you gave it another go.
That didn't seem to fix it. With that being said, I noticed tonight that I'm also getting the following warning on stdout:
qemu: uncaught target signal 6 (Aborted) - core dumped
Aborted (core dumped)
so I'm guessing my bug may be caused by something going on in qemu, not cyclone.
Thanks for the assistance.
@jpace121 I imagine any qemu crash would be rather serious, I'm sure they'd like to hear of it.
Anyway, the point of that branch was to avoid doing setsockopt(... IP_MULTICAST_IF ...) when multicast is not used. One reason was that it seemed plausible (however remotely) that it might give an error if multicast wasn't enabled somewhere in the kernel or below; the other reason is that it should at least give a workaround by disabling multicast in Cyclone.
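The idea behind the branch can be sketched in a few lines. This is an illustration of the guard only, not the actual patch: configure_multicast_if and its use_multicast flag are hypothetical names, and Python's socket module stands in for the C call.

```python
import socket


def configure_multicast_if(sock, interface_ipv4, use_multicast):
    """Select the outgoing multicast interface only when multicast is in use.

    Returns True if the option was set, False if it was skipped.
    """
    if not use_multicast:
        # Nothing to configure: the call that fails in this issue never happens.
        return False
    sock.setsockopt(
        socket.IPPROTO_IP,
        socket.IP_MULTICAST_IF,
        socket.inet_aton(interface_ipv4),
    )
    return True
```

With multicast disabled the socket is left untouched, so a platform that rejects IP_MULTICAST_IF can still create the connection; with multicast enabled the behaviour is unchanged and any error still surfaces.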
One thing I have wondered about is what it is actually passing into the kernel. As these calls happen really early on in the process and it sounds like it is fairly easily reproduced (especially now you don't even need to involve ROS), perhaps it would be possible to log the system calls with strace -v (I think that's the option that gives all the gory details; possibly -e network would be sensible too). That would at least give us some more understanding of what goes on when it fails.
I can reproduce this in a much shorter invocation using the example rclcpp executables. The output below is on Galactic; I first saw it on Foxy.
$ docker run --platform=linux/aarch64 -ti ros bash
Unable to find image 'ros:latest' locally
latest: Pulling from library/ros
Digest: sha256:dec83e0668b43f0f7734935d91a1c9c7a89fba6dd878cd2bd6aedffd196132c3
Status: Downloaded newer image for ros:latest
root@66357d5ba0db:/# apt-get update -qq && apt-get install -qqy ros-galactic-examples-rclcpp-minimal-subscriber ros-galactic-examples-rclcpp-minimal-publisher
## INSTALL LOGS TRUNCATED
root@66357d5ba0db:/# . /opt/ros/galactic/setup.bash
root@66357d5ba0db:/# export ROS_DOMAIN_ID=84
root@66357d5ba0db:/# ros2 run examples_rclcpp_minimal_publisher publisher_member_function
Unsupported setsockopt level=0 optname=32
1643936991.678175 [84] publisher_: ddsi_udp_create_conn: set IP_MULTICAST_IF failed: Bad Parameter
[ERROR] [1643936991.681629437] [rmw_cyclonedds_cpp]: rmw_create_node: failed to create domain, error Error
>>> [rcutils|error_handling.c:108] rcutils_set_error_state()
This error state is being overwritten:
'error not set, at /tmp/binarydeb/ros-galactic-rcl-3.1.2/src/rcl/node.c:261'
with this new error message:
'rcl node's rmw handle is invalid, at /tmp/binarydeb/ros-galactic-rcl-3.1.2/src/rcl/node.c:413'
rcutils_reset_error() should be called after error handling to avoid this.
<<<
[ERROR] [1643936991.685166614] [rcl]: Failed to fini publisher for node: 1
terminate called after throwing an instance of 'rclcpp::exceptions::RCLError'
what(): failed to initialize rcl node: rcl node's rmw handle is invalid, at /tmp/binarydeb/ros-galactic-rcl-3.1.2/src/rcl/node.c:413
qemu: uncaught target signal 6 (Aborted) - core dumped
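The "Unsupported setsockopt level=0 optname=32" line printed just before the Cyclone error looks like it comes from the emulation layer rather than from Cyclone (an assumption; nothing in this thread confirms it). The numbers can be mapped back to names with a short lookup. This is a sketch: sockopt_names is a hypothetical helper, and the values it compares against are those of the platform Python runs on, so it is only meaningful when run on Linux, like the container above.

```python
import socket


def sockopt_names(level, optname):
    """List the IP_* constant names in Python's socket module matching optname.

    Only IPPROTO_IP is handled; other levels return an empty list.
    """
    if level != socket.IPPROTO_IP:
        return []
    return sorted(
        name
        for name in dir(socket)
        if name.startswith("IP_") and getattr(socket, name) == optname
    )


if __name__ == "__main__":
    # Values taken from the "Unsupported setsockopt level=0 optname=32" line.
    print(sockopt_names(0, 32))
```

On Linux, where level 0 is IPPROTO_IP and IP_MULTICAST_IF is 32, the result should include IP_MULTICAST_IF, which would tie that line to the very setsockopt call Cyclone reports as failing.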
Hi @tfoote, one would think the patch linked to in https://github.com/ros2/rmw_cyclonedds/issues/273#issuecomment-930915050 will fix it by not doing IP_MULTICAST_IF when multicast is not enabled. It is not in 0.8.x because I didn't want to change that branch without having confirmed that it solved the problem, but if it works I'll merge it.
But I'd still like to know what's going on, so if you have a chance to reproduce it with Cyclone tracing enabled (i.e., export CYCLONEDDS_URI='<Tr><C>trace</C><Out>cdds.log.${CYCLONEDDS_PID}</></>') and with strace to get the exact parameters passed to the system call, I'd be grateful. The latter is the only way I can think of to be absolutely certain that it gets passed the right parameters (I think that's likely, I just want to be sure), and the former has everything to tell me if it ends up with multicast enabled or disabled internally.
Here's an example of the cdds.log generated.
To reproduce this:
docker run --platform=linux/aarch64 -ti ros bash
And inside:
export CYCLONEDDS_URI='<Tr><C>trace</C><Out>cdds.log.${CYCLONEDDS_PID}</></>'
apt-get install -qqy ros-galactic-examples-rclcpp-minimal-subscriber ros-galactic-examples-rclcpp-minimal-publisher
. /opt/ros/galactic/setup.bash
export ROS_DOMAIN_ID=84
apt-get install strace
strace /opt/ros/galactic/lib/examples_rclcpp_minimal_publisher/publisher_member_function
cat cdds.log.*
Bug report
Required Info:
arm64
rmw_cyclonedds_cpp
Steps to reproduce issue
Expected behavior
Tests pass.
Actual behavior
First test fails with the following error on Dashing:
And the following on Foxy:
Additional information
This has been tested both on arm64-native hardware and using arm64 binary translation with QEMU. The test fails consistently in both Dashing and Foxy when using QEMU and occasionally on arm64-native hardware depending on the system load, indicating that this is timing-dependent. Using rmw_fastrtps_cpp, this failure does not occur. We have also only seen this error on this one test, indicating it's a pretty corner-case scenario.
For some context to aid in troubleshooting, this test intentionally provides the state_estimation_node with conflicting parameters data_driven: true and output_frequency: 30.0. Using the free function time_between_publish_requests in src/prediction/state_estimation_node/src/state_estimation_node.cpp with both of these parameters should throw a std::logic_error on line 74 before any publishers/subscribers are created but after the node has been constructed/initialized. In Dashing, there is no indication that this exception is ever thrown but Foxy does show the output with the correct exception, only after the unexpected rcl error.