Closed ted-miller closed 1 year ago
Please provide instructions for replicating
Well that's part of the problem. I cannot reliably reproduce the issue on-demand. I don't know what is triggering the failure.
But here's what I did, verbatim.
Windows:
Open command prompt.
call "C:\Program Files\Microsoft Visual Studio\2022\Enterprise\VC\Auxiliary\Build\vcvars64.bat"
call "C:\Users\millete\Projects\ROS-2\ros2-humble\local_setup.bat"
call "C:\path\to\ros2-humble\local_setup.bat"
mkdir "%USERPROFILE%\micro_ros_agent_ws\src"
git clone ^
-b humble ^
https://github.com/micro-ROS/micro_ros_msgs.git ^
"%USERPROFILE%\micro_ros_agent_ws\src\micro_ros_msgs"
git clone ^
-b humble ^
https://github.com/micro-ROS/micro-ROS-Agent.git ^
"%USERPROFILE%\micro_ros_agent_ws\src\micro-ROS-Agent"
cd "%USERPROFILE%\micro_ros_agent_ws"
colcon build ^
--merge-install ^
--packages-up-to micro_ros_agent ^
--cmake-args ^
"-DUAGENT_BUILD_EXECUTABLE=OFF" ^
"-DUAGENT_P2P_PROFILE=OFF" ^
"--no-warn-unused-cli"
cd "%USERPROFILE%\micro_ros_agent_ws"
call install\setup.bat
ros2 run micro_ros_agent micro_ros_agent udp4 --port 8888
Robot connects to agent.
Open another command prompt
call "C:\Program Files\Microsoft Visual Studio\2022\Enterprise\VC\Auxiliary\Build\vcvars64.bat"
call "C:\Users\millete\Projects\ROS-2\ros2-humble\local_setup.bat"
ros2 topic echo /joint_states
Agent crashes.
I'll continue testing and see if I can come up with something more definitive.
I have found a reliable way to make it crash.
It seems to be related to the hard-liveliness-check. If I turn off my robot and wait ten seconds, the agent stops with [ros2run]: Process exited with failure 3221225477.
I don't know that this is the same issue as the first crash I experienced. But, it is a reproducible issue.
EDIT: (This is on Windows 10)
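For anyone reading along: the number in that message is easier to interpret in hex. 3221225477 is 0xC0000005, which is the Windows NTSTATUS value STATUS_ACCESS_VIOLATION, so this is roughly the Windows counterpart of a segmentation fault. A minimal Python sketch of the conversion follows; the helper name describe_exit_code and the small lookup table are my own illustration, not part of ros2run or the Agent.

```python
# A small, hand-picked subset of Windows NTSTATUS values (not exhaustive).
KNOWN_NTSTATUS = {
    0xC0000005: "STATUS_ACCESS_VIOLATION",
    0xC00000FD: "STATUS_STACK_OVERFLOW",
    0xC0000409: "STATUS_STACK_BUFFER_OVERRUN",
}


def describe_exit_code(code: int) -> str:
    """Return the exit code in hex, plus its NTSTATUS name if it is in the table."""
    # Exit codes are sometimes reported as signed 32-bit values; normalise first.
    unsigned = code & 0xFFFFFFFF
    name = KNOWN_NTSTATUS.get(unsigned, "unknown")
    return f"0x{unsigned:08X} ({name})"


print(describe_exit_code(3221225477))
```

This only names the kind of failure. It does not say which library performed the invalid access.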
I'll try to reproduce it.
Could you confirm if this issue is replicable under Linux?
@pablogs9: apologies for the 'vague' description here. It's just that we don't have any additional information ourselves, so we can not share anything more than @ted-miller already did.
> Could you confirm if this issue is replicable under Linux?
In the meantime I've been able to reproduce this on Linux as well, specifically using the humble Docker image.
A sequence of starting the humble image, having the Client connect and then disappear (ie: not disconnect properly) while having subscribers active (one or multiple) seems to trigger it often. But not always.
Sometimes the Agent just hangs (no symptoms other than a non-responding Agent), other times it hangs while consuming 100% of a CPU core (and being non-responsive).
Sometimes it crashes with a SEGFAULT.
It never really seems to print anything related to this, but I haven't run it with an increased debug level (so there might be something, but I haven't seen it).
I cannot reproduce it using the galactic image.
Both Docker images were pulled last Friday (the 1st).
One complicating factor: we are not completely up-to-date on the Client side, and given the fact there seems to be some implication it has something to do with the liveness check (which appears to have had some PRs merged), it could be our Client is misbehaving (at least from the perspective of the Agent).
Again, our apologies for the vague report. I'll update when we have more information.
Edit: oh and when the Agent becomes unresponsive, the Client application receives non-zero return values from rclc_node_init_default(..) and rclc_support_init_with_options(..).
And to clarify: while the Agent is unresponsive on the PC side, apparently it is responsive enough to still let the Client progress through at least a few phases in the connection handshake. It doesn't complete it though, as evidenced by the failures in rclc_node_init_default(..) and rclc_support_init_with_options(..).
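To tell the three outcomes described above apart (hang, crash, normal exit) when the Agent is launched from a script, the return code reported by Python's subprocess module is enough for a first cut. This is a rough sketch under assumptions: the helper name classify_exit is made up for illustration, and it assumes the Agent is started directly as a child process (not through docker run, which reports its own exit status). A plain hang and a 100% CPU hang both show up as "still running" here.

```python
import signal
from typing import Optional


def classify_exit(returncode: Optional[int]) -> str:
    """Describe a return code as reported by subprocess.Popen.poll()."""
    if returncode is None:
        # poll() returns None while the child has not terminated.
        return "still running (possibly hung)"
    if returncode == 0:
        return "clean exit"
    if returncode < 0:
        # On POSIX, subprocess reports death-by-signal as a negative number.
        try:
            name = signal.Signals(-returncode).name
        except ValueError:
            name = "unknown signal"
        return f"killed by signal {-returncode} ({name})"
    return f"exited with status {returncode}"


print(classify_exit(None))
print(classify_exit(-int(signal.SIGSEGV)))
```

In practice one would call proc.poll() on the Popen object some time after the Client has disappeared and pass the result to classify_exit.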
I'm going to try with some subscribers on the client-side.
Update: Working as expected with subscribers.
In which version of the Micro XRCE-DDS Client are you?
https://github.com/eProsima/Micro-XRCE-DDS-Client/commit/e3f6439013a1a9ecb0d4011d19d7a4cec2c84655
We had to track develop for a while, but I believe this is in ros2 as well now.
> I'm going to try with some subscribers on the client-side.
The subscribers I mentioned are on the Agent side (ie: ROS PC).
> Update: Working as expected with subscribers.
just to make sure: it takes a few disappearances of the Client to trigger this. It's not 100% reproducible all the time.
> just to make sure: it takes a few disappearances of the Client to trigger this. It's not 100% reproducible all the time.
How many, approximately? 10? 100?
2 to 5 when I tried to reproduce it on Friday.
And we're using UDPv4.
Command used to start the Agent (in my case, @ted-miller's is different, as he's on Windows, but he's shown the command he's using in https://github.com/micro-ROS/micro-ROS-Agent/issues/157#issuecomment-1172317991):
docker run \
-it \
--rm \
--net=host \
microros/micro-ros-agent:humble \
udp4 \
--port 8888
I've just updated our XRCE-DDS Client to v2.2.0, and can still reproduce this with the humble image.
Sequence of steps/commands:
[...] micro-ros-agent:humble image using the command mentioned earlier
[...] JointState subscribers, both on the same topic), keep them running
[...] SEGFAULT in this case)
I'll see if I can run the Agent in gdb, but as I'm using the Docker image, that may not reveal too much.
Ok, we are going to replicate this scenario.
Replicated, working on it.
Could you check if you can replicate this using a bare Micro XRCE-DDS Agent instead of the micro-ROS Agent?
I'm participating in World ROS-I day today, but I believe @ted-miller will be online in about 6 hours from now.
Would you have an idea / hunch already?
I can try this today
BTW: Last week, I opened the uros Agent Visual Studio solution in the hopes that it would show me where the crash occurs. The call stack seems to indicate that the exception is coming from the fastrtps library. But, I didn't have the debug symbols, so I'm not getting any real useful info.
I tried building a version with debug symbols. But, I couldn't get it to link everything properly.
EDIT: The call stack also included the micro xrce-dds agent library too. So, it could be an issue in the xrce agent that passed invalid data to fastrtps.
I was not able to reproduce the issue using the micro-xrce-dds agent.
cd xrce_ws/src
git clone -b ros2 https://github.com/eProsima/Micro-XRCE-DDS-Agent.git
cd ../
colcon build
source install/local_setup.bash
cd install/microxrcedds_agent/bin
./MicroXRCEAgent udp4 --port 8888
I have connected (using the same client application on the robot as previous tests) 10 times without issue.
@ted-miller: but there were no subscribers that time, were there?
Correct, there were no subscribers.
But, I didn't have any subscribers in previous reproductions either.
Ah, ok.
I've always had subscribers, and that seems to 'reliably' trigger it.
Another data point: just had the Agent SEGFAULT on me as I was creating a new subscriber.
So same steps as in https://github.com/micro-ROS/micro-ROS-Agent/issues/157#issuecomment-1175106689, but I didn't make it past step 3.
Same Client version, same version of the Docker image (humble).
We replicated this using your instructions, and the issue seems to be identified by the Fast DDS team.
CC: @EduPonz @MiguelCompany @jsantiago-eProsima
friendly ping @pablogs9.
Not sure @gavanderhoorn, maybe @EduPonz @MiguelCompany @jsantiago-eProsima can tell you
Are eProsima/Fast-DDS#2794, eProsima/Fast-DDS#2801 and eProsima/Fast-DDS#2828 related?
It sure looks like it! We are bundling a Fast DDS v2.6.2 by the end of this week so it'll be included in the next Humble sync.
@EduPonz thanks.
Was this a problem specific to Humble (ie: Fast-DDS on Humble)? I've not been able to reproduce the problem(s) with a Galactic image.
The Agent Docker images also get FastDDS from the OR repositories, correct?
@gavanderhoorn micro-ROS Agent uses the installed Fast DDS version if it is available. So in the docker, it uses the OSRF distributed binary, yes.
Have the PRs related to the problem discussed in this issue been merged upstream already? I'd like to test whether the crash has been solved.
From your description it sounds like I could force using a from-source build of FastRTPS by avoiding installing the binary packages. Correct?
Could you check if https://github.com/micro-ROS/micro-ROS-Agent/pull/169 solves this issue?
Yes, #169 appears to have fixed the issue. Thank you.
Thanks for the fix.
Have/will the docker image(s) be(en) updated?
@gavanderhoorn ongoing generation: https://github.com/micro-ROS/docker/actions/runs/2912224500
Sorry for the delay in the fix!
Describe the bug
The agent crashes.
On Windows:
[ros2run]: Process exited with failure 3221225477
On Ubuntu (using docker image):
[ros2run]: Segmentation fault
To Reproduce
Not sure.
For the first crash (using Windows), my robot connected to the agent just fine. Then I opened another command prompt and tried ros2 topic echo /joint_states. The agent immediately failed with [ros2run]: Process exited with failure 3221225477.
Then I switched to my Ubuntu machine and started up the docker image.
It seemed to be working, so I assumed it was something wrong with the Windows version.
I left the agent connected with the robot running. (Robot might have been rebooted at some point; I don't really know.) Some time later, I went to shut down the Ubuntu machine and saw the agent had failed.
[ros2run] Segmentation fault
So, I figured I would try it on Windows again to get a procedure to reproduce the error. But, I can't make it happen again.
System information (please complete the following information):
Additional context
Up until now, I had been using a galactic version of the Agent. I had not had any problems with it. This is my first time using Humble.