micro-ROS / micro-ROS-Agent

ROS 2 package using Micro XRCE-DDS Agent.
Apache License 2.0

Agent crash #157

Closed ted-miller closed 1 year ago

ted-miller commented 2 years ago

Describe the bug

The agent crashes.

On Windows: [ros2run]: Process exited with failure 3221225477

On Ubuntu (using docker image): [ros2run]: Segmentation fault
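A note on the Windows failure number: 3221225477 is 0xC0000005 in hexadecimal, which is the NTSTATUS value STATUS_ACCESS_VIOLATION. The Windows failure is therefore the counterpart of the segmentation fault seen on Ubuntu. A minimal sketch of the conversion follows; the helper name ntstatus_hex is made up for illustration.

```shell
# Print a decimal Windows exit code as an 8-digit hexadecimal NTSTATUS value.
# ntstatus_hex is a hypothetical helper name, not part of any ROS 2 tool.
ntstatus_hex() {
  printf '0x%08X\n' "$1"
}

ntstatus_hex 3221225477   # prints 0xC0000005 (STATUS_ACCESS_VIOLATION)
```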

To Reproduce

Not sure.

For the first crash (using Windows), my robot connected to the agent just fine. Then I opened another command prompt and tried ros2 topic echo /joint_states. The agent immediately failed with [ros2run]: Process exited with failure 3221225477.

Then I switched to my Ubuntu machine and started up the docker image.

docker run \
  -it \
  --rm \
  --net=host \
  microros/micro-ros-agent:humble \
    udp4 \
    --port 8888

It seemed to be working, so I assumed it was something wrong with the Windows version.

I left the agent connected with the robot running. (Robot might have been rebooted at some point; I don't really know.) Some time later, I went to shut down the Ubuntu machine and saw the agent had failed. [ros2run] Segmentation fault

So, I figured I would try it on Windows again to get a procedure to reproduce the error. But, I can't make it happen again.

Additional context

Up until now, I had been using a galactic version of the Agent. I had not had any problems with it. This is my first time using Humble.

pablogs9 commented 2 years ago

Please provide instructions for replicating

ted-miller commented 2 years ago

Well that's part of the problem. I cannot reliably reproduce the issue on-demand. I don't know what is triggering the failure.

But here's what I did, verbatim.

Windows:

Open command prompt.

call "C:\Program Files\Microsoft Visual Studio\2022\Enterprise\VC\Auxiliary\Build\vcvars64.bat"
call "C:\Users\millete\Projects\ROS-2\ros2-humble\local_setup.bat"

call "C:\path\to\ros2-humble\local_setup.bat"
mkdir "%USERPROFILE%\micro_ros_agent_ws\src"
git clone ^
  -b humble ^
  https://github.com/micro-ROS/micro_ros_msgs.git ^
  "%USERPROFILE%\micro_ros_agent_ws\src\micro_ros_msgs"
git clone ^
  -b humble ^
  https://github.com/micro-ROS/micro-ROS-Agent.git ^
  "%USERPROFILE%\micro_ros_agent_ws\src\micro-ROS-Agent"
cd "%USERPROFILE%\micro_ros_agent_ws"
colcon build ^
  --merge-install ^
  --packages-up-to micro_ros_agent ^
  --cmake-args ^
  "-DUAGENT_BUILD_EXECUTABLE=OFF" ^
  "-DUAGENT_P2P_PROFILE=OFF" ^
  "--no-warn-unused-cli"

cd "%USERPROFILE%\micro_ros_agent_ws"
call install\setup.bat
ros2 run micro_ros_agent micro_ros_agent udp4 --port 8888

Robot connects to agent.

Open another command prompt

call "C:\Program Files\Microsoft Visual Studio\2022\Enterprise\VC\Auxiliary\Build\vcvars64.bat"
call "C:\Users\millete\Projects\ROS-2\ros2-humble\local_setup.bat"

ros2 topic echo /joint_states

Agent crashes.

I'll continue testing and see if I can come up with something more definitive.

ted-miller commented 2 years ago

I have found a reliable way to make it crash.

It seems to be related to the hard-liveliness-check. If I turn off my robot and wait ten seconds, the agent stops with [ros2run]: Process exited with failure 3221225477.

I don't know that this is the same issue as the first crash I experienced. But, it is a reproducible issue.

EDIT: (This is on Windows 10)

pablogs9 commented 2 years ago

I'll try to reproduce it.

pablogs9 commented 2 years ago

Could you confirm if this issue is replicable under Linux?

gavanderhoorn commented 2 years ago

@pablogs9: apologies for the 'vague' description here. It's just that we don't have any additional information ourselves, so we can not share anything more than @ted-miller already did.

> Could you confirm if this issue is replicable under Linux?

In the meantime I've been able to reproduce this on Linux as well, specifically using the humble Docker image.

A sequence of starting the humble image, having the Client connect and then disappear (ie: not disconnect properly) while having subscribers active (one or multiple) seems to trigger it often. But not always.

Sometimes the Agent just hangs (no symptoms other than a non-responding Agent), other times it hangs while consuming 100% of a CPU core (and being non-responsive).

Sometimes it crashes with a SEGFAULT.
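To tell a crash from a normal exit when the Agent is started from a script: on Linux, a shell reports a process killed by a signal as exit status 128 plus the signal number, so a segmentation fault (SIGSEGV, signal 11) shows up as 139. A rough sketch, where describe_exit is a hypothetical helper name:

```shell
# Classify an exit status as returned by the shell in $?.
# describe_exit is a hypothetical helper, shown only for illustration.
describe_exit() {
  status=$1
  if [ "$status" -eq 0 ]; then
    echo "exited normally"
  elif [ "$status" -gt 128 ]; then
    sig=$((status - 128))
    # kill -l maps a signal number to its name (e.g. 11 -> SEGV).
    name=$(kill -l "$sig" 2>/dev/null || echo unknown)
    echo "killed by signal $sig ($name)"
  else
    echo "exited with error $status"
  fi
}
```

Calling describe_exit "$?" directly after the docker run command returns would then report signal 11 for the SEGFAULT case. The hang and 100% CPU cases never return, so they would not be caught this way.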

It never really seems to print anything related to this, but I haven't run it with an increased debug level (so there might be something, but I haven't seen it).
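For reference, the Agent command line takes a -v flag with a verbosity level from 0 to 6. A sketch of starting the humble image with maximum verbosity follows; it assumes the image forwards -v to the Agent the same way it forwards udp4 and --port, and run_agent_verbose is a made-up wrapper name.

```shell
# Start the humble Agent image with an explicit log verbosity (0-6).
# run_agent_verbose is a hypothetical wrapper; the docker arguments mirror
# the command used elsewhere in this thread, with -v<level> appended.
run_agent_verbose() {
  level=${1:-6}
  docker run -it --rm --net=host \
    microros/micro-ros-agent:humble \
    udp4 --port 8888 "-v$level"
}
```

Running run_agent_verbose 6 should print per-message debug output, which may show what the Agent was handling just before it hangs or crashes.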

I cannot reproduce it using the galactic image.

Both Docker images were pulled last Friday (the 1st).

One complicating factor: we are not completely up-to-date on the Client side, and given the fact there seems to be some implication it has something to do with the liveness check (which appears to have had some PRs merged), it could be our Client is misbehaving (at least from the perspective of the Agent).

Again, our apologies for the vague report. I'll update when we have more information.


Edit: oh and when the Agent becomes unresponsive, the Client application receives non-zero return values from rclc_node_init_default(..) and rclc_support_init_with_options(..).

And to clarify: while the Agent is unresponsive on the PC side, apparently it is responsive enough to still let the Client progress through at least a few phases in the connection handshake. It doesn't complete it though, as evidenced by the failures in rclc_node_init_default(..) and rclc_support_init_with_options(..).

pablogs9 commented 2 years ago

I'm going to try with some subscribers on the client-side.

Update: Working as expected with subscribers.

In which version of the Micro XRCE-DDS Client are you?

gavanderhoorn commented 2 years ago

https://github.com/eProsima/Micro-XRCE-DDS-Client/commit/e3f6439013a1a9ecb0d4011d19d7a4cec2c84655

we had to track develop for a while, but I believe this is in ros2 as well now.

> I'm going to try with some subscribers on the client-side.

The subscribers I mentioned are on the Agent side (ie: ROS PC).

> Update: Working as expected with subscribers.

just to make sure: it takes a few disappearances of the Client to trigger this. It's not 100% reproducible all the time.

pablogs9 commented 2 years ago

> just to make sure: it takes a few disappearances of the Client to trigger this. It's not 100% reproducible all the time.

How many, approximately? 10? 100?

gavanderhoorn commented 2 years ago

2 to 5 when I tried to reproduce it on Friday.

And we're using UDPv4.

Command used to start the Agent (in my case, @ted-miller's is different, as he's on Windows, but he's shown the command he's using in https://github.com/micro-ROS/micro-ROS-Agent/issues/157#issuecomment-1172317991):

docker run \
  -it \
  --rm \
  --net=host \
  microros/micro-ros-agent:humble \
    udp4 \
    --port 8888

gavanderhoorn commented 2 years ago

I've just updated our XRCE-DDS Client to v2.2.0, and can still reproduce this with the humble image.

Sequence of steps/commands:

  1. start the micro-ros-agent:humble image using the command mentioned earlier
  2. start the Client
  3. start a number of subscribers on the ROS side (I used two JointState subscribers, both on the same topic), keep them running
  4. reboot the robot (session is not shut down nicely, so timeout teardown is needed)
  5. Agent notices Client disappeared, tears down session (prints "destroyed due to liveliness timeout")
  6. robot finished reboot, Client attempts to reconnect
  7. reconnection succeeds, but ROS subscribers don't get new data
  8. at this point Agent crashes (with a SEGFAULT in this case)
  9. Client notices Agent has disappeared (as expected)
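Step 3 of the sequence above can be scripted roughly as follows. This is a sketch only: the topic name /joint_states is taken from the original report and is an assumption for this setup, and start_subscribers is a hypothetical helper name.

```shell
# Launch N background "ros2 topic echo" subscribers on one topic and print
# their PIDs, one per line, so they can be stopped later with kill.
# start_subscribers is a hypothetical helper, not part of ROS 2.
start_subscribers() {
  count=$1
  topic=$2
  i=0
  while [ "$i" -lt "$count" ]; do
    ros2 topic echo "$topic" >/dev/null 2>&1 &
    echo "$!"
    i=$((i + 1))
  done
}
```

For example, start_subscribers 2 /joint_states would start the two subscribers, after which the robot is rebooted as in step 4.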

I'll see if I can run the Agent in gdb, but as I'm using the Docker image, that may not reveal too much.
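One possible way to get a backtrace outside of Docker is the --prefix option of ros2 run, which wraps the executable in another command. The sketch below assumes gdb is installed and that the Agent was built from source in a sourced workspace; agent_backtrace is a made-up helper name.

```shell
# Run the Agent under gdb in batch mode and print a backtrace if it crashes.
# Assumes gdb is installed and a workspace containing micro_ros_agent has
# been sourced; agent_backtrace is a hypothetical helper name.
agent_backtrace() {
  port=$1
  ros2 run --prefix 'gdb -batch -ex run -ex bt --args' \
    micro_ros_agent micro_ros_agent udp4 --port "$port"
}
```

Without debug symbols for the Agent and Fast DDS, the backtrace would likely show only library names and addresses.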

pablogs9 commented 2 years ago

Ok, we are going to replicate this scenario.

pablogs9 commented 2 years ago

Replicated, working on it.

pablogs9 commented 2 years ago

Could you check if you can replicate this using a bare Micro XRCE-DDS Agent instead of the micro-ROS Agent?

gavanderhoorn commented 2 years ago

I'm participating in World ROS-I day today, but I believe @ted-miller will be online in about 6 hours from now.

Would you have an idea / hunch already?

ted-miller commented 2 years ago

I can try this today

ted-miller commented 2 years ago

BTW: Last week, I opened the uros Agent Visual Studio solution in the hope that it would show me where the crash occurs. The call stack seems to indicate that the exception is coming from the fastrtps library. But, I didn't have the debug symbols, so I'm not getting any really useful info.

I tried building a version with debug symbols. But, I couldn't get it to link everything properly.

EDIT: The call stack also included the micro xrce-dds agent library. So, it could be an issue in the xrce agent that passed invalid data to fastrtps.

ted-miller commented 2 years ago

I was not able to reproduce the issue using the micro-xrce-dds agent.

cd xrce_ws/src
git clone -b ros2 https://github.com/eProsima/Micro-XRCE-DDS-Agent.git
cd ../
colcon build
source install/local_setup.bash
cd install/microxrcedds_agent/bin
./MicroXRCEAgent udp4 --port 8888

I have connected (using the same client application on the robot as previous tests) 10 times without issue.

gavanderhoorn commented 2 years ago

@ted-miller: but there were no subscribers that time, were there?

ted-miller commented 2 years ago

Correct, there were no subscribers.

But, I didn't have any subscribers in previous reproductions either.

gavanderhoorn commented 2 years ago

Ah, ok.

I've always had subscribers, and that seems to 'reliably' trigger it.

gavanderhoorn commented 2 years ago

Another data point: just had the Agent SEGFAULT on me as I was creating a new subscriber.

So same steps as in https://github.com/micro-ROS/micro-ROS-Agent/issues/157#issuecomment-1175106689, but I didn't make it past step 3.

Same Client version, same version of the Docker image (humble).

pablogs9 commented 1 year ago

We replicated this using your instructions, and the issue seems to be identified by the Fast DDS team.

CC: @EduPonz @MiguelCompany @jsantiago-eProsima

gavanderhoorn commented 1 year ago

@pablogs9: are https://github.com/eProsima/Fast-DDS/pull/2794, https://github.com/eProsima/Fast-DDS/pull/2801 and https://github.com/eProsima/Fast-DDS/pull/2828 related?

gavanderhoorn commented 1 year ago

friendly ping @pablogs9.

pablogs9 commented 1 year ago

Not sure @gavanderhoorn, maybe @EduPonz @MiguelCompany @jsantiago-eProsima can tell you

EduPonz commented 1 year ago

> are eProsima/Fast-DDS#2794, eProsima/Fast-DDS#2801 and eProsima/Fast-DDS#2828 related?

It sure looks like it! We are bundling a Fast DDS v2.6.2 by the end of this week so it'll be included in the next Humble sync.
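To check whether an installation already has Fast DDS v2.6.2 or newer, the installed version can be compared against that number. The sketch below relies on sort -V (available in GNU coreutils); version_at_least is a hypothetical helper name, and the Debian package name in the usage note is an assumption.

```shell
# Succeed if the installed version ($1) is at least the required one ($2).
# Uses version sort: the required version must be the smaller (or equal) one.
# version_at_least is a hypothetical helper, shown only for illustration.
version_at_least() {
  installed=$1
  required=$2
  lowest=$(printf '%s\n%s\n' "$installed" "$required" | sort -V | head -n 1)
  [ "$lowest" = "$required" ]
}
```

On Ubuntu, the installed version could be read with something like dpkg-query -W -f='${Version}\n' ros-humble-fastrtps (package name assumed) and passed as the first argument, with 2.6.2 as the second.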

gavanderhoorn commented 1 year ago

@EduPonz thanks.

Was this a problem specific to Humble (ie: Fast-DDS on Humble)? I've not been able to reproduce the problem(s) with a Galactic image.

gavanderhoorn commented 1 year ago

The Agent Docker images also get FastDDS from the OR repositories, correct?

pablogs9 commented 1 year ago

@gavanderhoorn micro-ROS Agent uses the installed Fast DDS version if it is available. So in the docker, it uses the OSRF distributed binary, yes.

gavanderhoorn commented 1 year ago

Have the PRs related to the problem discussed in this issue been merged upstream already? I'd like to test whether the crash has been solved.

From your description it sounds like I could force using a from-source build of FastRTPS by avoiding installing the binary packages. Correct?

pablogs9 commented 1 year ago

Check this and this

pablogs9 commented 1 year ago

Could you check if https://github.com/micro-ROS/micro-ROS-Agent/pull/169 solves this issue?

ted-miller commented 1 year ago

Yes, #169 appears to have fixed the issue. Thank you.

gavanderhoorn commented 1 year ago

Thanks for the fix.

Have/will the docker image(s) be(en) updated?

pablogs9 commented 1 year ago

@gavanderhoorn ongoing generation: https://github.com/micro-ROS/docker/actions/runs/2912224500

Sorry for the delay in the fix!