ros1_bridge silently stops working and cannot be restarted

dljsjr commented 6 years ago

Bug report

Required Info:

Operating System: Ubuntu 16.04
Installation type: Binaries
Version or commit hash: Ardent
DDS implementation: Fast-RTPS
Client library (if applicable): N/A

Steps to reproduce issue

This is probably not an easy repro, I'm not even sure what causes it myself. Right now I'm looking for information on how to get more verbosity out of the bridge so I can figure out what is hanging and why, because the current failure mode is entirely silent.

We are using a version of the ros1_bridge with our messages built in to it. We haven't modified the bridge itself, just linked our messages in to it. You can find a .tar'd install space of it here: https://bintray.com/ihmcrobotics/distributions/ihmc-ros1-bridge

We are able to use the bridge to successfully send and receive topics to and from ROS 1 based software (ROS 1 Kinetic installed from apt-get on Ubuntu 16.04) but occasionally the bridge just stops working. It will no longer create console output when a new talker or listener attempts to participate. The bridge also stops responding to INT signals. The only way to terminate the bridge is to send an SIGKILL. Even more interestingly, once you have sent SIGKILL to the bridge and stopped the ROS 2 daemon with ros2 daemon stop, then restarted the bridge, it continues to behave in this way. Bridges for new talkers/listeners do not get created and signals are not handled.

I've tried SIGKILL'ing the bridge then restarting it with --show-introspection and it prints a few messages about bridging some running ROS 2 publishers that we have but after just a few seconds it stops printing any introspections as well.

I'd love to be able to provide more useful information I just don't know where to start since the failure mode is entirely silent. If I can rebuild the bridge with some debug flags or something to get more verbose logging I'd love to start there and hopefully get this figured out.

Expected behavior

ROS 1 bridge is able to bridge ROS 2 and ROS 1 talkers/listeners

Actual behavior

ROS 1 bridge stops working and cannot be restarted

calvertdw commented 6 years ago

Did you call ros2 daemon start before restarting the bridge?

calvertdw commented 6 years ago

This could be related to Fast-RTPS configuration on the other end, as it is an IHMC implementation of a ROS 2 node, not using rmw.

calvertdw commented 6 years ago

From https://github.com/ros2/ros2/wiki/Linux-Install-Debians:

ros-$ROS_DISTRO-*-dbgsym. These packages provide the debugging symbols stripped from binaries.

dljsjr commented 6 years ago

@calvertdw I believe the daemon restarts itself the first time you launch a ROS 2 process, I've never had to manually start it any other time.

I also don't believe it's an issue with our stuff because our comms remain alive (you can still talk to our stuff via ROS 2 directly). You can publish and subscribe to our topics without issue.

Debug symbols for the bridge won't be valid because we've recompiled the bridge. I can rebuild ours with debug flags and use GDB to start a C++ debugger session but I don't know what I'm looking for, that's what I'm hoping to figure out.

dirk-thomas commented 6 years ago

Can you try to reproduce the problem while running a debug build of the bridge in gdb? Once it "stops" working maybe the stacktrace will provide enough context.

dljsjr commented 6 years ago

So upon continued investigation I'm actually starting to believe that this is related to an interaction between the bridge and the ROS 1 roscore and not something dying in the bridge itself.

@dirk-thomas I can work on getting something like that set up but it might be a day or two because the current place we're investigating this is in "production" on a robot so I'll have to make a test setup when I get back to Florida (I'm in Houston at Johnson working on Valkyrie right now).

For a bit of context on the setup and an overview of the topology: As I mentioned above this is a deployment being tested on the NASA Valkyrie humanoid. They have two on-board computers, one for real-time feedback control and one for non-realtime perception and out of band management of the asynchronous API. The real-time machine also has a custom PCI appliance for talking to the motor amps and other embedded systems like the power management board. The real-time machine runs our control stack (which has a DDS API implemented via Fast-RTPS and implementing the ROS 2 partition/namespace standards) as well as the NASA management stack which is ROS 1 based. So roscore is on this machine.

The non-realtime machine runs the vision stuff (multisense drivers) and the ros1_bridge dynamic_bridge. We were doing some cycle testing yesterday when we were able to get the bridge in to the same state I described above, and even after doing a full power cycle of the non-realtime machine we were not able to get the bridge to restart correctly. Additionally, after about one or two minutes in this weird state, the robot power browned out without a load spike or system reboot of the real-time box, meaning the ROS 1 based management stack on the real-time computer had seized for long enough that a bunch of heartbeats got missed.

I think it'll be easy enough to attach GDB to the bridge but I'm not so sure that the indicative information itself will be in the bridge (though it might at a minimum give us a stack trace showing us what interaction with ROS 1 is hanging and why?), which is why I'll have to try and reproduce this off the robot because if we attach GDB to any of the processes on the real-time machine we'll probably miss deadlines and the whole thing won't be able to run anyway.

dljsjr commented 6 years ago

I also need access to a build machine back in Pensacola to make a debug build of the bridge because our message package OOM's the bridge compile process on machines that don't have > 16GB of RAM even when building single job thread and my laptop can't cut it.

dirk-thomas commented 6 years ago

If you want to try it early on the robot you could comment in some of the print statements which output various messages based on "progress" / "activity" in the bridge (sorry, that was written before log macros were available). Maybe that will also provide some information on what the bridge is doing when it is "hanging". It might even be enough to rule out a problem in the bridge if that is the case.

our message package OOM's the bridge compile process on machines that don't have > 16GB of RAM even when building single job thread

Wow, that is pretty extreme. We are aware that the bridge needs quite some memory to build due to the template specializations but I have only seen 2-4 GB per thread. Independent of the problem in this ticket if you could provide an example with a similar memory usage in a separate ticket it would be good to look into it and see what can be done to lower the resource usage.

dljsjr commented 6 years ago

@dirk-thomas to be fair it's probably our fault for having 162 messages in a single package… we're going to break that up in our next release cycle :)

dirk-thomas commented 6 years ago

to be fair it's probably our fault for having 162 messages in a single package

Oh, I see. I was worried that it was due to some deep nesting or similar. I can totally see how 162 msgs in a single package could get you there :wink:

dljsjr commented 6 years ago

@dirk-thomas I'm still working on creating a debug build of the bridge (the debug symbols take up even more memory and I don't have a machine that can build it… working on getting our packages sorted out first), but something of note is that ever since we started using --bridge-all-topics pursuant to the "issue" I was having in #130 we have not been able to reproduce any crashes or hard hangs.

dljsjr commented 6 years ago

@dirk-thomas I was able to create a bridge with debug symbols but similar to #131 I can't reproduce this using simulations. I'll see if we can recreate it on the real robot and get you some information but it's going to be tricky to get you an example you can run.

dirk-thomas commented 5 years ago

I will go ahead and close this for now since we can't do anything without further information. Please feel free to comment on the closed ticket with the requested information and the issue can be reopened if necessary.

ros2 / ros1_bridge