ros-navigation / navigation2

ROS 2 Navigation Framework and System
https://nav2.org/
Other
2.62k stars 1.31k forks source link

nav2_map_server should not catch SIGSEGV, nor report it as "Magick: ..." #4703

Open mferenduros-ocado opened 1 month ago

mferenduros-ocado commented 1 month ago

Bug report

Required Info:

Steps to reproduce issue

Run nav2 with composition enabled Trigger a segfault in any nav2 component (eg. write though a null pointer in a planner or an MPPI critic)

Expected behavior

The component container process segfaults

Actual behavior

An error message that suggests the problem relates to GraphicsMagick, eg

[component_container_isolated-26] Magick: abort due to signal 11 (SIGSEGV) "Segmentation Fault"...

This message has appears to have led several people (including myself) to bad assumptions about what's caused the segfault.

IMO the preferred behaviour would be to leave the signal uncaught so we can use (eg) core-dumps to diagnose crashes edit: Magick calls abort() so we do get core-dumps.

Additional information

Magick::InitializeMagick registers process-wide handlers for a bunch of signals, including SIGSEGV, only skipping those signals that already have handlers registered. A possible (albeit racy) fix might be something like

auto prev_segv_handler = signal(SIGSEGV, SIG_DFL);
InitializeMagick();
signal(SIGSEGV, prev_segv_handler);
SteveMacenski commented 1 month ago

I wholeheartedly agree that it is annoying when Magick logs crashes when other things crash. Its a known quirk that developers have learned to ignore and look elsewhere for the actual cause of the crash. However if there's a way to prevent that from happening without making the situation less clear to developers, that would be lovely.

Do you see any clear issues with your suggestion, such as when a signal from ROS launch is being communicated to the map server or seg faults in that process not being reported appropriately (should be SIGINT/SIGTERM so that should be fine)? Have you tested that change to resolve the issue? Any other signals we should redirect (perhaps SIGILL)?

I see 2 places in mapIO that we set the Magick initialization and they can both be called repeatedly, so I think we're safe from that perspective.

mferenduros-ocado commented 1 month ago

:+1:

I haven't tested a fix, but would be happy to test and submit a PR if you're ok with the general approach. It is theoretically racy - if another thread was getting and setting signal handlers at exactly the wrong moment we'd risk overwriting their handlers, but that applies just as much to what magick is currently doing anyway. And in practice no-one else is calling signal() that I can see.

Conceptually I think it's questionable for a component to set any signal handlers, but SIGSEGV is the only one causing genuine annoyance so maybe let's limit it to that one to begin with?

SteveMacenski commented 1 month ago

agreed.

auto prev_segv_handler = signal(SIGSEGV, SIG_DFL);

Also with components, would this impact the other components in the same process? My understanding is that signal is global to a process, not a particular thread.

This leads me to believe there really isn't a solution - other than upfront documentation - to getting the magick message on composed process segfaults. I think that's just a quirk we have to live with perhaps