Open mferenduros-ocado opened 1 month ago
I wholeheartedly agree that it is annoying when Magick logs crashes when other things crash. Its a known quirk that developers have learned to ignore and look elsewhere for the actual cause of the crash. However if there's a way to prevent that from happening without making the situation less clear to developers, that would be lovely.
Do you see any clear issues with your suggestion, such as when a signal from ROS launch is being communicated to the map server or seg faults in that process not being reported appropriately (should be SIGINT/SIGTERM so that should be fine)? Have you tested that change to resolve the issue? Any other signals we should redirect (perhaps SIGILL)?
I see 2 places in mapIO that we set the Magick initialization and they can both be called repeatedly, so I think we're safe from that perspective.
:+1:
I haven't tested a fix, but would be happy to test and submit a PR if you're ok with the general approach. It is theoretically racy - if another thread was getting and setting signal handlers at exactly the wrong moment we'd risk overwriting their handlers, but that applies just as much to what magick is currently doing anyway. And in practice no-one else is calling signal() that I can see.
Conceptually I think it's questionable for a component to set any signal handlers, but SIGSEGV is the only one causing genuine annoyance so maybe let's limit it to that one to begin with?
agreed.
auto prev_segv_handler = signal(SIGSEGV, SIG_DFL);
Also with components, would this impact the other components in the same process? My understanding is that signal
is global to a process, not a particular thread.
This leads me to believe there really isn't a solution - other than upfront documentation - to getting the magick message on composed process segfaults. I think that's just a quirk we have to live with perhaps
Bug report
Required Info:
Steps to reproduce issue
Run nav2 with composition enabled Trigger a segfault in any nav2 component (eg. write though a null pointer in a planner or an MPPI critic)
Expected behavior
The component container process segfaults
Actual behavior
An error message that suggests the problem relates to GraphicsMagick, eg
[component_container_isolated-26] Magick: abort due to signal 11 (SIGSEGV) "Segmentation Fault"...
This message has appears to have led several people (including myself) to bad assumptions about what's caused the segfault.
IMO the preferred behaviour would be to leave the signal uncaught
so we can use (eg) core-dumps to diagnose crashesedit: Magick calls abort() so we do get core-dumps.Additional information
Magick::InitializeMagick
registers process-wide handlers for a bunch of signals, including SIGSEGV, only skipping those signals that already have handlers registered. A possible (albeit racy) fix might be something like