Open abellina opened 4 years ago
@petro-rudenko fyi
Thanks, will check. Can you please try to run with UCX_MEM_EVENTS=no?
Seems like UCX_MEM_EVENTS=no disables memory alerts. Is that what we really want? E.g. are there cases where there is a memory event that is not valid?
UCX_MEM_EVENTS=no will disable the UCX memory hooks (interception of malloc/free calls, which in some cases speeds up memory registration). But if you register memory explicitly via context.memoryMap, it shouldn't make a big difference.
But yes, we would still need to find the cause.
Investigating the issue. The solution, for now, is to set UCX_ERROR_SIGNALS="". For some reason ucs_debug_disable_signal doesn't disable the sigaction for the UCX catcher.
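For anyone who wants to apply this workaround from a launcher rather than a shell, here is a minimal sketch. It assumes the JUCX application is started as a child process; Java has no standard API for changing its own environment, so the variable has to be in place before the JVM that loads UCX starts. `UcxLauncher` and `withoutUcxErrorSignals` are hypothetical names, not part of JUCX, and the `java -version` command is only a placeholder.

```java
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not part of JUCX or UCX.
final class UcxLauncher {
    // Name of the UCX setting discussed in this thread.
    static final String ERROR_SIGNALS = "UCX_ERROR_SIGNALS";

    /** Builds a child-process launcher whose environment sets UCX_ERROR_SIGNALS to "". */
    static ProcessBuilder withoutUcxErrorSignals(List<String> command) {
        ProcessBuilder pb = new ProcessBuilder(command);
        Map<String, String> env = pb.environment();
        // Empty value: the workaround suggested above.
        env.put(ERROR_SIGNALS, "");
        // Let the child write to this process's stdout/stderr.
        pb.inheritIO();
        return pb;
    }

    public static void main(String[] args) throws Exception {
        // Placeholder command; replace with the real JVM invocation that loads JUCX.
        Process child = withoutUcxErrorSignals(List.of("java", "-version")).start();
        System.exit(child.waitFor());
    }
}
```

Setting the variable in the shell or in the cluster's executor environment has the same effect; the sketch only makes the exact spelling of the name explicit in one place.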
@petro-rudenko are you thinking these are JVM segfaults that UCX is handling by mistake? Just trying to understand the reasoning behind disabling error signals.
What kind of side effects will this have for normal operation of UCX?
@petro-rudenko I tried setting UCX_ERROR_SIGNALS to "", but I still see the error. Just letting you know I don't think this is it.
The issue is that
Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x7f55274dc800)
==== backtrace (tid: 9182) ====
0 /tmp/jucx1810311826203486326/libucs.so(ucs_handle_error+0x124) [0x7f4b6539ff14]
1 /tmp/jucx1810311826203486326/libucs.so(+0x2833c) [0x7f4b653a033c]
2 /tmp/jucx1810311826203486326/libucs.so(+0x285b4) [0x7f4b653a05b4]
UCX shouldn't catch any signals, especially from Java threads. Do you get the same stack trace with UCX_ERROR_SIGNALS=""?
Yes you are absolutely right. Turns out I had a typo in my setting, since I saw this message (which is produced by the signal handling code).
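Since a misspelled variable is silently ignored, a small startup check can catch this kind of mistake. The sketch below is a hypothetical diagnostic, not part of JUCX or UCX. It assumes the setting arrives through the process environment, and its "looks like a typo" rule (a name that mentions both UCX and SIGNAL but is not the exact name) is only an illustrative heuristic, since the thread does not say what the typo was.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical diagnostic for illustration; not part of JUCX or UCX.
final class UcxEnvCheck {
    static final String EXPECTED = "UCX_ERROR_SIGNALS";

    /** True only if the variable exists under its exact name with an empty value. */
    static boolean errorSignalsDisabled(Map<String, String> env) {
        return "".equals(env.get(EXPECTED));
    }

    /** Names that mention UCX and SIGNAL but are not the exact expected name. */
    static List<String> suspiciousNames(Map<String, String> env) {
        List<String> out = new ArrayList<>();
        for (String name : env.keySet()) {
            String upper = name.toUpperCase(Locale.ROOT);
            if (upper.contains("UCX") && upper.contains("SIGNAL") && !name.equals(EXPECTED)) {
                out.add(name);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> env = System.getenv();
        System.out.println("UCX_ERROR_SIGNALS disabled: " + errorSignalsDisabled(env));
        System.out.println("Possibly misspelled names: " + suspiciousNames(env));
    }
}
```

Logging both results once at startup, before the UCX context is created, makes it easier to tell a failed workaround from a setting that never took effect.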
I've run for some hours today and yesterday without the startup crash. So this looks to be a good workaround @petro-rudenko. Given that java is going to use its own signals, I think signal catching by UCX is likely to cause all kinds of trouble in jni land (not unlike the spurious one I must have seen). I am surprised we don't see more caught signals during regular operation.
JUCX disables all signals at load time: https://github.com/openucx/ucx/blob/master/bindings/java/src/main/native/jucx_common_def.cc#L30
Trying to figure out why the signal handler is still UCX's.
When creating a UCX context with JUCX I see segfaults fairly regularly. I have been meaning to dig into this further, but I have not been able to. This has been a bad issue for a while, at least in 1.8.0:
Steps to Reproduce
This happens in our case ~1/20 times or so. Unfortunately it doesn't happen every time.