openucx / ucx

Unified Communication X (mailing list - https://elist.ornl.gov/mailman/listinfo/ucx-group)
http://www.openucx.org

Segfault when starting the worker context in JUCX #4870

Open abellina opened 4 years ago

abellina commented 4 years ago

When creating a UCX context with JUCX I see segfaults fairly regularly. I have been meaning to dig into this further, but have not been able to. This has been a problem for a while, at least in 1.8.0:

20/03/11 15:09:29 INFO UCX: Creating UCX context.
Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x7f55274dc800)
==== backtrace (tid:   9182) ====
 0  /tmp/jucx1810311826203486326/libucs.so(ucs_handle_error+0x124) [0x7f4b6539ff14]
 1  /tmp/jucx1810311826203486326/libucs.so(+0x2833c) [0x7f4b653a033c]
 2  /tmp/jucx1810311826203486326/libucs.so(+0x285b4) [0x7f4b653a05b4]
 3  /lib/x86_64-linux-gnu/libpthread.so.0(+0x12890) [0x7f552668b890]
 4  /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/amd64/server/libjvm.so(+0x8b2403) [0x7f5525f9b403]
 5  /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/amd64/server/libjvm.so(+0x480322) [0x7f5525b69322]
 6  /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/amd64/server/libjvm.so(+0x48956b) [0x7f5525b7256b]
 7  /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/amd64/server/libjvm.so(+0xa590e3) [0x7f55261420e3]
 8  /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/amd64/server/libjvm.so(+0xa5a2a8) [0x7f55261432a8]
 9  /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/amd64/server/libjvm.so(+0x8f2d82) [0x7f5525fdbd82]
10  /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db) [0x7f55266806db]
11  /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7f5526dda88f]
=================================

Steps to Reproduce

 UcpParams contextParams = new UcpParams().requestTagFeature().requestWakeupFeature();
 new UcpContext(contextParams);

This happens in our case ~1/20 times or so. Unfortunately it doesn't happen every time.

abellina commented 4 years ago

@petro-rudenko fyi

petro-rudenko commented 4 years ago

Thanks, will check. Can you please try running with UCX_MEM_EVENTS=no?

abellina commented 4 years ago

Seems like UCX_MEM_EVENTS=no disables memory alerts. Is that what we really want? E.g. are there cases where there is a memory event that is not valid?

petro-rudenko commented 4 years ago

UCX_MEM_EVENTS=no disables the UCX memory hooks (interception of malloc/free calls, which speeds up memory registration in some cases). But if you register memory explicitly via context.memoryMap, it shouldn't make a big difference.

But yes, we still need to find the cause.

petro-rudenko commented 4 years ago

Investigating the issue. The workaround, for now, is to set UCX_ERROR_SIGNALS="". For some reason ucs_debug_disable_signal doesn't remove the sigaction for the UCX signal handler.
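As an illustration of the workaround only: UCX reads its configuration from the process environment, so the variable has to be in place before the JVM that loads JUCX starts. The sketch below uses the standard ProcessBuilder API to launch a child process with UCX_ERROR_SIGNALS set to the empty string. The class name UcxLauncher, the helper withUcxWorkaround, and the command line in main are hypothetical and are not part of JUCX.

```java
import java.util.Arrays;
import java.util.List;

class UcxLauncher {
    // Builds (but does not start) a process for the given command with
    // UCX_ERROR_SIGNALS set to the empty string, as suggested in this thread.
    static ProcessBuilder withUcxWorkaround(List<String> command) {
        ProcessBuilder pb = new ProcessBuilder(command);
        pb.environment().put("UCX_ERROR_SIGNALS", "");
        pb.inheritIO(); // forward the child's stdout/stderr to this process
        return pb;
    }

    public static void main(String[] args) {
        // Hypothetical command line; substitute the real application.
        List<String> cmd = Arrays.asList("java", "-cp", "app.jar", "com.example.Main");
        ProcessBuilder pb = withUcxWorkaround(cmd);
        // Print the value the child would see; call pb.start() to actually launch it.
        System.out.println("UCX_ERROR_SIGNALS=[" + pb.environment().get("UCX_ERROR_SIGNALS") + "]");
    }
}
```

Setting the variable in the launching shell or in the cluster manager's environment settings should have the same effect; the point is only that the value is an empty string rather than unset.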

abellina commented 4 years ago

@petro-rudenko are you thinking these are JVM segfaults that UCX is handling by mistake? Just trying to understand the reasoning behind disabling error signals.

What kind of side effects will this have for normal operation of UCX?

abellina commented 4 years ago

@petro-rudenko I tried setting UCX_ERROR_SIGNALS to "", but I still see the error. Just letting you know I don't think this is it.

petro-rudenko commented 4 years ago

The issue is that the UCX handler is the one catching the signal:

Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x7f55274dc800)
==== backtrace (tid:   9182) ====
 0  /tmp/jucx1810311826203486326/libucs.so(ucs_handle_error+0x124) [0x7f4b6539ff14]
 1  /tmp/jucx1810311826203486326/libucs.so(+0x2833c) [0x7f4b653a033c]
 2  /tmp/jucx1810311826203486326/libucs.so(+0x285b4) [0x7f4b653a05b4]

UCX shouldn't catch any signals, especially from java threads. Do you get the same stack trace with UCX_ERROR_SIGNALS=""?

abellina commented 4 years ago

Yes, you are absolutely right. It turns out I had a typo in my setting, since I saw this message (which is produced by the signal handling code).
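A mistyped variable name fails silently, so one way to guard against it is to check the environment from inside the JVM before creating the context. This is a minimal sketch using only System.getenv; the class UcxEnvCheck and its method are hypothetical helpers, not JUCX API. Note that "set to empty" and "unset" are distinguished here: the former yields "" and the latter null.

```java
import java.util.Map;

class UcxEnvCheck {
    static final String VAR = "UCX_ERROR_SIGNALS";

    // True only if the variable is present in the given environment AND empty.
    // A misspelled name leaves VAR absent, so this returns false.
    static boolean errorSignalsDisabled(Map<String, String> env) {
        String value = env.get(VAR);
        return value != null && value.isEmpty();
    }

    public static void main(String[] args) {
        boolean ok = errorSignalsDisabled(System.getenv());
        System.out.println(VAR + " is set to empty in this JVM's environment: " + ok);
    }
}
```

Logging this result next to the "Creating UCX context" message would show whether the setting actually reached the process that crashed.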

I've run for some hours today and yesterday without the startup crash, so this looks to be a good workaround @petro-rudenko. Given that Java is going to use its own signals, I think signal catching by UCX is likely to cause all kinds of trouble in JNI land (not unlike the spurious one I must have seen). I am surprised we don't see more caught signals during regular operation.

petro-rudenko commented 4 years ago

JUCX disables all signals at load time: https://github.com/openucx/ucx/blob/master/bindings/java/src/main/native/jucx_common_def.cc#L30

Trying to figure out why the UCX signal handler is still the one that ends up installed.