wasmerio / wasmer-java

☕ WebAssembly runtime for Java
https://medium.com/wasmer/announcing-the-first-java-library-to-run-webassembly-wasmer-jni-89e319d2ac7c
MIT License

Asynchronous SEGV when running in Flink job #42

Closed: jcaesar closed this issue 3 years ago

jcaesar commented 4 years ago

Description

When using a wasmer instance created inside a Flink job running in Docker, the taskmanagers (the processes that execute the stream processing) die with a SEGV shortly after calling a wasm function.

Steps to reproduce

This requires a bit of setup, so I've created a docker-compose based test case: https://github.com/jcaesar/wasmer-in-flink-segv

You should be able to run it with:

docker-compose -p asdf kill \
&& docker-compose -p asdf rm -vf \
&& docker-compose -p asdf up  --build --abort-on-container-exit

Actual behavior

taskmanager1_1  | *** Aborted
taskmanager1_1  | Register dump:
taskmanager1_1  | 
taskmanager1_1  |  RAX: 0000000000000000   RBX: 0000000000000006   RCX: 00007f86f91367bb
taskmanager1_1  |  RDX: 0000000000000000   RSI: 00007f8660967f10   RDI: 0000000000000002
taskmanager1_1  |  RBP: 00007f86609681e0   R8 : 0000000000000000   R9 : 00007f8660967f10
taskmanager1_1  |  R10: 0000000000000008   R11: 0000000000000246   R12: 00007f86609689b0
taskmanager1_1  |  R13: 00007f865005f070   R14: 000000000000000b   R15: 00007f85fc00e610
taskmanager1_1  |  RSP: 00007f8660967f10
taskmanager1_1  | 
taskmanager1_1  |  RIP: 00007f86f91367bb   EFLAGS: 00000246
taskmanager1_1  | 
taskmanager1_1  |  CS: 0033   FS: 0000   GS: 0000
taskmanager1_1  | 
taskmanager1_1  |  Trap: 0000000e   Error: 00000007   OldMask: 00000404   CR2: f94ffe80
taskmanager1_1  | 
taskmanager1_1  |  FPUCW: 0000037f   FPUSW: 00000000   TAG: 00000000
taskmanager1_1  |  RIP: 00000000   RDP: 00000000
taskmanager1_1  | 
taskmanager1_1  |  ST(0) 0000 0000000000000000   ST(1) 0000 0000000000000000
taskmanager1_1  |  ST(2) 0000 0000000000000000   ST(3) 0000 0000000000000000
taskmanager1_1  |  ST(4) 0000 0000000000000000   ST(5) 0000 0000000000000000
taskmanager1_1  |  ST(6) 0000 0000000000000000   ST(7) 0000 0000000000000000
taskmanager1_1  |  mxcsr: 1f80
taskmanager1_1  |  XMM0:  000000000000000000000000ffffffff XMM1:  000000000000000000000000ffffffff
taskmanager1_1  |  XMM2:  000000000000000000000000ffffffff XMM3:  000000000000000000000000ffffffff
taskmanager1_1  |  XMM4:  000000000000000000000000ffffffff XMM5:  000000000000000000000000ffffffff
taskmanager1_1  |  XMM6:  000000000000000000000000ffffffff XMM7:  000000000000000000000000ffffffff
taskmanager1_1  |  XMM8:  000000000000000000000000ffffffff XMM9:  000000000000000000000000ffffffff
taskmanager1_1  |  XMM10: 000000000000000000000000ffffffff XMM11: 000000000000000000000000ffffffff
taskmanager1_1  |  XMM12: 000000000000000000000000ffffffff XMM13: 000000000000000000000000ffffffff
taskmanager1_1  |  XMM14: 000000000000000000000000ffffffff XMM15: 000000000000000000000000ffffffff
taskmanager1_1  | 
taskmanager1_1  | Backtrace:
taskmanager1_1  | /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x10b)[0x7f86f91367bb]
taskmanager1_1  | /lib/x86_64-linux-gnu/libc.so.6(abort+0x121)[0x7f86f9121535]
taskmanager1_1  | /tmp/wasmer_jni7073396356486144275.lib(+0x250f07)[0x7f85d6d3cf07]
taskmanager1_1  | /tmp/wasmer_jni7073396356486144275.lib(+0x24bb46)[0x7f85d6d37b46]
taskmanager1_1  | /tmp/wasmer_jni7073396356486144275.lib(+0x9fad0)[0x7f85d6b8bad0]
taskmanager1_1  | /tmp/wasmer_jni7073396356486144275.lib(+0x9f156)[0x7f85d6b8b156]
taskmanager1_1  | /lib/x86_64-linux-gnu/libpthread.so.0(+0x12730)[0x7f86f94ee730]
taskmanager1_1  | /usr/local/openjdk-8/lib/amd64/server/libjvm.so(+0x6647a0)[0x7f86f87f47a0]
taskmanager1_1  | /usr/local/openjdk-8/lib/amd64/server/libjvm.so(+0x665744)[0x7f86f87f5744]
taskmanager1_1  | /usr/local/openjdk-8/lib/amd64/server/libjvm.so(JVM_DoPrivileged+0x2c6)[0x7f86f8855e96]
taskmanager1_1  | [0x7f86e11d0af5]

Hywan commented 3 years ago

Sorry for the late reply, I missed the notification. I'm going to take a look at it.

jcaesar commented 3 years ago

Ah, yay! I was afraid this repository was dead from the start...

It seems this problem can even be triggered before starting the Flink job (by e.g. moving the wasmer calls to the beginning of main). Any chance it's related to the horribly outdated version of Java?

[Edit:] At least switching the image to flink:1.11.2-scala_2.11-java11 doesn't help. I guess I would really need a build with debug symbols to find out what's going on.

jcaesar commented 3 years ago

I finally wanted to understand what's going on a bit better, so I made myself a debug build of libwasmer_jni.so to resolve the symbols in the backtrace. They came down to:

/rustc/7eac88abb2e57e752f3302f02be5f3ce3d7adfb4//library/std/src/sys/unix/mod.rs:231
/rustc/7eac88abb2e57e752f3302f02be5f3ce3d7adfb4//library/std/src/process.rs:1773
/usr/local/cargo/registry/src/github.com-1ecc6299db9ec823/wasmer-clif-backend-0.17.0/src/signal/unix.rs:147
/usr/local/cargo/registry/src/github.com-1ecc6299db9ec823/wasmer-clif-backend-0.17.0/src/signal/unix.rs:30

This just seems to be an ordinary signal handler, so I guess no relevant wasmer code is running at the time the SEGV happens. Phew, looks like whoever wants to solve this is in for some rather fun debugging.

[Edit:] Oh, I guess the next step is to do as this comment asks.

jcaesar commented 3 years ago

So, I tried disabling the signal handlers to get to see the "real segv". Turns out that fixed the problem... Not sure what to make of that.

Incidentally, wasmer 1.0 doesn't have the signal handlers anymore. Wonder how this plays out when upgrading versions...

jcaesar commented 3 years ago

Hm, it doesn't seem like migrating wasmer-java to 1.0 will be easy. Memory wants to own the Memory, but Exports doesn't hand that over. Guess I'll stop here, before I vanish down the rabbit hole.

Y'all got this planned out?

syrusakbary commented 3 years ago

@jcaesar Memory is clonable in Wasmer 1.0, so it should be doable I believe!

jcaesar commented 3 years ago

Oh, I would have assumed that would create a new copy of the actual memory area. Guess I haven't really understood the semantics. Nevertheless, I was able to construct a patch that makes wasmer-java use a current-ish wasmer master. The problem I report here disappears.

I'd open a PR, but some of the gradle tests fail. Not sure if I'll work on that. A serious adaptation of wasmer 1.0.0 would have to change the Java API anyway, I guess?
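
On the clone question above: the sketch below illustrates the difference between cloning a handle and copying the bytes behind it. It assumes, as the reply from syrusakbary suggests, that cloning a Memory yields another handle to the same linear memory rather than a fresh copy. `SharedMemory` is a hypothetical stand-in built on `Arc`; it is not wasmer's actual `Memory` type or API.

```rust
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in for a handle-style memory type.
/// This is NOT wasmer's real `Memory`; it only models the clone semantics.
#[derive(Clone)]
struct SharedMemory {
    // Cloning the struct clones the Arc (a reference-count bump),
    // not the bytes stored behind it.
    bytes: Arc<Mutex<Vec<u8>>>,
}

impl SharedMemory {
    /// Allocate `size` zeroed bytes behind a shared handle.
    fn new(size: usize) -> Self {
        SharedMemory {
            bytes: Arc::new(Mutex::new(vec![0u8; size])),
        }
    }

    /// Write one byte at `offset` (panics if out of bounds).
    fn write(&self, offset: usize, value: u8) {
        self.bytes.lock().unwrap()[offset] = value;
    }

    /// Read one byte at `offset` (panics if out of bounds).
    fn read(&self, offset: usize) -> u8 {
        self.bytes.lock().unwrap()[offset]
    }
}

fn main() {
    // The handle that an exports table would keep.
    let exported = SharedMemory::new(16);
    // The handle a wrapper object could own after cloning.
    let owned_by_wrapper = exported.clone();

    // A write through one handle is visible through the other.
    owned_by_wrapper.write(3, 42);
    assert_eq!(exported.read(3), 42);

    // Both handles point at the same allocation.
    assert!(Arc::ptr_eq(&exported.bytes, &owned_by_wrapper.bytes));
    println!("both handles see byte 3 = {}", exported.read(3));
}
```

Under that assumption, a wrapper that "wants to own the Memory" can own a clone while the exports keep their own handle, and both still refer to the same memory area.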

jcaesar commented 3 years ago

Fixed in 0.3.0. (Though testing that out was a bit annoying because 0.3.0 is somehow not properly released. empty folder? wat?)