philss / rustler_precompiled

Use precompiled NIFs from trusted sources in your Elixir code
182 stars 26 forks source link

Exit when compiling library build with RustlerPrecompiled on OTP 23 (NIF version 2.14) #23

Closed whatyouhide closed 2 years ago

whatyouhide commented 2 years ago

It's me, again! After dealing with #21, I’m now left with a problem that seems a bit harder to debug.

I’m trying to use nimble_lz4 (the library I built with rustler_precompiled) as a dependency. I can't get it to compile in an Alpine container.

The smallest case I could reproduce is with this Dockerfile:

FROM hexpm/elixir:1.13.4-erlang-23.3.4.9-alpine-3.14.2

COPY reproduce_bug.exs reproduce_bug.exs

RUN mix do local.hex --force, local.rebar --force
RUN elixir reproduce_bug.exs

and this reproduce_bug.exs file:

Mix.install([{:nimble_lz4, "~> 0.1.2"}])

The Docker build exits with exit code 139, which is usually segmentation fault. The output of the Docker build is in the details below.

Output of the Docker build ``` #1 [internal] load build definition from Dockerfile #1 sha256:871067c4caa77dfbe0a654d2a01a9bdbe6389ee58db65b3ba8ddc23fb03086b4 #1 transferring dockerfile: 305B done #1 DONE 0.0s #2 [internal] load .dockerignore #2 sha256:5a334d37c1da119701ceb248e83bd4879f882e86589b5268143a39eca073ed5c #2 transferring context: 2B done #2 DONE 0.0s #3 [internal] load metadata for docker.io/hexpm/elixir:1.13.4-erlang-23.3.4.9-alpine-3.14.2 #3 sha256:b8e862e530da8bdbcc6bcc9c374ed15293bcad31cc3f9afb487fd342a51a2c1c #3 DONE 0.4s #4 [1/5] FROM docker.io/hexpm/elixir:1.13.4-erlang-23.3.4.9-alpine-3.14.2@sha256:df3e9bf760df89c644d28d9c91e8d25ba52ab7759b54b5cbae2301c4f17a0cb9 #4 sha256:524ba9af531aef5ed83ba2f537f6c271bc5f417402597cddfbae2ad06d01299c #4 DONE 0.0s #5 [2/5] WORKDIR /app #5 sha256:4e298d723636cf44229207b8e9ff0a96c22570110c3f318ea81e18c16c43eb2b #5 CACHED #6 [internal] load build context #6 sha256:ee20b71ca3f1b3bde3d0b1065def2f21b7a46120426d8c5ab2160076f3bf0eae #6 transferring context: 38B done #6 DONE 0.0s #7 [3/5] COPY reproduce_bug.exs reproduce_bug.exs #7 sha256:6d990d326d70d25618512e982c831aba45312e3428ccfe28715ba175c6eb66af #7 DONE 0.0s #8 [4/5] RUN mix do local.hex --force, local.rebar --force #8 sha256:04809ac8f07a861001ddb2ce81e115835ff1456736d437602c091953b22d8625 #8 1.135 * creating /root/.mix/archives/hex-1.0.1 #8 1.629 * creating /root/.mix/rebar #8 2.250 * creating /root/.mix/rebar3 #8 DONE 2.3s #9 [5/5] RUN elixir reproduce_bug.exs #9 sha256:8c1177a76ba6d217ea4d63b6d2a9df18894ba31cf9e275882cf7ee6633b16f10 #9 0.687 Resolving Hex dependencies... #9 0.785 Dependency resolution completed: #9 0.785 New: #9 0.785 castore 0.1.17 #9 0.785 jason 1.3.0 #9 0.785 nimble_lz4 0.1.2 #9 0.785 rustler 0.25.0 #9 0.785 rustler_precompiled 0.5.1 #9 0.785 toml 0.6.2 #9 0.787 * Getting nimble_lz4 (Hex package) #9 0.874 * Getting rustler (Hex package) #9 0.902 * Getting rustler_precompiled (Hex package) #9 0.980 * Getting castore (Hex package) #9 0.985 * Getting jason (Hex package) #9 0.989 * Getting toml (Hex package) #9 1.056 ==> jason #9 1.056 Compiling 10 files (.ex) #9 2.272 Generated jason app #9 2.289 ==> castore #9 2.289 Compiling 1 file (.ex) #9 2.300 Generated castore app #9 2.331 ==> toml #9 2.331 Compiling 10 files (.ex) #9 2.825 Generated toml app #9 2.849 ==> rustler #9 2.849 Compiling 7 files (.ex) #9 2.977 Generated rustler app #9 2.996 ==> rustler_precompiled #9 2.996 Compiling 4 files (.ex) #9 3.222 Generated rustler_precompiled app #9 3.243 ==> nimble_lz4 #9 3.243 Compiling 1 file (.ex) #9 3.273 #9 3.273 08:04:42.533 [debug] Downloading NIF from https://github.com/whatyouhide/nimble_lz4/releases/download/v0.1.2/libnimblelz4-v0.1.2-nif-2.15-aarch64-unknown-linux-musl.so.tar.gz #9 4.245 #9 4.245 08:04:43.506 [debug] NIF cached at /root/.cache/rustler_precompiled/precompiled_nifs/libnimblelz4-v0.1.2-nif-2.15-aarch64-unknown-linux-musl.so.tar.gz and extracted to /root/.cache/mix/installs/elixir-1.13.4-erts-11.2.2.8/7e245c35d90a0fec919d11f60ee80a7b/_build/dev/lib/nimble_lz4/priv/native/libnimblelz4-v0.1.2-nif-2.15-aarch64-unknown-linux-musl.so #9 ERROR: executor failed running [/bin/sh -c elixir reproduce_bug.exs]: exit code: 139 ------ > [5/5] RUN elixir reproduce_bug.exs: ------ executor failed running [/bin/sh -c elixir reproduce_bug.exs]: exit code: 139 andrea ~/Code/xandra (andrea/public-lz4 *+) → docker build . --no-cache -f Dockerfile --progress=plain -t tmp-test-to-remove && docker run --rm -it tmp-test-to-remove /bin/sh #1 [internal] load build definition from Dockerfile #1 sha256:1315230573096377824b74f209ac2ab9c5f62e7b29284278955bac9b3a8c2fd4 #1 transferring dockerfile: 236B 0.0s done #1 DONE 0.0s #2 [internal] load .dockerignore #2 sha256:1e5862993a999d639eafcedfa8fb4942f13918a31611bb91d365e51db2594ab9 #2 transferring context: 2B done #2 DONE 0.0s #3 [internal] load metadata for docker.io/hexpm/elixir:1.13.4-erlang-23.3.4.9-alpine-3.14.2 #3 sha256:b8e862e530da8bdbcc6bcc9c374ed15293bcad31cc3f9afb487fd342a51a2c1c #3 DONE 0.9s #8 [1/5] FROM docker.io/hexpm/elixir:1.13.4-erlang-23.3.4.9-alpine-3.14.2@sha256:df3e9bf760df89c644d28d9c91e8d25ba52ab7759b54b5cbae2301c4f17a0cb9 #8 sha256:524ba9af531aef5ed83ba2f537f6c271bc5f417402597cddfbae2ad06d01299c #8 DONE 0.0s #4 [2/5] WORKDIR /app #4 sha256:4e298d723636cf44229207b8e9ff0a96c22570110c3f318ea81e18c16c43eb2b #4 CACHED #9 [internal] load build context #9 sha256:4f0e5f38fe6ec4187f5c7d50d9e489e04a3229b0450540b0a22b550c74a6ea3c #9 transferring context: 85B 0.0s done #9 DONE 0.0s #5 [3/5] COPY reproduce_bug.exs reproduce_bug.exs #5 sha256:72a1e0ead5620ca66bcbb5d56cbc19b9905ea27db07f6d209ec6f374e15b8543 #5 DONE 0.0s #6 [4/5] RUN mix do local.hex --force, local.rebar --force #6 sha256:9036ac68e8469719963de7ce1859ad18fbd4e0caedbca18b293bf692e5ba3444 #6 1.099 * creating /root/.mix/archives/hex-1.0.1 #6 1.576 * creating /root/.mix/rebar #6 2.263 * creating /root/.mix/rebar3 #6 DONE 2.3s #7 [5/5] RUN elixir reproduce_bug.exs #7 sha256:d500fa3e278df5ea36ca6be8d8f0f536c059fb104cfc36f4f9548a712ac7d24d #7 0.910 Resolving Hex dependencies... #7 1.011 Dependency resolution completed: #7 1.012 New: #7 1.012 castore 0.1.17 #7 1.012 jason 1.3.0 #7 1.012 nimble_lz4 0.1.2 #7 1.012 rustler 0.25.0 #7 1.012 rustler_precompiled 0.5.1 #7 1.012 toml 0.6.2 #7 1.013 * Getting nimble_lz4 (Hex package) #7 1.146 * Getting rustler (Hex package) #7 1.177 * Getting rustler_precompiled (Hex package) #7 1.184 * Getting castore (Hex package) #7 1.190 * Getting jason (Hex package) #7 1.196 * Getting toml (Hex package) #7 1.276 ==> jason #7 1.276 Compiling 10 files (.ex) #7 2.433 Generated jason app #7 2.453 ==> castore #7 2.453 Compiling 1 file (.ex) #7 2.467 Generated castore app #7 2.502 ==> toml #7 2.502 Compiling 10 files (.ex) #7 3.039 Generated toml app #7 3.065 ==> rustler #7 3.065 Compiling 7 files (.ex) #7 3.191 Generated rustler app #7 3.210 ==> rustler_precompiled #7 3.210 Compiling 4 files (.ex) #7 3.436 Generated rustler_precompiled app #7 3.458 ==> nimble_lz4 #7 3.458 Compiling 1 file (.ex) #7 3.489 #7 3.489 08:07:35.243 [debug] Downloading NIF from https://github.com/whatyouhide/nimble_lz4/releases/download/v0.1.2/libnimblelz4-v0.1.2-nif-2.15-aarch64-unknown-linux-musl.so.tar.gz #7 4.231 #7 4.231 08:07:35.987 [debug] NIF cached at /root/.cache/rustler_precompiled/precompiled_nifs/libnimblelz4-v0.1.2-nif-2.15-aarch64-unknown-linux-musl.so.tar.gz and extracted to /root/.cache/mix/installs/elixir-1.13.4-erts-11.2.2.8/7e245c35d90a0fec919d11f60ee80a7b/_build/dev/lib/nimble_lz4/priv/native/libnimblelz4-v0.1.2-nif-2.15-aarch64-unknown-linux-musl.so #7 ERROR: executor failed running [/bin/sh -c elixir reproduce_bug.exs]: exit code: 139 ```

Interestingly enough, this seems to work if I bump the OTP version (using the hexpm/elixir:1.13.4-erlang-24.2-alpine-3.14.2 Docker image). Is this a bug in OTP, or is something going on with RustlerPrecompiled as far as you can tell?

philss commented 2 years ago

@whatyouhide this is unfortunate :/

I think this might be a bug with this OTP version on systems with musl - I tested with a Debian container and it works fine. Testing with other libs (html5ever and explorer) also give me a segfault :/

I will try to debug this later today.

whatyouhide commented 2 years ago

Ah @philss, that's great to know that this is not an isolated issue. Thanks for investigating 💟

philss commented 2 years ago

@whatyouhide I made some progress in the debugging, but I couldn't find the root cause yet.

What I found:

My hypothesis is that something in the build step of the Erlang image broke the load of NIFs for that OTP version. But I'm not sure if this is correct, because the build for version 24 is identical. Maybe this is related to stripped binaries.

I'm going to try building from source with that Dockerfile, but without stripping the binaries.

philss commented 2 years ago

Building from scratch does not help. It still have a segfault. But at least I could get more details about the error. Running gdb gives me this backtrace:

warning: Could not load shared library symbols for /root/.cache/mix/installs/elixir-1.13.4-erts-11.2.2.8/7e245c35d90a0fec919d11f60ee80a7b/_build/dev/lib/nimble_lz4/priv/native/libnimblelz4-v0.1.2-nif-2.15-x86_64-unknown-linux-musl.so.
Do you need "set solib-search-path" or "set sysroot"?
--Type <RET> for more, q to quit, c to continue without paging--
Core was generated by `/usr/local/lib/erlang/erts-11.2.2.8/bin/beam.smp -- -root /usr/local/lib/erlang'.
Program terminated with signal SIGSEGV, Segmentation fault.

warning: Section `.reg-xstate/94' in core file too small.
#0  0x00007f7b7718980b in strlen () from /lib/ld-musl-x86_64.so.1
[Current thread is 1 (LWP 94)]
(gdb) backtrace
#0  0x00007f7b7718980b in strlen () from /lib/ld-musl-x86_64.so.1
#1  0x000056469785044d in erts_printf_format (fn=fn@entry=0x564697852130 <erts_write_ds>, arg=arg@entry=0x7f7b36b00230,
    fmt=fmt@entry=0x5646978b0f48 "That '%T' NIF library needs %s or newer. Either try to recompile the NIF lib or use a newer erts runtime.", ap=ap@entry=0x7f7b2e40f670)
    at common/erl_printf_format.c:770
#2  0x0000564697852e17 in erts_vdsprintf (dsbufp=dsbufp@entry=0x7f7b36b00230,
    format=0x5646978b0f48 "That '%T' NIF library needs %s or newer. Either try to recompile the NIF lib or use a newer erts runtime.", format@entry=0x7f7b2e40f668 "",
    arglist=arglist@entry=0x7f7b2e40f670) at common/erl_printf.c:471
#3  0x000056469779dfb4 in load_nif_error (p=p@entry=0x7f7b30a40f98, atom=atom@entry=0x5646978b1248 <bad_lib> "bad_lib", format=0x7f7b2e40f668 "") at beam/erl_nif.c:4162
#4  0x00005646977a7ba7 in erts_load_nif (c_p=0x7f7b30a40f98, I=<optimized out>, filename=<optimized out>, args=15) at beam/erl_nif.c:4432
#5  0x00005646975f5f7d in process_main (x_reg_array=0x7f7b2fe16580, f_reg_array=0x8080808080808080) at x86_64-pc-linux-musl/opt/smp/beam_cold.h:491
#6  0x00005646976027bc in sched_thread_func (vesdp=0x7f7b2fec5280) at beam/erl_process.c:8560
#7  0x000056469784dc93 in thr_wrapper (vtwd=0x7ffdb433d790) at pthread/ethread.c:122
#8  0x00007f7b7718c1bb in ?? () from /lib/ld-musl-x86_64.so.1
#9  0x0000000000000000 in ?? ()

Perhaps the machine/container that is compiling the NIF (at GitHub Actions CI server) is generating an outdated structure, or something like that. Tomorrow I will try to test releasing the NIF with another container.

whatyouhide commented 2 years ago

That's awesome @philss, thank you so much for investigating. I will have a little bit more time over the weekend to dedicate to this, so let me know how things will be looking by then! 💟

philss commented 2 years ago

@whatyouhide I found the problem! It was a missing configuration for cross :disappointed:

I totally missed the section about passing down environment variables. In my mind it was being passed automatically from the host, but this is not true.

We need the RUSTLER_NIF_VERSION to be set in order to generate the correct lib binary. This explains why it works fine for OTP 24 and 25, because we use the latest NIF version in case that var is missing.

I'm going to update the docs and send PRs to the projects using rustler_precompiled.

whatyouhide commented 2 years ago

Aaaaah, that's fantastic! Great catch. And thanks for the PR to nimble_lz4 💟