Closed tpitale closed 6 months ago
Going to try to get an example up to show the issue.
I'm using erlkaf currently with OTP 26.0.2 without any problems. I have no experience with elixir releases but if you have an example that I can use to replicate I can try my best.
If I dump the procinfo, it appears stuck in erl_init
:
=proc:<0.0.0>
State: Waiting
Name: init
Spawned as: erl_init:start/2
Spawned by: []
Message queue length: 0
Number of heap fragments: 0
Heap fragment data: 0
Link list: [<0.9.0>, <0.664.0>, <0.662.0>, <0.10.0>, {from,<0.675.0>,#Ref<0.3728866251.4178575368.258785>}]
Reductions: 10112
Stack+heap: 987
OldHeap: 0
Heap unused: 291
OldHeap unused: 0
BinVHeap: 0
OldBinVHeap: 0
BinVHeap unused: 46422
OldBinVHeap unused: 46422
Memory: 8888
Stack dump:
Program counter: 0x0000ffff521474d4 (init:boot_loop/2 + 84)
arity = 0
y(0) []
y(1) {state,[{root,[<<"/app/_build/prod/rel/erlkaf_example">>]},{bindir,[<<"/app/_build/prod/rel/erlkaf_example/erts-14.2.2/bin">>]},{progname,[<<"erl">>]},{home,[<<"/root">>]},{noshell,[]},{elixir,[<<"ansi_enabled">>,<<"true">>]},{user,[<<"elixir">>]},{mode,[<<"embedded">>]},{setcookie,[<<"COOKIE">>]},{sname,[<<"erlkaf_example">>]},{config,[<<"/app/_build/prod/rel/erlkaf_example/releases/0.1.0/sys">>]},{boot,[<<"/app/_build/prod/rel/erlkaf_example/releases/0.1.0/start">>]},{boot_var,[<<"RELEASE_LIB">>,<<"/app/_build/prod/rel/erlkaf_example/lib">>]}],[<<"--no-halt">>,<<"+iex">>],[[elixir,start_iex]],[{application_controller,<0.664.0>},{logger,<0.662.0>},{erl_prim_loader,<0.10.0>}],<0.9.0>,{starting,applications_loaded},{"erlkaf_example","0.1.0"},[],[],#{},{"/app","/app/_build/prod/rel/erlkaf_example/releases/0.1.0/start.boot"}}
y(2) <0.9.0>
Yeah, we'll get an example up.
And this is the stack dump:
Stack dump:
Program counter: 0x0000ffff5269ccfc (io:execute_request/3 + 364)
arity = 0
y(0) #Ref<0.3728866251.4178575368.258785>
y(1) #Ref<0.3728866251.4178575368.258784>
y(2) {false,{put_chars,unicode,io_lib,format,[<<"Loading library: ~p ~n">>,["/app/_build/prod/rel/erlkaf_example/lib/erlkaf-2.1.4/priv/erlkaf_nif"]]}}
y(3) <0.0.0>
0x0000fffeca845800 Return addr 0x0000ffff5269bb3c (io:format/2 + 212)
y(0) [<<"Loading library: ~p ~n">>,["/app/_build/prod/rel/erlkaf_example/lib/erlkaf-2.1.4/priv/erlkaf_nif"]]
y(1) []
y(2) #Ref<0.3728866251.4178575368.258784>
0x0000fffeca845820 Return addr 0x0000ffff5300bd58 (erlkaf_nif:load_nif/0 + 152)
y(0) "/app/_build/prod/rel/erlkaf_example/lib/erlkaf-2.1.4/priv/erlkaf_nif"
0x0000fffeca845830 Return addr 0x0000ffff52152098 (init:'-run_on_load_handler/2-fun-0-'/1 + 96)
0x0000fffeca845838 Return addr 0x0000ffff521413c8 (<terminate process normally>)
Internal State: ACT_PRIO_NORMAL | USR_PRIO_NORMAL | PRQ_PRIO_NORMAL
=port:#Port<0.0>
State: CONNECTED
Slot: 0
Connected: <0.0.0>
Port controls forker process: forker
Input: 0
Output: 0
Queue: 0
=port:#Port<0.1>
State: CONNECTED|BINARY_IO
Slot: 8
Connected: <0.672.0>
Links: <0.672.0>
Port is UNIX fd not opened by emulator: 2/2
Input: 0
Output: 0
Queue: 0
The path is correct for the .so file. I'm don't believe the mix release
does anything to that file beyond moving it into place. Going to try to compare the .so from the compile step to the released version.
root@097ff5af7551:/app# sha256sum _build/prod/lib/erlkaf/priv/erlkaf_nif.so
d25f7997e76366000900d5581287ca928c5800410983df5862de27e979122ee2 _build/prod/lib/erlkaf/priv/erlkaf_nif.so
root@097ff5af7551:/app# sha256sum _build/prod/rel/erlkaf_example/lib/erlkaf-2.1.4/priv/erlkaf_nif.so
d25f7997e76366000900d5581287ca928c5800410983df5862de27e979122ee2 _build/prod/rel/erlkaf_example/lib/erlkaf-2.1.4/priv/erlkaf_nif.so
Matches 🤔
When I run from the compiled/non-released erlkaf has no issue loading the nif. And since they match … this really seems like an elixir release issue.
I'll open something there.
can you provide me a small example I can run and the steps to run it (I have no elixir knowledge and will save me some time to dig on how to run an elixir release).
@silviucpp Here's one that we tried to strip down: https://github.com/dmccown/erlkaf_example
José was able to identify the erlang issue. Does not appear to be erlkaf or elixir. We will open an issue there once I create a rebar3 project example.
@silviucpp Looks like you've maybe dealt with this before: https://github.com/silviucpp/elibphonenumber/commit/b62b0cf987fa84cbb788269fdb1219bd15f598f9
Confirmed locally if we remove the io:format
here: https://github.com/silviucpp/erlkaf/blob/41d22f7ab9aeba8557e19dc12e0be1847f75d604/src/erlkaf_nif.erl#L30, the application runs as expected.
Would you be okay with a PR to remove that or another approach maybe?
:)) That's very strange. I don't get why is not replicating for me as well. but send a pull request and I will merge . I have no clue if it's documented in erlang that this is forbidden in that function or it's a bug ..
@tpitale please open a pull request.
Will do, Monday morning. Thanks!
When we start our application, it get's "stuck" or "freezes" before doing anything that we can observe. It doesn't crash, it doesn't time out, it never seems to start our application code.
Not sure of a better way to describe what's happening. An application (happens to be elixir) that was working on OTP 24 (haven't tried 25, yet) no longer works when we try to start the release on OTP 26.
I've currently got an application built that only uses erlkaf, and does nothing else but try to load. I've gone so far as to try to modify the release's
start_clean.script
file to try to find out what is happening, or narrow it down.Before running
mix release
, we can run aniex
console with erlkaf operating correctly. We've compiled with theprod
environment.We're just trying to narrow it down at this point. If you have any ideas what it could be, or perhaps have a simple erlkaf erlang example we can try (since maybe it's an elixir issue in the release?). Perhaps you've encountered something like this before.
We'll keep digging!