silviucpp / erlkaf

Erlang kafka driver based on librdkafka
MIT License
84 stars 41 forks source link

Release on OTP 26 "freezing" on startup #70

Closed tpitale closed 6 months ago

tpitale commented 7 months ago

When we start our application, it get's "stuck" or "freezes" before doing anything that we can observe. It doesn't crash, it doesn't time out, it never seems to start our application code.

Not sure of a better way to describe what's happening. An application (happens to be elixir) that was working on OTP 24 (haven't tried 25, yet) no longer works when we try to start the release on OTP 26.

I've currently got an application built that only uses erlkaf, and does nothing else but try to load. I've gone so far as to try to modify the release's start_clean.script file to try to find out what is happening, or narrow it down.

Before running mix release, we can run an iex console with erlkaf operating correctly. We've compiled with the prod environment.

We're just trying to narrow it down at this point. If you have any ideas what it could be, or perhaps have a simple erlkaf erlang example we can try (since maybe it's an elixir issue in the release?). Perhaps you've encountered something like this before.

We'll keep digging!

tpitale commented 7 months ago

Going to try to get an example up to show the issue.

silviucpp commented 7 months ago

I'm using erlkaf currently with OTP 26.0.2 without any problems. I have no experience with elixir releases but if you have an example that I can use to replicate I can try my best.

tpitale commented 7 months ago

If I dump the procinfo, it appears stuck in erl_init:

=proc:<0.0.0>
State: Waiting
Name: init
Spawned as: erl_init:start/2
Spawned by: []
Message queue length: 0
Number of heap fragments: 0
Heap fragment data: 0
Link list: [<0.9.0>, <0.664.0>, <0.662.0>, <0.10.0>, {from,<0.675.0>,#Ref<0.3728866251.4178575368.258785>}]
Reductions: 10112
Stack+heap: 987
OldHeap: 0
Heap unused: 291
OldHeap unused: 0
BinVHeap: 0
OldBinVHeap: 0
BinVHeap unused: 46422
OldBinVHeap unused: 46422
Memory: 8888
Stack dump:
Program counter: 0x0000ffff521474d4 (init:boot_loop/2 + 84)
arity = 0
y(0)     []
y(1)     {state,[{root,[<<"/app/_build/prod/rel/erlkaf_example">>]},{bindir,[<<"/app/_build/prod/rel/erlkaf_example/erts-14.2.2/bin">>]},{progname,[<<"erl">>]},{home,[<<"/root">>]},{noshell,[]},{elixir,[<<"ansi_enabled">>,<<"true">>]},{user,[<<"elixir">>]},{mode,[<<"embedded">>]},{setcookie,[<<"COOKIE">>]},{sname,[<<"erlkaf_example">>]},{config,[<<"/app/_build/prod/rel/erlkaf_example/releases/0.1.0/sys">>]},{boot,[<<"/app/_build/prod/rel/erlkaf_example/releases/0.1.0/start">>]},{boot_var,[<<"RELEASE_LIB">>,<<"/app/_build/prod/rel/erlkaf_example/lib">>]}],[<<"--no-halt">>,<<"+iex">>],[[elixir,start_iex]],[{application_controller,<0.664.0>},{logger,<0.662.0>},{erl_prim_loader,<0.10.0>}],<0.9.0>,{starting,applications_loaded},{"erlkaf_example","0.1.0"},[],[],#{},{"/app","/app/_build/prod/rel/erlkaf_example/releases/0.1.0/start.boot"}}
y(2)     <0.9.0>
tpitale commented 7 months ago

Yeah, we'll get an example up.

tpitale commented 7 months ago

And this is the stack dump:

Stack dump:
Program counter: 0x0000ffff5269ccfc (io:execute_request/3 + 364)
arity = 0
y(0)     #Ref<0.3728866251.4178575368.258785>
y(1)     #Ref<0.3728866251.4178575368.258784>
y(2)     {false,{put_chars,unicode,io_lib,format,[<<"Loading library: ~p ~n">>,["/app/_build/prod/rel/erlkaf_example/lib/erlkaf-2.1.4/priv/erlkaf_nif"]]}}
y(3)     <0.0.0>

0x0000fffeca845800 Return addr 0x0000ffff5269bb3c (io:format/2 + 212)
y(0)     [<<"Loading library: ~p ~n">>,["/app/_build/prod/rel/erlkaf_example/lib/erlkaf-2.1.4/priv/erlkaf_nif"]]
y(1)     []
y(2)     #Ref<0.3728866251.4178575368.258784>

0x0000fffeca845820 Return addr 0x0000ffff5300bd58 (erlkaf_nif:load_nif/0 + 152)
y(0)     "/app/_build/prod/rel/erlkaf_example/lib/erlkaf-2.1.4/priv/erlkaf_nif"

0x0000fffeca845830 Return addr 0x0000ffff52152098 (init:'-run_on_load_handler/2-fun-0-'/1 + 96)

0x0000fffeca845838 Return addr 0x0000ffff521413c8 (<terminate process normally>)
Internal State: ACT_PRIO_NORMAL | USR_PRIO_NORMAL | PRQ_PRIO_NORMAL
=port:#Port<0.0>
State: CONNECTED
Slot: 0
Connected: <0.0.0>
Port controls forker process: forker
Input: 0
Output: 0
Queue: 0
=port:#Port<0.1>
State: CONNECTED|BINARY_IO
Slot: 8
Connected: <0.672.0>
Links: <0.672.0>
Port is UNIX fd not opened by emulator: 2/2
Input: 0
Output: 0
Queue: 0
tpitale commented 7 months ago

The path is correct for the .so file. I'm don't believe the mix release does anything to that file beyond moving it into place. Going to try to compare the .so from the compile step to the released version.

tpitale commented 7 months ago
root@097ff5af7551:/app# sha256sum _build/prod/lib/erlkaf/priv/erlkaf_nif.so
d25f7997e76366000900d5581287ca928c5800410983df5862de27e979122ee2  _build/prod/lib/erlkaf/priv/erlkaf_nif.so
root@097ff5af7551:/app# sha256sum _build/prod/rel/erlkaf_example/lib/erlkaf-2.1.4/priv/erlkaf_nif.so
d25f7997e76366000900d5581287ca928c5800410983df5862de27e979122ee2  _build/prod/rel/erlkaf_example/lib/erlkaf-2.1.4/priv/erlkaf_nif.so

Matches 🤔

tpitale commented 7 months ago

When I run from the compiled/non-released erlkaf has no issue loading the nif. And since they match … this really seems like an elixir release issue.

I'll open something there.

silviucpp commented 7 months ago

can you provide me a small example I can run and the steps to run it (I have no elixir knowledge and will save me some time to dig on how to run an elixir release).

tpitale commented 7 months ago

@silviucpp Here's one that we tried to strip down: https://github.com/dmccown/erlkaf_example

tpitale commented 7 months ago

José was able to identify the erlang issue. Does not appear to be erlkaf or elixir. We will open an issue there once I create a rebar3 project example.

tpitale commented 7 months ago

@silviucpp Looks like you've maybe dealt with this before: https://github.com/silviucpp/elibphonenumber/commit/b62b0cf987fa84cbb788269fdb1219bd15f598f9

tpitale commented 7 months ago

Confirmed locally if we remove the io:format here: https://github.com/silviucpp/erlkaf/blob/41d22f7ab9aeba8557e19dc12e0be1847f75d604/src/erlkaf_nif.erl#L30, the application runs as expected.

Would you be okay with a PR to remove that or another approach maybe?

silviucpp commented 7 months ago

:)) That's very strange. I don't get why is not replicating for me as well. but send a pull request and I will merge . I have no clue if it's documented in erlang that this is forbidden in that function or it's a bug ..

silviucpp commented 7 months ago

@tpitale please open a pull request.

tpitale commented 7 months ago

Will do, Monday morning. Thanks!

tpitale commented 6 months ago

PR: https://github.com/silviucpp/erlkaf/pull/71