ninenines / gun

HTTP/1.1, HTTP/2, Websocket client (and more) for Erlang/OTP.

Connection process stopped handling requests #334

Open. IgorKoshnarov opened this issue 3 months ago

IgorKoshnarov commented 3 months ago

I use gun 2.0.0 as an HTTP client in my application and reuse opened connections to handle multiple requests.

If gun:await(Connection, StreamRef, Timeout) returns {error, Reason} for some request, I close the connection and stop using it, except in the {error, timeout} case; then I consider the connection still valid and continue using it.
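
Roughly, the request path looks like this (a simplified sketch rather than my exact code; the function name and the 5000 ms timeouts are placeholders):

do_request(ConnPid, Path, Body) ->
    StreamRef = gun:post(ConnPid, Path,
        [{<<"content-type">>, <<"application/json">>}], Body),
    case gun:await(ConnPid, StreamRef, 5000) of
        {response, fin, Status, Headers} ->
            {ok, Status, Headers, <<>>};
        {response, nofin, Status, Headers} ->
            {ok, RespBody} = gun:await_body(ConnPid, StreamRef, 5000),
            {ok, Status, Headers, RespBody};
        {error, timeout} ->
            %% The connection is considered still valid and stays in the pool.
            {error, timeout};
        {error, Reason} ->
            %% Any other error: close the connection and drop it from the pool.
            gun:close(ConnPid),
            {error, Reason}
    end.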

This works fine most of the time, but recently my application stopped handling requests, and it turned out that the gun connection processes had unhandled messages piling up in their mailboxes, like this:

rp(process_info(<0.31175.48>, messages)).
{messages,[{'$gen_cast',{request,<0.11099.53>,
                                 #Ref<0.3234177338.3490709505.118665>,<<"POST">>,
                                 <<"/v1/topic/get">>,
                                  [{<<"content-type">>,<<"application/json">>}],
                                 <<"{}">>,infinity}},
           {'$gen_cast',{request,<0.12338.53>,
                                 #Ref<0.3234177338.3095658499.71403>,<<"POST">>,
                                 <<"/v1/topic/get">>,
                                  [{<<"content-type">>,<<"application/json">>}],
                                 <<"{}">>,infinity}}]}

And every new request just adds a new message to that mailbox.

I've looked into gun's code and I can see that it should handle such messages without any problem in the 'connected' state; here is the relevant clause from gun:

connected(cast, {request, ReplyTo, StreamRef, Method, Path, Headers, Body, InitialFlow},
        State=#state{origin_host=Host, origin_port=Port,
            protocol=Protocol, protocol_state=ProtoState, cookie_store=CookieStore0,
            event_handler=EvHandler, event_handler_state=EvHandlerState0}) ->
    {Commands, CookieStore, EvHandlerState} = Protocol:request(ProtoState,
        dereference_stream_ref(StreamRef, State), ReplyTo,
        Method, Host, Port, Path, Headers, Body,
        InitialFlow, CookieStore0, EvHandler, EvHandlerState0),
    commands(Commands, State#state{cookie_store=CookieStore,
        event_handler_state=EvHandlerState});

I tried to check the status of that connection process (with a big timeout, just in case): sys:get_status(<0.31175.48>, 600000).

But it didn't work; I just found a new system message in the gun connection process mailbox: {system,{<0.22180.130>,#Ref<0.3234177338.4042522626.87030>}, get_status}

So I can't find out what state this gen_statem server is in, and I don't know what caused it to leave the 'connected' state. The process_info looks like this:

process_info(<0.31175.48>).          
[{current_function,{gen,do_call,4}},
 {initial_call,{proc_lib,init_p,5}},
 {status,waiting},
 {message_queue_len,8},
 {links,[<0.1197.0>]},
 {dictionary,[{'$initial_call',{gun,init,1}},
              {'$ancestors',[gun_conns_sup,gun_sup,<0.1195.0>]}]},
 {trap_exit,false},
 {error_handler,error_handler},
 {priority,normal},
 {group_leader,<0.1194.0>},
 {total_heap_size,8370},
 {heap_size,4185},
 {stack_size,63},
 {reductions,1402917},
 {garbage_collection,[{max_heap_size,#{error_logger => true,kill => true,size => 0}},
                      {min_bin_vheap_size,46422},
                      {min_heap_size,233},
                      {fullsweep_after,65535},
                      {minor_gcs,6}]},
 {suspending,[]}]

The connection uses the HTTP/2 protocol over TLS.

I can't figure out why this could happen. I'm sure this process was in the 'connected' state, because I stored its pid in my pool of connections when the connection was established.

essen commented 3 months ago

If this {current_function,{gen,do_call,4}} never changes, you could try tracing the process with dbg (or recon, which can disable the trace automatically after a few seconds).

If you see nothing, then I suppose the process is really stuck in gen:do_call/4. In that case it's likely to be a bug in OTP. There's been a bug or two that were fixed in patch releases more or less recently that you might be running into.
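
With recon, for example, it could look something like this from a shell on the node (a sketch, assuming recon is available; the module list and the 100-message limit are just a starting point):

Pid = pid(0, 31175, 48),
%% Trace calls to gen, gen_statem and gun made by that process only;
%% recon_trace removes the trace patterns on its own after 100 trace messages.
recon_trace:calls([{gen, '_', '_'},
                   {gen_statem, '_', '_'},
                   {gun, '_', '_'}],
                  100,
                  [{pid, Pid}]).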

IgorKoshnarov commented 3 months ago

Thanks, yes, {current_function, {gen, do_call, 4}} is always the same. I'll try dbg and report back here if something pops up.

IgorKoshnarov commented 3 months ago

I couldn't get dbg to work. I tried to erl -remsh into my app and open a trace port, but got an error:

dbg:tracer(port, dbg:trace_port(ip, 50000)).
** exception error: undefined function dbg:trace_port/2

The Erlang version on the server is 23.

I also ran rp(process_info(<0.3238.0>, backtrace)).

and got this: {backtrace,<<"Program counter: 0x00007f3542681490 (gen:do_call/4 + 280)\ny(0) []\ny(1) infinity\ny(2) []\ny(3) #Ref<0.3234177338.3271557122.27248>\ny(4) <0.21289.50>\n\n0x00007f34ab0c1630 Return addr 0x00007f34c72fb808 (gen_statem:call_dirty/4 + 104)\ny(0) infinity\ny(1) {start,infinity}\ny(2) <0.21289.50>\ny(3) Catch 0x00007f34c72fb840 (gen_statem:call_dirty/4 + 160)\n\n0x00007f34ab0c1658 Return addr 0x00007f3542a7e1c8 (ssl_gen_statem:call/2 + 72)\ny(0) Catch 0x00007f3542a7e1d8 (ssl_gen_statem:call/2 + 88)\n\n0x00007f34ab0c1668 Return addr 0x00007f3542a751d0 (ssl_gen_statem:handshake/2 + 152)\ny(0) []\ny(1) {sslsocket,{gen_tcp,#Port<0.5296215>,tls_connection,undefined},[<0.21289.50>,<0.22818.50>]}\n\n0x00007f34ab0c1680 Return addr 0x00007f3542a94790 (tls_gen_connection:start_fsm/8 + 1240)\ny(0) []\ny(1) []\ny(2) []\ny(3) []\ny(4) []\ny(5) []\ny(6) []\ny(7) []\ny(8) []\ny(9) []\ny(10) Catch 0x00007f3542a947a0 (tls_gen_connection:start_fsm/8 + 1256)\n\n0x00007f34ab0c16e0 Return addr 0x00007f3542a74fc8 (ssl_gen_statem:connect/8 + 104)\ny(0) Catch 0x00007f3542a74fd8 (ssl_gen_statem:connect/8 + 120)\n\n0x00007f34ab0c16f0 Return addr 0x00007f34c5189890 (gun:normal_tls_handshake/4 + 384)\ny(0) []\ny(1) gun_default_event_h\ny(2) []\ny(3) [http2,http]\ny(4) #{socket=>#Port<0.5296215>,timeout=>infinity,tls_opts=>[{alpn_advertised_protocols,[<<\"http/1.1\">>,<<\"h2\">>]},{client_preferred_next_protocols,{client,[<<\"http/1.1\">>,<<\"h2\">>],<<\"http/1.1\">>}},{server_name_indication,\"management.me\"},{alpn_advertised_protocols,[<<\"http/1.1\">>,<<\"h2\">>]},{client_preferred_next_protocols,{client,[<<\"http/1.1\">>,<<\"h2\">>],<<\"http/1.1\">>}}]}\ny(5) {state,<0.3237.0>,{up,#Ref<0.3234177338.2967732226.142611>},\"management.me\",443,<<\"https\">>,\"management.me\",443,[],#{connect_timeout=>3000,http_opts=>#{keepalive=>5000},retry=>3,retry_timeout=>1000},undefined,undefined,gun_tls,true,{ssl,ssl_closed,ssl_error},undefined,undefined,undefined,gun_default_event_h,undefined}\ny(6) undefined\n\n0x00007f34ab0c1730 Return addr 0x00007f34c5188518 (gun:initial_tls_handshake/3 + 416)\ny(0) []\ny(1) []\ny(2) 2\ny(3) []\ny(4) []\n\n0x00007f34ab0c1760 Return addr 0x00007f34c72fea10 (gen_statem:loop_state_callback/11 + 872)\ny(0) []\ny(1) []\ny(2) true\ny(3) false\ny(4) []\ny(5) false\ny(6) []\ny(7) {initial_tls_handshake,{state,<0.3237.0>,{up,#Ref<0.3234177338.2967732226.142611>},\"management.me\",443,<<\"https\">>,\"management.me\",443,[],#{connect_timeout=>3000,http_opts=>#{keepalive=>5000},retry=>3,retry_timeout=>1000},undefined,undefined,gun_tls,true,{ssl,ssl_closed,ssl_error},undefined,undefined,undefined,gun_default_event_h,undefined}}\ny(8) [{internal,{retries,2,#Port<0.5296215>}}]\ny(9) {state,{initial_tls_handshake,{state,<0.3237.0>,{up,#Ref<0.3234177338.2967732226.142611>},\"management.me\",443,<<\"https\">>,\"management.me\",443,[],#{connect_timeout=>3000,http_opts=>#{keepalive=>5000},retry=>3,retry_timeout=>1000},undefined,undefined,gun_tls,true,{ssl,ssl_closed,ssl_error},undefined,undefined,undefined,gun_default_event_h,undefined}},[],#{},false}\ny(10) []\ny(11) {params,state_functions,false,<0.1197.0>,[gun],<0.3238.0>,infinity}\ny(12) Catch 0x00007f34c72fea80 (gen_statem:loop_state_callback/11 + 984)\n\n0x00007f34ab0c17d0 Return addr 0x00007f34c7c83c00 (proc_lib:init_p/5 + 264)\ny(0) []\ny(1) []\ny(2) []\ny(3) Catch 0x00007f34c7c83c10 (proc_lib:init_p/5 + 280)\n\n0x00007f34ab0c17f8 Return addr 0x000000000092aa38 (<terminate process normally>)\n">>}

But this didn't give me any clue.

essen commented 3 months ago

Just do

dbg:tracer().
dbg:tpl(gen).
dbg:tpl(gen_statem).
dbg:tpl(gun).
dbg:p(<0.31175.48>, c).

IgorKoshnarov commented 3 months ago

It failed on the first command:

dbg:tracer().
** exception error: undefined function dbg:tracer/0

It seems that the dbg module is not present on the server.

code:is_loaded(dbg).
false

I tried to load it:

code:ensure_loaded(dbg).
{error,embedded}

essen commented 3 months ago

Ah, well, you are probably running a release that doesn't include runtime_tools. Make a note to always include runtime_tools in the future, to allow debugging things like this. I would also recommend adding https://github.com/zhongwencool/observer_cli (which also depends on recon).
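
With a rebar3/relx release that can look roughly like this in rebar.config (a sketch; my_app and the version are placeholders):

%% rebar.config (sketch): ship the debugging tools with the release.
{deps, [recon, observer_cli]}.

{relx, [
    {release, {my_app, "1.0.0"},
        [my_app,
         sasl,
         runtime_tools,   %% provides dbg and friends
         recon,
         observer_cli]}
]}.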

In any case if you are on 23 I would recommend upgrading. What you are seeing looks like a bug in OTP, which may have been fixed since. At the very least make sure you are on the latest 23 patch release (OTP-23.3.4.20).

Then, if you run into this again, with runtime_tools we would have an easier time confirming that the process is indeed stuck, and a report could be sent to the OTP team.

IgorKoshnarov commented 3 months ago

Ok, thanks, currently I see this version (cat /usr/lib/erlang/releases/RELEASES):

%% coding: utf-8
[{release,"Erlang/OTP","23","11.2.2.18",
  [{kernel,"7.3.1.7","/usr/lib/erlang/lib/kernel-7.3.1.7"},
   {stdlib,"3.14.2.3","/usr/lib/erlang/lib/stdlib-3.14.2.3"},
   {sasl,"4.0.2","/usr/lib/erlang/lib/sasl-4.0.2"}],
  permanent}].

essen commented 3 months ago

That's OTP-23.3.4.18.

OTP-23.3.4.19 fixes an issue that could cause processes using max_heap_size (which you might have set) to become zombies, as well as issues around process signal queues.

OTP-23.3.4.20 doesn't seem to have a relevant fix, but it doesn't hurt to use it.

IgorKoshnarov commented 3 months ago

Here is the reply I got from Fred Hebert on the Erlang Forums, based on my backtrace:

https://erlangforums.com/t/gun-gen-statem-process-queue-messages-stoped-being-handled/3413/5

Your FSM is currently stuck in a call to another FSM. In this case it appears to be related to TLS connection handling, during the handshake. The pid of that FSM is <0.21289.50>.

Since that call has an infinity timeout, your own FSM is gonna be stuck until the other one returns; if it’s hung (because it’s waiting on your FSM or on a broken external connection), then you’re also going to be stuck forever.

Your options are likely to be one of:

- put a timeout on the call (and possibly handle the errors)
- move the call to be asynchronous to a sub-process so even if it hangs it doesn’t block the FSM
- investigate and resolve the issue with the TLS handshake to make sure it can’t happen (may not always be possible if the causes are outside of your control)

essen commented 3 months ago

Yes I suppose you could configure tls_handshake_timeout to something other than infinity to avoid the issue of the Gun connection process getting stuck. But I don't know what would happen to the TLS process. Maybe investigating it would tell us more. I think its pid is <0.21289.50> if your node is still up.

Edit: Note that depending on what the issue is, setting a timeout will not help. It depends on which process gets stuck and why.
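
For reference, setting that option when opening the connection might look roughly like this (a sketch based on the options visible in the backtrace; whether it helps depends on where things actually get stuck, and the 5000 ms value is arbitrary):

%% Bound the TLS handshake instead of letting it wait forever.
{ok, ConnPid} = gun:open("management.me", 443, #{
    transport => tls,
    connect_timeout => 3000,
    tls_handshake_timeout => 5000,  %% instead of the default infinity
    http_opts => #{keepalive => 5000},
    retry => 3,
    retry_timeout => 1000
}),
{ok, _Protocol} = gun:await_up(ConnPid, 10000).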

IgorKoshnarov commented 3 months ago

> I think its pid is <0.21289.50> if your node is still up.

I had to restart the application because of complaints, so I can't investigate that process. Currently there are no queues in the gun connection processes; as I said, this happens occasionally under unknown conditions.