ninenines / ranch

Socket acceptor pool for TCP protocols.

Restart issues found when upgrading 1.8 -> 2.1.0 #347

Open · JeppeTh opened 3 months ago

JeppeTh commented 3 months ago

If the process calling ranch:stop_listener is terminated in the middle of the stop, we end up in a limbo state. The following simulates the stopping process being killed right after calling supervisor:terminate_child:

```
1> ranch:start_listener(my_server,ranch_tcp,#{},cowboy_tls,#{}).
{ok,<0.4006.0>}
2> supervisor:terminate_child(ranch_sup, {ranch_listener_sup, my_server}).
ok
3> ranch:stop_listener(my_server).
{error,not_found}
4> ranch:info(my_server).
** exception error: bad argument
     in function  ets:lookup_element/3
        called as ets:lookup_element(ranch_server,{listener_sup,my_server},2)
        *** argument 2: not a key that exists in the table
     in call from ranch_server:get_listener_sup/1 (/local/gsnws/sgsn_mme/3pp/erlang/lib/ranch/src/ranch_server.erl, line 122)
     in call from ranch:info/1 (/local/gsnws/sgsn_mme/3pp/erlang/lib/ranch/src/ranch.erl, line 443)
5> ranch:start_listener(my_server,ranch_tcp,#{},cowboy_tls,#{}).
{error,already_present}
```

Most likely introduced here: https://github.com/ninenines/ranch/commit/9765f305e1f55f758a683bd95665fd0ab84a52c3. Before this change, a second call to ranch:stop_listener performed another full stop. That is no longer the case: since the ets tables appear to be cleaned already, the stop is aborted with a badarg.

In another situation I managed to get into the same state, but there ranch:info/0 finds the listener while ranch:info/1 does not. In the case above, not even ranch:info/0 finds it.

JeppeTh commented 3 months ago

The cleanup part should probably be separated into a try/catch of its own, i.e. the data it needs must be fetched before calling terminate_child. Then, regardless of the outcome, always perform the legacy parts of stop_listener. A sketch of that restructuring follows below.
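A minimal sketch of what that restructuring could look like. This is illustrative only, not ranch's actual code: fetch_cleanup_data/1 and cleanup_listener/2 are hypothetical placeholders for whatever the real lookups and cleanup steps are; only ranch:get_transport_options/1 and the supervisor calls are real APIs here.

```erlang
-module(stop_listener_sketch).
-export([stop_listener/1]).

%% Sketch only: fetch the data that cleanup needs *before* terminating the
%% child, then always run the "legacy" cleanup in a try/catch of its own,
%% so a repeated or interrupted stop cannot abort with a badarg and leave
%% the listener in limbo.
stop_listener(Ref) ->
    CleanupData = fetch_cleanup_data(Ref),
    Result = supervisor:terminate_child(ranch_sup, {ranch_listener_sup, Ref}),
    _ = supervisor:delete_child(ranch_sup, {ranch_listener_sup, Ref}),
    try
        cleanup_listener(Ref, CleanupData)
    catch
        _:_ -> ok
    end,
    Result.

%% Hypothetical stand-in: grab whatever cleanup will need while the
%% ranch_server entries still exist.
fetch_cleanup_data(Ref) ->
    ranch:get_transport_options(Ref).

%% Hypothetical stand-in for the legacy cleanup steps (ets cleanup,
%% transport cleanup, ...).
cleanup_listener(_Ref, _CleanupData) ->
    ok.
```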

juhlig commented 3 months ago

#348 should fix this.

However, in the discussion in the linked commit, you mentioned that you could still "see" the listener in the output from ranch:info/0. This should not be possible, since this depends on the contents of the ranch_server table, which should be cleaned as soon as the process of the listener exits. If you really can find it in the ranch:info/0 output, this suggests a corruption of the data in the table. If you could provide steps to reproduce this scenario, it would be immensely helpful, otherwise I fear that I can't do much about it :man_shrugging:

JeppeTh commented 3 months ago

That other case happened in my test tool - let's see if I get the time and luck to reproduce it...

juhlig commented 3 months ago

This may be due to timing then... ranch_server monitors the listener processes, and when it receives a 'DOWN' message it clears the related entries from the table. If the request sent by way of ranch:info/1,2 reaches the ranch_server process before the 'DOWN' message does, you will "see" the listener still there.

Put differently, if a call to ranch:info/1,2 does show your listener, another subsequent call may or may not show it. However, if a call to ranch:info/1,2 does not show your listener, another subsequent call must also not show it. (That is, of course, unless there was a call to ranch:start_listener in between).
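To make the ordering concrete, here is a minimal, self-contained sketch (purely illustrative, not ranch's actual implementation) of a registry that behaves as described: it monitors listener processes and only clears their table entries when the 'DOWN' message is handled, so an info request queued ahead of the 'DOWN' still reports the dead listener.

```erlang
-module(down_race_demo).
-behaviour(gen_server).

%% Illustrative only: a tiny registry loosely modelled on the behaviour
%% described above, not ranch_server itself.
-export([start_link/0, register_listener/2, info/1]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

start_link() ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

register_listener(Ref, Pid) ->
    gen_server:call(?MODULE, {register, Ref, Pid}).

info(Ref) ->
    gen_server:call(?MODULE, {info, Ref}).

init([]) ->
    Tab = ets:new(?MODULE, [named_table]),
    {ok, Tab}.

handle_call({register, Ref, Pid}, _From, Tab) ->
    _ = erlang:monitor(process, Pid),
    true = ets:insert(Tab, {Ref, Pid}),
    {reply, ok, Tab};
handle_call({info, Ref}, _From, Tab) ->
    %% If this request was queued before the 'DOWN' message, a listener
    %% that has already exited is still reported as present.
    {reply, ets:lookup(Tab, Ref), Tab}.

handle_cast(_Msg, Tab) ->
    {noreply, Tab}.

handle_info({'DOWN', _MRef, process, Pid, _Reason}, Tab) ->
    %% Entries are cleared only here; once this runs, no later info
    %% request can see the listener again.
    true = ets:match_delete(Tab, {'_', Pid}),
    {noreply, Tab};
handle_info(_Info, Tab) ->
    {noreply, Tab}.
```

Once the 'DOWN' has been processed the entry is gone for good, which matches the monotonic behaviour described above.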

JeppeTh commented 3 months ago

It wasn't that difficult to reproduce - here's a trace on ranch. I haven't analysed it yet. trace.log

JeppeTh commented 3 months ago

Where you can see, e.g.:

12:16:41,878: (me@mymachine,<0.1973.0>) return ranch:info/0 ->
{{myserver,{10,0,4,48},5022} =>

12:17:03,534: (me@mymachine,<0.1980.0>) return ranch:stop_listener/1 -> {error,not_found}

12:17:19,778: (me@mymachine,<0.1983.0>) return ranch:start_listener/5 -> {error,{already_started,<0.1965.0>}}

JeppeTh commented 3 months ago

But perhaps this is not related to the mentioned change. At least running the old stop_listener doesn't help; it's still stuck in limbo.

JeppeTh commented 3 months ago

Sorry, this side track was all wrong. I now realize the ranch Ref wasn't just the name, it's {Name, Ip, Port} 🙈
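For illustration only (hypothetical Ref and options, not the original setup), the difference looks like this: the Ref passed at start time must be reused verbatim when stopping or querying the listener.

```
1> Ref = {myserver, {10,0,4,48}, 5022}.
2> {ok, _} = ranch:start_listener(Ref, ranch_tcp, #{}, cowboy_tls, #{}).
3> ranch:stop_listener(myserver). % wrong Ref -> {error,not_found}
4> ranch:stop_listener(Ref).      % full Ref tuple -> ok
```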