uwiger / gproc

Extended process registry for Erlang
Apache License 2.0

Too many timeout gen_leader leader_call error raised even though nodes are still connected. #98

Open ravindrapai opened 8 years ago

ravindrapai commented 8 years ago

Hi, I am using gproc in distributed mode. There are two Erlang VM nodes running on two separate machines on a local network (8 CPUs, 32 GB RAM each, 1 Gbps link between them). My gproc is at the latest commit on master, using Garret Smith's gen_leader_revival.

My test scenario is like a chat room:

1. 1000 connections per second are made to the server.
2. Two processes register to a single room by calling gproc:reg({p, g, <<"1">>}).
3. Like this, I have 200k processes registered across 100k rooms, two processes per room.
4. After 1 hour, all connections close almost simultaneously, resulting in a flood of gproc:unreg calls.
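A minimal sketch of the registration pattern described above (the function names are made up; only the gproc calls are taken from the scenario):

```erlang
%% Hypothetical helpers mirroring the scenario: each websocket handler
%% registers a global property for its room, peers are found via lookup_pids,
%% and the property is unregistered when the connection terminates.
join_room(Room) ->                        % e.g. Room = <<"1">>
    true = gproc:reg({p, g, Room}).

room_members(Room) ->
    gproc:lookup_pids({p, g, Room}).

leave_room(Room) ->
    true = gproc:unreg({p, g, Room}).     % this is the call that times out under load
```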

This results in a lot of gen_leader leader_call timeout errors, such as:

```
=ERROR REPORT==== 25-Sep-2015::09:44:32 ===
Ranch listener http had connection process started with cowboy_protocol:start_link/4 at <0.4876.0> exit with reason:
[{reason,{timeout,{gen_leader,leader_call,[gproc_dist,{unreg,{p,g,{ws,<<"140000">>}},<0.4876.0>}]}}},
 {mfa,{ws_handler,websocket_terminate,3}},
 {stacktrace,[{gen_leader,leader_call,2,[{file,"src/gen_leader.erl"},{line,359}]},
              {gproc_dist,leader_call,1,[{file,"src/gproc_dist.erl"},{line,831}]},
              {gproc,unreg,1,[{file,"src/gproc.erl"},{line,1169}]},
              {ws_handler,websocket_terminate,3,[{file,"src/ws_handler.erl"},{line,61}]},
              {cowboy_websocket,handler_terminate,4,[{file,"src/cowboy_websocket.erl"},{line,750}]},
              {cowboy_protocol,execute,4,[{file,"src/cowboy_protocol.erl"},{line,442}]}]},
 {req,[{socket,#Port<0.5285>},{transport,ranch_tcp},{connection,keepalive},{pid,<0.4876.0>},
       {method,<<"GET">>},{version,'HTTP/1.1'},{peer,{{X,X,X,X},17153}},{host,<<"X.X.X.X">>},
       {host_info,undefined},{port,80},{path,<<"/websocket/140000">>},{path_info,undefined},
       {qs,<<>>},{qs_vals,undefined},{bindings,[{roomname,<<"140000">>}]},
       {headers,[{<<"upgrade">>,<<"WebSocket">>},{<<"connection">>,<<"Upgrade">>},
                 {<<"host">>,<<"X.X.X.X">>},{<<"sec-websocket-key">>,<<"x3JJHMbDL1EzLkh9GBhXDw==">>},
                 {<<"sec-websocket-version">>,<<"13">>},{<<"origin">>,<<"http://172.31.17.134/">>}]},
       {p_headers,[{<<"upgrade">>,[<<"websocket">>]},{<<"connection">>,[<<"upgrade">>]}]},
       {cookies,undefined},{meta,[{websocket_version,13},{websocket_compress,false}]},
       {body_state,waiting},{buffer,<<>>},{multipart,undefined},{resp_compress,false},
       {resp_state,done},{resp_headers,[]},{resp_body,<<>>},{onresponse,undefined}]},
 {state,{state,<<"140000">>,<0.725.0>}},
 {terminate_reason,{error,closed}}]
```

```
=ERROR REPORT==== 25-Sep-2015::09:44:32 ===
Ranch listener http had connection process started with cowboy_protocol:start_link/4 at <0.4875.0> exit with reason:
[{reason,{timeout,{gen_leader,leader_call,[gproc_dist,{unreg,{p,g,{ws,<<"139997">>}},<0.4875.0>}]}}},
 {mfa,{ws_handler,websocket_terminate,3}},
 {stacktrace,[{gen_leader,leader_call,2,[{file,"src/gen_leader.erl"},{line,359}]},
              {gproc_dist,leader_call,1,[{file,"src/gproc_dist.erl"},{line,831}]},
              {gproc,unreg,1,[{file,"src/gproc.erl"},{line,1169}]},
              {ws_handler,websocket_terminate,3,[{file,"src/ws_handler.erl"},{line,61}]},
              {cowboy_websocket,handler_terminate,4,[{file,"src/cowboy_websocket.erl"},{line,750}]},
              {cowboy_protocol,execute,4,[{file,"src/cowboy_protocol.erl"},{line,442}]}]},
 {req,[{socket,#Port<0.5284>},{transport,ranch_tcp},{connection,keepalive},{pid,<0.4875.0>},
       {method,<<"GET">>},{version,'HTTP/1.1'},{peer,{{X,X,X,X},65511}},{host,<<"X.X.X.X">>},
       {host_info,undefined},{port,80},{path,<<"/websocket/139997">>},{path_info,undefined},
       {qs,<<>>},{qs_vals,undefined},{bindings,[{roomname,<<"139997">>}]},
       {headers,[{<<"upgrade">>,<<"WebSocket">>},{<<"connection">>,<<"Upgrade">>},
                 {<<"host">>,<<"X.X.X.X">>},{<<"sec-websocket-key">>,<<"x3JJHMbDL1EzLkh9GBhXDw==">>},
                 {<<"sec-websocket-version">>,<<"13">>},{<<"origin">>,<<"http://172.31.17.134/">>}]},
       {p_headers,[{<<"upgrade">>,[<<"websocket">>]},{<<"connection">>,[<<"upgrade">>]}]},
       {cookies,undefined},{meta,[{websocket_version,13},{websocket_compress,false}]},
       {body_state,waiting},{buffer,<<>>},{multipart,undefined},{resp_compress,false},
       {resp_state,done},{resp_headers,[]},{resp_body,<<>>},{onresponse,undefined}]},
 {state,{state,<<"139997">>,<0.629.0>}},
 {terminate_reason,{error,closed}}]
```

What I see is that some calls succeed and some fail. After the test completes, the ets table on the leader node returns to its initial state:

```
(8084@ip-X-X-X-X.internal)60> ets:info(gproc).
[{read_concurrency,true},
 {write_concurrency,false},
 {compressed,false},
 {memory,99},
 {owner,<0.260.0>},
 {heir,none},
 {name,gproc},
 {size,1},
 {node,'8084@ip-X-X-X-X.internal'},
 {named_table,true},
 {type,ordered_set},
 {keypos,1},
 {protection,public}]
```

Meanwhile, the ets table on the other connected node is left in a stale state:

```
(8082@ip-X-X-X-X.internal)16> ets:info(gproc).
[{read_concurrency,true},
 {write_concurrency,false},
 {compressed,false},
 {memory,11311097},
 {owner,<0.192.0>},
 {heir,none},
 {name,gproc},
 {size,550818},
 {node,'8082@ip-X-X-X-X.internal'},
 {named_table,true},
 {type,ordered_set},
 {keypos,1},
 {protection,public}]
```

Both nodes return 8084 as the leader node, which means the above isn't a net-split scenario:

```
(8082@ip-X-X-X-X.internal)15> gproc_dist:get_leader().
'8084@ip-X-X-X-X.internal'
(8084@ip-X-X-X-X.internal)61> gproc_dist:get_leader().
'8084@ip-X-X-X-X.internal'
```

I wanted to know how I could do a safe unregister for a process, or an unregister with a timeout.

uwiger commented 8 years ago

Currently, there's no way to do this. I agree it would be a nice feature to have.
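In the meantime, a minimal caller-side sketch (an assumption, not a gproc API): since the failure surfaces as an exit from gen_leader:leader_call, as in the reports above, the terminating handler can catch it instead of crashing, accepting that the entry may then linger until it is cleaned up some other way.

```erlang
%% Hypothetical wrapper (not part of gproc): swallow a leader_call timeout so
%% the terminating process doesn't crash; the registry entry is left behind.
safe_unreg(Key) ->
    try
        gproc:unreg(Key)
    catch
        exit:{timeout, _} ->
            {error, timeout}
    end.
```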

ravindrapai commented 8 years ago

Hi uwiger, thanks for replying.

I also noticed that, in the above case, gproc:unreg returns true on the slave machine but doesn't delete the key in ETS, while the same key is deleted on the leader machine:

```
ets:tab2list(gproc).
{{<0.486.0>,{n,g,{session,<<"9985">>}}},[]},
{{<0.486.0>,{p,g,<<"9985">>}},[]},
```

So would it be a good idea to run some kind of background process that calls gproc:where and gproc:lookup_pids for stale keys on the slave machine, and deletes the keys stored locally in ets if those calls return undefined or an empty list?
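A minimal sketch of the checks just described (the helper names are made up; mapping a stale key back to the local ets objects to delete is the harder part, see the code later in the thread):

```erlang
%% Hypothetical helpers: a global name is considered stale when gproc no longer
%% resolves it, and a global property when no pids are registered under it.
name_is_stale(Name) ->
    gproc:where({n, g, Name}) =:= undefined.

prop_is_stale(Key) ->
    gproc:lookup_pids({p, g, Key}) =:= [].
```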

Would doing the above result in data inconsistency (a race condition)? I am relatively new to Erlang; I am afraid that if I don't delete the stale keys I might run out of memory.

uwiger commented 8 years ago

It's of course not an Erlang-specific issue that stale data remaining in memory can cause problems. ;-)

I'll see if I can figure out why the replication sometimes fails, and also think about whether some tricks might be applied to make gproc more resilient to e.g. unreg storms.

ravindrapai commented 8 years ago

FYI, this happens even if the two Erlang VMs are running on a single machine, so I guess it's not a network bandwidth or latency problem; just a huge unreg storm resulting in stale keys.

ravindrapai commented 8 years ago

Hi uwiger, did you have a chance to look into the issue? I tried out the uw-delete-globals branch that you committed for issue #13; the problem above is still there.

ravindrapai commented 8 years ago

Hi uwiger, I have set up a background process to clean up the stale keys. Would this be a good strategy? Please find the code below.

```erlang
gproc_scheduled_cleaner() ->
    %% Scheduled cleaner, needed for now: during a gproc unreg storm the
    %% slave node is left with stale keys in its local gproc table.
    %% For more info see https://github.com/uwiger/gproc/issues/98
    receive
        stop ->
            ok
    after
        3600000 ->   % run once an hour
            GprocList = ets:tab2list(gproc),
            lists:foreach(fun(Entry) -> process_entry(Entry) end, GprocList),
            gproc_scheduled_cleaner()
    end.

process_entry(Entry) ->
    %% Extract the pid position depending on the shape of the ets entry;
    %% skip anything that doesn't carry a pid.
    case Entry of
        {{Pid, _}} when is_pid(Pid) ->
            process_pid(Pid, Entry);
        {{Pid, _}, _} when is_pid(Pid) ->
            process_pid(Pid, Entry);
        {_, Pid, _} when is_pid(Pid) ->
            process_pid(Pid, Entry);
        _ ->
            ok
    end.

process_pid(Pid, Entry) ->
    %% Check whether the registered process is still alive; if it is dead,
    %% delete the stale object from the local gproc table. For a dead remote
    %% pid the 'DOWN' message may arrive after the 0 ms timeout, in which case
    %% the entry is simply retried on the next pass.
    MonitorRef = erlang:monitor(process, Pid),
    receive
        {'DOWN', MonitorRef, process, Pid, _} ->
            ets:delete_object(gproc, Entry)
    after
        0 ->
            erlang:demonitor(MonitorRef, [flush]),
            ok
    end.
```