rabbitmq / ra

A Raft implementation for Erlang and Elixir that strives to be efficient and make it easier to use multiple Raft clusters in a single system.

`ra:consistent_query` hangs inexplicably #336

Closed orre closed 1 year ago

orre commented 1 year ago

[I'm using RA v2.4.0]

We have a lot of "loss of majority" situations in our Ra clusters due to "natural causes" that we cannot affect. I've observed that ra:consistent_query seems to time out or hang in situations where, according to my (possibly limited) knowledge of Raft and Ra, it should not hang at all.

It goes so far that it hangs forever (or at least for a very long time) even when the cluster has a perfectly good majority and an elected leader. When this occurs, ra:leader_query always works perfectly fine AND it is possible to commit to the log. It is only ra:consistent_query that mysteriously hangs.

I have created a "reproducer" repo here. NB: the hostname (and possibly also your domain) must be adapted in src/ra_kv_store.erl (function: servers()) before running.

Steps to reproduce:

Start 3 erlang nodes

rebar3 shell --name ra1@ratatosk.local
rebar3 shell --name ra2@ratatosk.local
rebar3 shell --name ra3@ratatosk.local

Start cluster from ra2

ra_kv_store:start_cluster().

Run test cycle

ra_kv_store:until_block().
ra:consistent_query(ra_kv_store:servers(), fun(_) -> undefined end).

Repeating the test cycle will eventually block for a long time. Example session:

(ra2@ratatosk.local)1> ===> Booted ra_test
(ra2@ratatosk.local)1> ra_kv_store:start_cluster().                                                                                                                                                                                                                                                                                                                                                                                                                                                 
Attempting to communicate with node ra1@ratatosk.local, response: pong                                                                                                                                                                                                                                                                                                                                                                                                                              
Attempting to communicate with node ra2@ratatosk.local, response: pong                                                                                                                                                                                                                                                                                                                                                                                                                              
Attempting to communicate with node ra3@ratatosk.local, response: pong                                                                                                                                                                                                                                                                                                                                                                                                                              
{ok,[{ra_kv,'ra2@ratatosk.local'},                                                                                                                                                                                                                                                                                                                                                                                                                                                                  
     {ra_kv,'ra3@ratatosk.local'},                                                                                                                                                                                                                                                                                                                                                                                                                                                                  
     {ra_kv,'ra1@ratatosk.local'}],                                                                                                                                                                                                                                                                                                                                                                                                                                                                 
    []}                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   
(ra2@ratatosk.local)2> ra_kv_store:until_block().                                                                                                                                                                                                                                                                                                                                                                                                                                                   
Going for a round...                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      
Restarting server  {ra_kv,'ra2@ratatosk.local'}                                                                                                                                                                                                                                                                                                                                                                                                                                                     
Going for a round...                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      
Restarting server  {ra_kv,'ra2@ratatosk.local'}                                                                                                                                                                                                                                                                                                                                                                                                                                                     
Going for a round...                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      
Restarting server  {ra_kv,'ra3@ratatosk.local'}                                                                                                                                                                                                                                                                                                                                                                                                                                                     
Going for a round...                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      
** exception error: no match of right hand side value {error,                                                                                                                                                                                                                                                                                                                                                                                                                                             
                                                       {no_more_servers_to_try,                                                                                                                                                                                                                                                                                                                                                                                                                           
                                                        [{timeout,{ra_kv,'ra3@ratatosk.local'}},                                                                                                                                                                                                                                                                                                                                                                                                    
                                                         {timeout,{ra_kv,'ra3@ratatosk.local'}},                                                                                                                                                                                                                                                                                                                                                                                                    
                                                         {error,noproc}]}}                                                                                                                                                                                                                                                                                                                                                                                                                                
     in function  ra_kv_store:consistent_query/0 (/home/orre/work/ra_test/src/ra_kv_store.erl, line 51)                                                                                                                                                                                                                                                                                                                                                                                                   
     in call from ra_kv_store:maybe_block/0 (/home/orre/work/ra_test/src/ra_kv_store.erl, line 40)                                                                                                                                                                                                                                                                                                                                                                                                        
     in call from ra_kv_store:until_block/1 (/home/orre/work/ra_test/src/ra_kv_store.erl, line 57)                                                                                                                                                                                                                                                                                                                                                                                                        
(ra2@ratatosk.local)3> ra:consistent_query(ra_kv_store:servers(), fun(_) -> undefined end).                                                                                                                                                                                                                                                                                                                                                                                                         
{ok,undefined,{ra_kv,'ra3@ratatosk.local'}}                                                                                                                                                                                                                                                                                                                                                                                                                                                         
(ra2@ratatosk.local)4> ra_kv_store:until_block().                                                                                                                                                                                                                                                                                                                                                                                                                                                   
Going for a round...                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      
Restarting server  {ra_kv,'ra3@ratatosk.local'}                                                                                                                                                                                                                                                                                                                                                                                                                                                     
Going for a round...                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      
Restarting server  {ra_kv,'ra2@ratatosk.local'}                                                                                                                                                                                                                                                                                                                                                                                                                                                     
Going for a round...                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      
** exception error: no match of right hand side value {error,                                                                                                                                                                                                                                                                                                                                                                                                                                             
                                                       {no_more_servers_to_try,                                                                                                                                                                                                                                                                                                                                                                                                                           
                                                        [{timeout,{ra_kv,'ra2@ratatosk.local'}},                                                                                                                                                                                                                                                                                                                                                                                                    
                                                         {timeout,{ra_kv,'ra2@ratatosk.local'}},                                                                                                                                                                                                                                                                                                                                                                                                    
                                                         {error,noproc}]}}                                                                                                                                                                                                                                                                                                                                                                                                                                
     in function  ra_kv_store:consistent_query/0 (/home/orre/work/ra_test/src/ra_kv_store.erl, line 51)                   
     in call from ra_kv_store:maybe_block/0 (/home/orre/work/ra_test/src/ra_kv_store.erl, line 40)                        
     in call from ra_kv_store:until_block/1 (/home/orre/work/ra_test/src/ra_kv_store.erl, line 57)                        
(ra2@ratatosk.local)5> ra:consistent_query(ra_kv_store:servers(), fun(_) -> undefined end).                         
{error,{no_more_servers_to_try,[{timeout,{ra_kv,'ra2@ratatosk.local'}},                                             
                                {timeout,{ra_kv,'ra2@ratatosk.local'}},                                             
                                {error,noproc}]}}  
(ra2@ratatosk.local)6> ra:consistent_query(ra_kv_store:servers(), fun(_) -> undefined end).
{error,{no_more_servers_to_try,[{timeout,{ra_kv,'ra2@ratatosk.local'}},
                                {timeout,{ra_kv,'ra2@ratatosk.local'}},
                                {error,noproc}]}}
(ra2@ratatosk.local)7> ra:consistent_query(ra_kv_store:servers(), fun(_) -> undefined end).
{ok,undefined,{ra_kv,'ra2@ratatosk.local'}}
(ra2@ratatosk.local)8> ra_kv_store:until_block().                                          
Going for a round...
Restarting server  {ra_kv,'ra2@ratatosk.local'}
Going for a round...
** exception error: no match of right hand side value 
                    {error,{no_more_servers_to_try,[{timeout,{ra_kv,'ra2@ratatosk.local'}},
                                                    {timeout,{ra_kv,'ra2@ratatosk.local'}},
                                                    {error,noproc}]}}
     in function  ra_kv_store:consistent_query/0 (/home/orre/work/ra_test/src/ra_kv_store.erl, line 51)
     in call from ra_kv_store:maybe_block/0 (/home/orre/work/ra_test/src/ra_kv_store.erl, line 40)
     in call from ra_kv_store:until_block/1 (/home/orre/work/ra_test/src/ra_kv_store.erl, line 57)
(ra2@ratatosk.local)9> ra:consistent_query(ra_kv_store:servers(), fun(_) -> undefined end).
{error,{no_more_servers_to_try,[{timeout,{ra_kv,'ra2@ratatosk.local'}},
                                {timeout,{ra_kv,'ra2@ratatosk.local'}},
                                {error,noproc}]}}
(ra2@ratatosk.local)10> ra:consistent_query(ra_kv_store:servers(), fun(_) -> undefined end).
{error,{no_more_servers_to_try,[{timeout,{ra_kv,'ra2@ratatosk.local'}},
                                {timeout,{ra_kv,'ra2@ratatosk.local'}},
                                {error,noproc}]}}
(ra2@ratatosk.local)11> ra:consistent_query(ra_kv_store:servers(), fun(_) -> undefined end).
{error,{no_more_servers_to_try,[{timeout,{ra_kv,'ra2@ratatosk.local'}},
                                {timeout,{ra_kv,'ra2@ratatosk.local'}},
                                {error,noproc}]}}
(ra2@ratatosk.local)12> ra:consistent_query(ra_kv_store:servers(), fun(_) -> undefined end).
{ok,undefined,{ra_kv,'ra2@ratatosk.local'}}
(ra2@ratatosk.local)13> ra_kv_store:until_block().                                          
Going for a round...
Restarting server  {ra_kv,'ra2@ratatosk.local'}
Going for a round...
** exception error: no match of right hand side value 
                    {error,{no_more_servers_to_try,[{timeout,{ra_kv,'ra2@ratatosk.local'}},
                                                    {timeout,{ra_kv,'ra2@ratatosk.local'}},
                                                    {error,noproc}]}}
     in function  ra_kv_store:consistent_query/0 (/home/orre/work/ra_test/src/ra_kv_store.erl, line 51)
     in call from ra_kv_store:maybe_block/0 (/home/orre/work/ra_test/src/ra_kv_store.erl, line 40)
     in call from ra_kv_store:until_block/1 (/home/orre/work/ra_test/src/ra_kv_store.erl, line 57)
(ra2@ratatosk.local)14> ra:consistent_query(ra_kv_store:servers(), fun(_) -> undefined end).
{error,{no_more_servers_to_try,[{timeout,{ra_kv,'ra2@ratatosk.local'}},
                                {timeout,{ra_kv,'ra2@ratatosk.local'}},
                                {error,noproc}]}}
(ra2@ratatosk.local)15> ra:consistent_query(ra_kv_store:servers(), fun(_) -> undefined end).
{error,{no_more_servers_to_try,[{timeout,{ra_kv,'ra2@ratatosk.local'}},
                                {timeout,{ra_kv,'ra2@ratatosk.local'}},
                                {error,noproc}]}}
(ra2@ratatosk.local)16> ra:consistent_query(ra_kv_store:servers(), fun(_) -> undefined end).
{error,{no_more_servers_to_try,[{timeout,{ra_kv,'ra2@ratatosk.local'}},
                                {timeout,{ra_kv,'ra2@ratatosk.local'}},
                                {error,noproc}]}}
(ra2@ratatosk.local)17> ra:consistent_query(ra_kv_store:servers(), fun(_) -> undefined end).
{error,{no_more_servers_to_try,[{timeout,{ra_kv,'ra2@ratatosk.local'}},
                                {timeout,{ra_kv,'ra2@ratatosk.local'}},
                                {error,noproc}]}}
(ra2@ratatosk.local)18> ra:leader_query(ra_kv_store:servers(), fun(_) -> undefined end).    
{ok,{{9,8},undefined},{ra_kv,'ra2@ratatosk.local'}}   
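
The state at the end of this session (leader_query succeeds while consistent_query keeps failing) can be captured in a single call. Below is a minimal sketch, assuming the three-argument forms ra:leader_query/3 and ra:consistent_query/3 with an explicit timeout; the module name ra_query_probe, its functions, the 5000 ms default, and the result atoms are all hypothetical and are not part of Ra or of the reproducer repo.

```erlang
%% Hypothetical diagnostic helper; not part of Ra or the reproducer repo.
-module(ra_query_probe).
-export([probe/1, probe/2, summarize/1]).

%% Run both query types against the same servers with a 5 second
%% timeout (an arbitrary value chosen for this sketch).
probe(Servers) ->
    probe(Servers, 5000).

%% Returns {Summary, RawResults} so the raw return values stay
%% available for inspection.
probe(Servers, Timeout) ->
    QueryFun = fun(_State) -> undefined end,
    Results = #{leader_query =>
                    ra:leader_query(Servers, QueryFun, Timeout),
                consistent_query =>
                    ra:consistent_query(Servers, QueryFun, Timeout)},
    {summarize(Results), Results}.

%% Pure classification of the two results.
%% Both calls returned ok: nothing unusual.
summarize(#{leader_query := {ok, _, Leader},
            consistent_query := {ok, _, _}}) ->
    {healthy, Leader};
%% The situation reported here: a leader answers leader_query,
%% but consistent_query does not return ok.
summarize(#{leader_query := {ok, _, Leader},
            consistent_query := _NotOk}) ->
    {consistent_query_blocked, Leader};
%% leader_query itself failed, so no conclusion about consistent_query.
summarize(_Results) ->
    leader_query_failed.
```

In the shell this would be called as ra_query_probe:probe(ra_kv_store:servers()). A first element of {consistent_query_blocked, Leader} corresponds to what prompts 17 and 18 above show.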

So what's going on here? Why is ra:consistent_query blocking?

Thanks, Örjan