tidwall / summitdb

In-memory NoSQL database with ACID transactions, Raft consensus, and Redis API

missing server means new leader complains forever; needs to avoid spamming its logs #13

Closed glycerine closed 7 years ago

glycerine commented 7 years ago

Checking the Raft fault-tolerance functionality at commit 56ec0609e35bc528c2789854038cbdb675e62e97.

in terminal0:
 ./summitdb-server

in terminal1:
 ./summitdb-server -p 7482 -dir data2 -join :7481

in terminal2:
 summitdb-server -p 7483 -dir data3 -join :7482

Kill the terminal0 server. The terminal1 server takes over as leader.

Of concern: the newly elected leader complains forever about not being able to contact the terminal0 server on port 7481. That server may be permanently gone, so filling up the logs with this chatter is pointless.

This continues even after starting a new third server with:
./summitdb-server -p 7484 -dir data4 -join :7482

Now the leader knows about a full bank of 3 servers, but it still complains about not being able to reach 7481. Log space is massively wasted with pages and pages of:

90632:N 18 Jan 23:59:35.457 # Failed to heartbeat to :7481: dial tcp :7481: getsockopt: connection refused
90632:N 18 Jan 23:59:43.644 # Failed to AppendEntries to :7481: dial tcp :7481: getsockopt: connection refused
90632:N 18 Jan 23:59:45.849 # Failed to heartbeat to :7481: dial tcp :7481: getsockopt: connection refused
90632:N 18 Jan 23:59:53.942 # Failed to AppendEntries to :7481: dial tcp :7481: getsockopt: connection refused
90632:N 18 Jan 23:59:56.288 # Failed to heartbeat to :7481: dial tcp :7481: getsockopt: connection refused
90632:N 19 Jan 00:00:04.256 # Failed to AppendEntries to :7481: dial tcp :7481: getsockopt: connection refused
90632:N 19 Jan 00:00:06.715 # Failed to heartbeat to :7481: dial tcp :7481: getsockopt: connection refused
90632:N 19 Jan 00:00:14.556 # Failed to AppendEntries to :7481: dial tcp :7481: getsockopt: connection refused
90632:N 19 Jan 00:00:17.150 # Failed to heartbeat to :7481: dial tcp :7481: getsockopt: connection refused

It seems fine to complain a couple of times. But once the new leader gets the same server count back, it should certainly be quiet about losing an old node.
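To make the request concrete, here is a minimal sketch of "complain a couple of times, then go quiet" as a per-peer log throttle. It is only an illustration in Python: the class name `ThrottledReporter` and the parameters `burst` and `quiet_interval` are hypothetical, and nothing here reflects how SummitDB or hashicorp/raft actually logs.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class ThrottledReporter:
    """Decide whether a repeated failure for a given peer should be logged.

    Hypothetical helper: the first `burst` failures per peer are logged,
    after which at most one message per `quiet_interval` seconds gets through.
    """

    def __init__(
        self,
        burst: int = 3,
        quiet_interval: float = 300.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.burst = burst
        self.quiet_interval = quiet_interval
        self.clock = clock
        # peer address -> (failures seen so far, time of last logged message)
        self._state: Dict[str, Tuple[int, Optional[float]]] = {}

    def should_log(self, peer: str) -> bool:
        now = self.clock()
        count, last_logged = self._state.get(peer, (0, None))
        count += 1
        if (
            count <= self.burst
            or last_logged is None
            or now - last_logged >= self.quiet_interval
        ):
            self._state[peer] = (count, now)
            return True
        self._state[peer] = (count, last_logged)
        return False

    def reset(self, peer: str) -> None:
        """Forget a peer, e.g. once it becomes reachable again."""
        self._state.pop(peer, None)
```

A caller would wrap each "Failed to heartbeat" style message in `if reporter.should_log(":7481"): ...`, and call `reset` when the peer answers again so a later outage is reported in full.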

tidwall commented 7 years ago

As I understand it, this is normal behavior with Raft. The cluster will keep trying to reach the node until the node is explicitly removed. This is the stock behavior of the hashicorp/raft library.

SummitDB has the RAFTREMOVEPEER command, which will force-remove the node.

RAFTREMOVEPEER :7481

The log should quiet down shortly after.
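Since SummitDB speaks the Redis protocol, the command can be issued from any Redis client. As a rough sketch, the Python below encodes the command as a RESP array and sends it over a plain TCP socket. The helper names `encode_resp` and `remove_peer` are made up for illustration, and the ports follow the scenario in this thread (leader on 7482, unreachable node on 7481); it is an assumption here that the command should be sent to the current leader.

```python
import socket


def encode_resp(*args: str) -> bytes:
    """Encode a command as a RESP array of bulk strings."""
    parts = [b"*%d\r\n" % len(args)]
    for arg in args:
        data = arg.encode("utf-8")
        parts.append(b"$%d\r\n%s\r\n" % (len(data), data))
    return b"".join(parts)


def remove_peer(host: str, port: int, peer_addr: str, timeout: float = 5.0) -> str:
    """Send RAFTREMOVEPEER <peer_addr> to host:port and return the raw reply."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(encode_resp("RAFTREMOVEPEER", peer_addr))
        return conn.recv(4096).decode("utf-8", errors="replace").strip()


# Example (requires a running cluster; addresses assumed from this thread):
# print(remove_peer("127.0.0.1", 7482, ":7481"))
```

The reply is returned unparsed, so an error from the server (for example, if the target is not the leader) shows up as the raw response text.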

glycerine commented 7 years ago

Thanks Josh. That clarifies the situation.