sofastack / sofa-jraft

A production-grade Java implementation of the RAFT consensus algorithm.
https://www.sofastack.tech/projects/sofa-jraft/
Apache License 2.0

Leader election after node restart #683

Open Excpt0r opened 2 years ago

Excpt0r commented 2 years ago

Your question

I use sofa-jraft for the sole purpose of leader election between instances of myapp. One test is to shut down one of three myapp instances and observe the other two elect a new leader - this works. When restarting the stopped instance afterwards, I would expect it to be re-integrated into the raft cluster as a follower, but I see the following behaviour:

Starts FSMCaller successfully.
Do not set snapshot uri, ignore initSnapshotStorage.
LeaderElection/myapp-0.myapp-headless:8888> init, term=0, lastLogId=LogId [index=0, term=0], conf=myapp-0.myapp-headless:8888,myapp-1.myapp-headless:8888,myapp-2.myapp-headless:8888, oldConf=.
myapp-0.myapp-headless:8888> term 0 start preVote

Channel in TRANSIENT_FAILURE state: myapp-1.myapp-headless:8888.
Channel in SHUTDOWN state: myapp-1.myapp-headless:8888.
Peer myapp-1.myapp-headless:8888 is connected.

Channel in TRANSIENT_FAILURE state: myapp-2.myapp-headless:8888.
Channel in SHUTDOWN state: myapp-2.myapp-headless:8888.
Peer myapp-2.myapp-headless:8888 is connected.

received invalid PreVoteResponse from myapp-1.myapp-headless:8888, term 2, expect=0.
LocalRaftMetaStorage | INFO | Save raft meta, path=/tmp/meta, term=2, votedFor=0.0.0.0:0, cost time=12 ms
received invalid PreVoteResponse from myapp-2.myapp-headless:8888, term=0, currTerm=2

myapp-0.myapp-headless:8888> term 2 start preVote.
received PreVoteResponse from myapp-1.myapp-headless:8888, term=2, granted=false.
received PreVoteResponse from myapp-2.myapp-headless:8888, term=2, granted=false.

The last three lines are repeated over and over again.

Questions:

1) Are a persistent volume and the "snapshot" feature needed so that the restarted instance can be re-integrated into the raft cluster? Currently I don't persist the jraft data and snapshots are disabled; myapp even deletes any existing data in the config directory during startup (for local environments).
2) What would be the expected behaviour - how is the restarted instance re-integrated, and how does it get the current data from the cluster?
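For reference, this is roughly how the election-only node is set up on my side (a minimal sketch; peer names, ports and paths are illustrative, not my exact code):

import com.alipay.sofa.jraft.Iterator;
import com.alipay.sofa.jraft.Node;
import com.alipay.sofa.jraft.RaftGroupService;
import com.alipay.sofa.jraft.conf.Configuration;
import com.alipay.sofa.jraft.core.StateMachineAdapter;
import com.alipay.sofa.jraft.entity.PeerId;
import com.alipay.sofa.jraft.option.NodeOptions;

public class ElectionNode {
    public static void main(String[] args) {
        // All peers of the election group.
        final Configuration initialConf = new Configuration();
        initialConf.parse("myapp-0.myapp-headless:8888,myapp-1.myapp-headless:8888,myapp-2.myapp-headless:8888");

        // This instance's own id.
        final PeerId serverId = new PeerId();
        serverId.parse("myapp-0.myapp-headless:8888");

        final NodeOptions opts = new NodeOptions();
        opts.setInitialConf(initialConf);
        opts.setElectionTimeoutMs(1000);
        opts.setLogUri("/tmp/jraft/log");        // currently wiped on startup
        opts.setRaftMetaUri("/tmp/jraft/meta");  // currently wiped on startup
        // No snapshot uri -> "Do not set snapshot uri, ignore initSnapshotStorage."
        opts.setFsm(new StateMachineAdapter() {
            @Override
            public void onApply(final Iterator iter) {
                // Election only, no application state to apply.
                while (iter.hasNext()) {
                    iter.next();
                }
            }

            @Override
            public void onLeaderStart(final long term) {
                System.out.println("became leader, term=" + term);
            }
        });

        final RaftGroupService service = new RaftGroupService("LeaderElection", serverId, opts);
        final Node node = service.start();
        System.out.println("started node " + node.getNodeId());
    }
}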

Environment

fengjiachun commented 2 years ago

Can you give detailed steps to reproduce the problem? My test results are as follows:

2021-09-15 15:09:08 [main] INFO  FSMCallerImpl:201 - Starts FSMCaller successfully.
2021-09-15 15:09:08 [main] WARN  NodeImpl:560 - Do not set snapshot uri, ignore initSnapshotStorage.
2021-09-15 15:09:08 [main] INFO  NodeImpl:1085 - Node <counter/127.0.0.1:8083> init, term=0, lastLogId=LogId [index=0, term=0], conf=127.0.0.1:8081,127.0.0.1:8082,127.0.0.1:8083, oldConf=.
2021-09-15 15:09:08 [main] INFO  RaftGroupService:136 - Start the RaftGroupService successfully.
Started counter server at port:8083
2021-09-15 15:09:09 [counter/PeerPair[127.0.0.1:8083 -> 127.0.0.1:8082]-AppendEntriesThread0] INFO  LocalRaftMetaStorage:132 - Save raft meta, path=/tmp/server3/raft_meta, term=6, votedFor=0.0.0.0:0, cost time=11 ms
2021-09-15 15:09:09 [counter/PeerPair[127.0.0.1:8083 -> 127.0.0.1:8082]-AppendEntriesThread0] WARN  NodeImpl:1953 - Node <counter/127.0.0.1:8083> reject term_unmatched AppendEntriesRequest from 127.0.0.1:8082, term=6, prevLogIndex=999, prevLogTerm=6, localPrevLogTerm=0, lastLogIndex=0, entriesSize=0.
2021-09-15 15:09:09 [JRaft-FSMCaller-Disruptor-0] INFO  StateMachineAdapter:89 - onStartFollowing: LeaderChangeContext [leaderId=127.0.0.1:8082, term=6, status=Status[ENEWLEADER<10011>: Raft node receives message from new leader with higher term.]].
2021-09-15 15:09:09 [counter/PeerPair[127.0.0.1:8083 -> 127.0.0.1:8082]-AppendEntriesThread0] WARN  NodeImpl:1953 - Node <counter/127.0.0.1:8083> reject term_unmatched AppendEntriesRequest from 127.0.0.1:8082, term=6, prevLogIndex=999, prevLogTerm=6, localPrevLogTerm=0, lastLogIndex=0, entriesSize=0.
2021-09-15 15:09:09 [counter/PeerPair[127.0.0.1:8083 -> 127.0.0.1:8082]-AppendEntriesThread0] WARN  NodeImpl:1953 - Node <counter/127.0.0.1:8083> reject term_unmatched AppendEntriesRequest from 127.0.0.1:8082, term=6, prevLogIndex=999, prevLogTerm=6, localPrevLogTerm=0, lastLogIndex=0, entriesSize=0.
2021-09-15 15:09:10 [counter/PeerPair[127.0.0.1:8083 -> 127.0.0.1:8082]-AppendEntriesThread0] INFO  Recyclers:52 - -Djraft.recyclers.maxCapacityPerThread: 4096.
2021-09-15 15:09:10 [JRaft-FSMCaller-Disruptor-0] INFO  CounterStateMachine:102 - Added value=0 by delta=106 at logIndex=3
2021-09-15 15:09:10 [JRaft-FSMCaller-Disruptor-0] INFO  CounterStateMachine:102 - Added value=106 by delta=5 at logIndex=4
2021-09-15 15:09:10 [JRaft-FSMCaller-Disruptor-0] INFO  CounterStateMachine:102 - Added value=111 by delta=15 at logIndex=5
killme2008 commented 2 years ago

You should keep the jraft data in persistent storage; it contains the commit logs and metadata. If that data is lost or corrupted, the node will copy the logs from the leader and replay them to recover the state machine. The snapshot helps speed up the recovery procedure.
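For example (just a sketch, paths are illustrative - point them at a persistent volume instead of /tmp and set a snapshot uri):

import com.alipay.sofa.jraft.option.NodeOptions;

final NodeOptions opts = new NodeOptions();
// Keep log and meta data on a persistent volume, not an ephemeral directory.
opts.setLogUri("/data/jraft/log");
opts.setRaftMetaUri("/data/jraft/meta");
// With a snapshot uri set, a restarted node can install a snapshot from the
// leader instead of replaying the whole log.
opts.setSnapshotUri("/data/jraft/snapshot");
opts.setSnapshotIntervalSecs(3600); // take a snapshot roughly every hour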

Excpt0r commented 2 years ago

Hi,

thanks for your replies! Sorry, I forgot to mention that the problem is not often reproducible.

@killme2008 is there much data that needs to be replayed when I only use leader election and no KV store? Will deleting the old raft data on startup only slow down the sync, or could that be the reason for the observed problem?

The reason I decided to delete the raft data on startup is that I thought it might lead to issues when all cluster nodes are shut down intentionally and started again (with previous raft data). In my experience it's not easy to get stateful deployments (e.g. etcd, zookeeper) back into a correct state with previous data. I just wanted to avoid such cases and any special handling that might be needed.

killme2008 commented 2 years ago

Please attach the new leader's log. I think it would help us find the problem.

There aren't many logs if you just use jraft for election, but deleting the data every time may hurt election speed.

Excpt0r commented 2 years ago

Hi,

I have reproduced the problem and fetched the logs of all election nodes. After the initial successful election, device-sub-1 became leader - this is the node that was then killed at 2021-09-28T11:20:45, after which device-sub-2 became the new leader. The attached log for node device-sub-1 covers the time after the kill, while it was restarting and unsuccessfully trying to reconnect to the cluster.

election-log.tar.gz

fengjiachun commented 2 years ago

How did you kill the process? From the log information, it seems that device-sub-1 did not start its rpcServer successfully, so it cannot receive the leader's messages, but it can still send messages to the others via the gRPC client (vote messages).

Because the time range covered by the log is limited, I do not know whether device-sub-1 eventually returned to normal.

Can you check the health of the rpcServer in this case? For example, whether the port is accessible or whether a client can successfully connect to it, etc.

netstat is helpful if you are working on Linux.

Excpt0r commented 2 years ago

Hi @fengjiachun

all services run on Kubernetes on KVM-based virtual machines. My test executes a "virsh destroy" (like unplugging the machine, not a clean shutdown) on the worker node that hosts the current election leader. I analysed the rpc server connectivity as you suggested, since the new leader is not able to connect to the restarted old leader. I can connect to the restarted node from the leader node's container via telnet, but the running election process cannot connect. -> Reason: the election node (grpc) tries to connect to the old IP (killing and restarting the old leader leads to a new pod IP).

Actually this is a problem I already had with grpc-java in a different service (see https://github.com/grpc/grpc-java/issues/8574). Hopefully there will be a suggestion on how to change my grpc config, or a bugfix, that can be applied to jraft as well.

Another option would be to switch back from grpc to bolt. Is the bugfix for bolt that you mentioned in this comment already implemented and released? https://github.com/sofastack/sofa-jraft/issues/599#issuecomment-851786047 Besides the bugfix, would there be any advantage or disadvantage to using bolt vs. grpc in my scenario, where only leader election functionality is needed?

Excpt0r commented 2 years ago

In the linked grpc-java issue, the real cause seems to be how jetcd uses grpc (with its own DNS resolver, not grpc's). So I gathered more logs from jraft/grpc to see how DNS resolution is done. I can see the log message "Resolved address" only at application startup (see attachment), not after failed connections.

Having a look at jraft-extension/rpc-grpc-impl/src/main/java/com/alipay/sofa/jraft/rpc/impl/GrpcClient.java, it seems that in getChannel() the grpc channel is constructed based on IP. So to detect a changed DNS entry, jraft would need to update the endpoints/IPs, right?
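Just to illustrate what I mean (a sketch with made-up addresses, not jraft's actual code):

import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;

// Built from an already-resolved IP, the channel keeps reconnecting to that IP:
final ManagedChannel byIp = ManagedChannelBuilder
        .forAddress("10.130.0.21", 8888) // old pod IP, made up for illustration
        .usePlaintext()
        .build();

// Built from the DNS name (dns:/// target), gRPC's own resolver can re-resolve
// when connections fail, subject to resolver/JVM caching:
final ManagedChannel byName = ManagedChannelBuilder
        .forTarget("dns:///myapp-1.myapp-headless:8888")
        .usePlaintext()
        .build();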

jraft-grpc.log

fengjiachun commented 2 years ago

I have an idea: Delete the channel when it becomes TRANSIENT_FAILURE and reinitialize the connection the next time it is called.
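Something along these lines (only a sketch of the idea, with simplified channel bookkeeping, not an actual patch to GrpcClient):

import io.grpc.ConnectivityState;
import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class ChannelPool {
    private final ConcurrentMap<String, ManagedChannel> channels = new ConcurrentHashMap<>();

    public ManagedChannel getChannel(final String target) {
        return this.channels.computeIfAbsent(target, t -> {
            final ManagedChannel ch = ManagedChannelBuilder.forTarget(t).usePlaintext().build();
            watchState(t, ch);
            return ch;
        });
    }

    private void watchState(final String target, final ManagedChannel ch) {
        final ConnectivityState current = ch.getState(false);
        ch.notifyWhenStateChanged(current, () -> {
            if (ch.getState(false) == ConnectivityState.TRANSIENT_FAILURE) {
                // Drop the broken channel; the next getChannel(target) call
                // builds a fresh one and resolves the address again.
                this.channels.remove(target, ch);
                ch.shutdown();
            } else {
                watchState(target, ch); // keep watching further state changes
            }
        });
    }
}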

What do you think? @Excpt0r @killme2008

fengjiachun commented 2 years ago

Is the bugfix for bolt that you mentioned in this comment already implemented and released?

The bug was fixed in 1.6.4.

killme2008 commented 2 years ago

I have an idea: Delete the channel when it becomes TRANSIENT_FAILURE and reinitialize the connection the next time it is called.

What do you think? @Excpt0r @killme2008

I think it will work. But the JVM caches DNS name lookups, so we should recommend that users set the cache to a smaller value via networkaddress.cache.ttl.
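For example (one way to set it programmatically, before the first lookup; 10 seconds is just an example value):

// Shorten the JVM's DNS caches so a changed pod IP is picked up quickly.
java.security.Security.setProperty("networkaddress.cache.ttl", "10");
java.security.Security.setProperty("networkaddress.cache.negative.ttl", "10");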

fengjiachun commented 2 years ago

I have an idea: Delete the channel when it becomes TRANSIENT_FAILURE and reinitialize the connection the next time it is called.

What do you think? @Excpt0r @killme2008

channel#resetConnectBackoff is equivalent to reinitializing a connection #690
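Roughly the idea (a sketch, not the actual change in #690; channel here is an io.grpc.ManagedChannel):

if (channel.getState(false) == ConnectivityState.TRANSIENT_FAILURE) {
    // Short-circuit the reconnect backoff so gRPC retries immediately
    // (and may refresh name resolution) instead of rebuilding the channel.
    channel.resetConnectBackoff();
}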

Excpt0r commented 2 years ago

Hi @fengjiachun

I think I found another occurrence of the described issue, even though it seemed to be fixed. To rule out a problem at the application/environment level, I used jraft with bolt rpc instead of grpc, and the problem does not appear there.

In the attached logs, I did the following:

Maybe I will keep bolt rpc instead of grpc, but it's probably good to track the problem anyway.

Runtime: OpenJDK 11.0.11, jraft v1.3.9
java.security: networkaddress.cache.ttl=10, networkaddress.cache.negative.ttl=10

Cheers and thanks.

logs.zip

fengjiachun commented 2 years ago

I haven't found the exact reason, but can you try to make sure this IP (10.130.0.21) is not re-used? From the logs, this IP still exists and the kernel responds with an RST. gRPC doesn't change the channel state to TRANSIENT_FAILURE when it receives an RST (which I'm also rather confused about).

| 2022-02-21T15:56:49,620 | grpc-default-worker-ELG-3-1 | io.grpc.netty.shaded.io.grpc.netty.NettyClientHandler | DEBUG | [id: 0x174b8491, L:/10.129.2.34:46478 - R:x-device-sub-1.x-device-sub-headless/10.130.0.21:8888] OUTBOUND RST_STREAM: streamId=691 errorCode=8

Excpt0r commented 2 years ago

Hi,

the application runs on Kubernetes, which means there is virtual networking at the container level plus networking at the VM level. I don't think I can influence the behaviour of that stack with regard to how previous pod IPs are handled.

I'm wondering where the "OUTBOUND RST_STREAM" comes from - in reaction to what? We see a prevote request/response between the restarted leader device-sub-1 and the follower device-sub-0, but I don't see that between device-sub-1 and device-sub-2, which is the node sending the OUTBOUND RST_STREAM.

I'm not sure the IP still exists and that device-sub-2 received a (TCP) RST. This RST message looks more like something grpc itself is doing, and its direction is outgoing - where is the kernel involved?