seaweedfs / seaweedfs

SeaweedFS is a fast distributed storage system for blobs, objects, files, and data lake, for billions of files! Blob store has O(1) disk seek, cloud tiering. Filer supports Cloud Drive, cross-DC active-active replication, Kubernetes, POSIX FUSE mount, S3 API, S3 Gateway, Hadoop, WebDAV, encryption, Erasure Coding.
https://seaweedfs.com
Apache License 2.0

topo leader: leader not selected yet #4689


ginkel commented 1 year ago

Describe the bug After this morning's upgrade to 3.54, the seaweed masters of our three-node cluster no longer successfully elect a leader:

I0717 20:32:19.311663 masterclient.go:201 .master masterClient failed to receive from 10.147.254.1:9333: rpc error: code = Unknown desc = raft.Server: Not current leader
I0717 20:32:19.313555 masterclient.go:209 master 10.147.254.2:9333 redirected to leader 10.147.254.1:9333
E0717 20:32:19.417173 master_grpc_server.go:334 topo leader: leader not selected yet
W0717 20:32:21.083170 master_grpc_server.go:112 SendHeartbeat find leader: leader not selected yet
E0717 20:32:27.020374 master_grpc_server.go:334 topo leader: leader not selected yet
W0717 20:32:27.811161 master_grpc_server.go:112 SendHeartbeat find leader: leader not selected yet
E0717 20:32:29.881027 master_grpc_server.go:334 topo leader: leader not selected yet
W0717 20:32:32.328074 master_grpc_server.go:112 SendHeartbeat find leader: leader not selected yet
E0717 20:32:51.281045 master_grpc_server.go:334 topo leader: leader not selected yet
E0717 20:32:52.086423 master_grpc_server.go:334 topo leader: leader not selected yet
E0717 20:32:54.946297 master_grpc_server.go:334 topo leader: leader not selected yet

Restarting the master processes once unfortunately did not resolve the issue. A second restart did, but I'd expect the cluster to converge on its own.

System Setup

Expected behavior After updating the seaweedfs Docker containers, the cluster comes back up on its own.

Screenshots n/a

Additional context n/a

chrislusf commented 1 year ago

I do not recall any changes related to this. What is the last known good version?

chrislusf commented 1 year ago

I ran make cluster under the docker/ folder. It is still working as expected.

ginkel commented 1 year ago

Thanks for your speedy response!

Regarding the last-known-good version: I remember that the cluster experienced the same issue during the previous upgrade, which I was also able to remedy by restarting all processes.

I'll also add some context regarding our setup (sorry for the lack of detail in the original issue report - it was late at night and I was somewhat glad that I had been able to bring the cluster back into shape):

We are running SeaweedFS on a three-node bare-metal cluster using Docker (separate containers for master, volume, filer, s3). The nodes are connected via a private network based on WireGuard. The containers are based on your "official" Docker image using the latest tag and are automatically updated by Watchtower. That means there is no explicit coordination during the upgrade, and containers are updated mostly at random during a ten-minute window. This is when the instability began.

If you have any ideas on how I can help pinpoint the problem's cause, I am open to input.
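
For completeness, "restarting all processes" here means bouncing the master container on each of the three nodes within the same short window; a minimal sketch, assuming placeholder hostnames and a container named seaweedfs-master (not our actual names):

# placeholder hostnames and container name - adjust to the actual setup
# restart all masters at roughly the same time so they can re-elect a leader together
for host in node1 node2 node3; do
  ssh "$host" docker restart seaweedfs-master &
done
wait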

chrislusf commented 1 year ago

maybe have some steps, or a docker compose file, to reproduce this?

ginkel commented 1 year ago

I tried building a reproducer based on the docker-compose from this repo (extended to three master nodes) and updated via Watchtower, to match our production setup as closely as possible.

Unfortunately, I haven't been able to reproduce the issue so far.

Are there any factors that influence how long the master process takes to start up (such as number of allocated volumes) that I may not have modeled correctly in my artificial repro setup?

geekboood commented 9 months ago

I encountered the same issue here, and it happened during a k8s master StatefulSet rolling upgrade (with 3 pods in total). It could be resolved by deleting all master pods at once (restarting all master processes simultaneously). Not sure why this happened, but I observed that the HTTP server or gRPC server may not respond normally.
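
A minimal sketch of that workaround, assuming a placeholder namespace and the pod names of a typical three-master StatefulSet (adjust to your deployment):

# placeholder namespace/pod names - adjust to your deployment
# deleting all master pods at once lets the StatefulSet recreate them together,
# so the masters start within the same window and can elect a leader
kubectl -n seaweedfs delete pod seaweedfs-master-0 seaweedfs-master-1 seaweedfs-master-2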

ginkel commented 8 months ago

A few hours ago, all SeaweedFS docker containers were updated in our cluster following the recent release.

Unfortunately, the cluster did not come back up, as the three master nodes failed to reach consensus. The root cause seems to be:

I0108 08:44:55.415683 file_util.go:27 Folder /data Permission: -rwxr-xr-x
I0108 08:44:55.415741 master.go:269 current: 10.147.254.1:9333 peers:10.147.254.2:9333,10.147.254.3:9333
I0108 08:44:55.415823 master_server.go:127 Volume Size Limit is 32 MB
I0108 08:44:55.416007 master.go:150 Start Seaweed Master 30GB 3.61 8ae00e47a at 0.0.0.0:9333
I0108 08:44:55.416069 raft_server.go:118 Starting RaftServer with 10.147.254.1:9333
I0108 08:44:55.647119 raft_server.go:167 current cluster leader: 
I0108 08:45:10.718926 master_server.go:215 [10.147.254.1:9333]  - is the leader.
I0108 08:45:10.719396 master.go:201 Start Seaweed Master 30GB 3.61 8ae00e47a grpc server at 0.0.0.0:19333
I0108 08:45:12.340698 masterclient.go:149 connect to 10.147.254.2:9333: rpc error: code = DeadlineExceeded desc = context deadline exceeded
I0108 08:45:12.461696 masterclient.go:149 connect to 10.147.254.3:9333: rpc error: code = DeadlineExceeded desc = context deadline exceeded
I0108 08:45:12.461726 masterclient.go:156 No existing leader found!
I0108 08:45:12.461733 raft_server.go:189 Initializing new cluster
I0108 08:45:12.461767 master_server.go:174 leader change event:  => 10.147.254.1:9333
I0108 08:45:12.461784 master_server.go:177 [10.147.254.1:9333] 10.147.254.1:9333 becomes leader.
I0108 08:45:12.464279 master_server.go:174 leader change event: 10.147.254.1:9333 => 
I0108 08:45:28.892161 master_server.go:174 leader change event:  => 10.147.254.1:9333
I0108 08:45:28.892185 master_server.go:177 [10.147.254.1:9333] 10.147.254.1:9333 becomes leader.
I0108 08:45:28.985814 masterclient.go:210 master 10.147.254.2:9333 redirected to leader 10.147.254.1:9333
I0108 08:45:28.986860 master_grpc_server.go:349 + client .master@10.147.254.1:9333
I0108 08:45:28.999259 master_grpc_server.go:349 + client .filer@10.147.254.2:8888
I0108 08:45:28.999462 masterclient.go:247 + master@10.147.254.1:9333 noticed .filer 10.147.254.2:8888
panic: raft: Index is beyond end of log: 3 1252

goroutine 194 [running]:
github.com/seaweedfs/raft.(*Log).getEntriesAfter(0xc000534460, 0x4e4, 0x7d0)
        /go/pkg/mod/github.com/seaweedfs/raft@v1.1.3/log.go:256 +0x765
github.com/seaweedfs/raft.(*Peer).flush(0xc000a02400)
        /go/pkg/mod/github.com/seaweedfs/raft@v1.1.3/peer.go:179 +0xc5
github.com/seaweedfs/raft.(*Peer).heartbeat(0xc000a02400, 0x0?)
        /go/pkg/mod/github.com/seaweedfs/raft@v1.1.3/peer.go:167 +0x1c6
github.com/seaweedfs/raft.(*Peer).startHeartbeat.func1()
        /go/pkg/mod/github.com/seaweedfs/raft@v1.1.3/peer.go:100 +0x59
created by github.com/seaweedfs/raft.(*Peer).startHeartbeat in goroutine 76
        /go/pkg/mod/github.com/seaweedfs/raft@v1.1.3/peer.go:98 +0x116

The other nodes just report topo leader: leader not selected yet.

The panic happened on 10.147.254.1.

Werberus commented 7 months ago

Hi @chrislusf, I'm seeing a similar problem. Sometimes after restarting the StatefulSet in k8s (pods are restarted sequentially, starting with the last one), the SeaweedFS system remains without an active leader. If I understand correctly, the masters cannot reach consensus because masters 1 and 2 consider master 0 the leader, while master 0 reports that no leader has been selected. Using release: 3.59

Master-2:

I0122 10:47:22.701666 file_util.go:27 Folder /tmp Permission: -rwxrwxrwx
I0122 10:47:22.701728 master.go:269 current: seaweedfs1-test-master-2.seaweedfs1-test-master-peer.seaweedfs-test:9333 peers:seaweedfs1-test-master-0.seaweedfs1-test-master-peer.seaweedfs-test:9333,seaweedfs1-test-master-1.seaweedfs1-test-master-peer.seaweedfs-test:9333,seaweedfs1-test-master-2.seaweedfs1-test-master-peer.seaweedfs-test:9333
I0122 10:47:22.701793 master_server.go:127 Volume Size Limit is 30720 MB
I0122 10:47:22.701961 master.go:150 Start Seaweed Master 8000GB 3.59 27b34f379 at seaweedfs1-test-master-2.seaweedfs1-test-master-peer.seaweedfs-test:9333
I0122 10:47:22.703613 raft_server.go:118 Starting RaftServer with seaweedfs1-test-master-2.seaweedfs1-test-master-peer.seaweedfs-test:9333
I0122 10:47:22.705523 raft_server.go:167 current cluster leader: 
I0122 10:47:35.856231 master_server.go:215 [seaweedfs1-test-master-2.seaweedfs1-test-master-peer.seaweedfs-test:9333]  - is the leader.
I0122 10:47:35.857871 master.go:201 Start Seaweed Master 8000GB 3.59 27b34f379 grpc server at seaweedfs1-test-master-2.seaweedfs1-test-master-peer.seaweedfs-test:19333
I0122 10:47:35.862577 masterclient.go:247 + master@seaweedfs1-test-master-2.seaweedfs1-test-master-peer.seaweedfs-test:9333 noticed .master seaweedfs1-test-master-2.seaweedfs1-test-master-peer.seaweedfs-test:9333
I0122 10:47:36.491762 master_server.go:174 leader change event:  => seaweedfs1-test-master-0.seaweedfs1-test-master-peer.seaweedfs-test:9333
I0122 10:47:36.491877 master_server.go:177 [seaweedfs1-test-master-2.seaweedfs1-test-master-peer.seaweedfs-test:9333] seaweedfs1-test-master-0.seaweedfs1-test-master-peer.seaweedfs-test:9333 becomes leader.
I0122 10:47:45.753538 masterclient.go:249 - master@seaweedfs1-test-master-2.seaweedfs1-test-master-peer.seaweedfs-test:9333 noticed .master seaweedfs1-test-master-1.seaweedfs1-test-master-peer.seaweedfs-test:9333
I0122 10:48:30.632838 masterclient.go:247 + master@seaweedfs1-test-master-2.seaweedfs1-test-master-peer.seaweedfs-test:9333 noticed .master seaweedfs1-test-master-1.seaweedfs1-test-master-peer.seaweedfs-test:9333
I0122 10:48:46.690386 masterclient.go:225 .master masterClient failed to receive from seaweedfs1-test-master-0.seaweedfs1-test-master-peer.seaweedfs-test:9333: rpc error: code = Unavailable desc = error reading from server: EOF
I0122 10:48:46.700570 masterclient.go:210 master seaweedfs1-test-master-1.seaweedfs1-test-master-peer.seaweedfs-test:9333 redirected to leader seaweedfs1-test-master-0.seaweedfs1-test-master-peer.seaweedfs-test:9333
I0122 10:48:46.704131 masterclient.go:210 master seaweedfs1-test-master-2.seaweedfs1-test-master-peer.seaweedfs-test:9333 redirected to leader seaweedfs1-test-master-0.seaweedfs1-test-master-peer.seaweedfs-test:9333
I0122 10:49:07.708519 masterclient.go:210 master seaweedfs1-test-master-1.seaweedfs1-test-master-peer.seaweedfs-test:9333 redirected to leader seaweedfs1-test-master-0.seaweedfs1-test-master-peer.seaweedfs-test:9333
I0122 10:49:07.712137 masterclient.go:210 master seaweedfs1-test-master-2.seaweedfs1-test-master-peer.seaweedfs-test:9333 redirected to leader seaweedfs1-test-master-0.seaweedfs1-test-master-peer.seaweedfs-test:9333
I0122 10:49:08.716211 masterclient.go:210 master seaweedfs1-test-master-1.seaweedfs1-test-master-peer.seaweedfs-test:9333 redirected to leader seaweedfs1-test-master-0.seaweedfs1-test-master-peer.seaweedfs-test:9333
I0122 10:49:08.717889 masterclient.go:210 master seaweedfs1-test-master-2.seaweedfs1-test-master-peer.seaweedfs-test:9333 redirected to leader seaweedfs1-test-master-0.seaweedfs1-test-master-peer.seaweedfs-test:9333
I0122 10:49:09.724137 masterclient.go:210 master seaweedfs1-test-master-1.seaweedfs1-test-master-peer.seaweedfs-test:9333 redirected to leader seaweedfs1-test-master-0.seaweedfs1-test-master-peer.seaweedfs-test:9333
I0122 10:49:09.725990 masterclient.go:210 master seaweedfs1-test-master-2.seaweedfs1-test-master-peer.seaweedfs-test:9333 redirected to leader seaweedfs1-test-master-0.seaweedfs1-test-master-peer.seaweedfs-test:9333
I0122 10:49:10.729051 masterclient.go:210 master seaweedfs1-test-master-1.seaweedfs1-test-master-peer.seaweedfs-test:9333 redirected to leader seaweedfs1-test-master-0.seaweedfs1-test-master-peer.seaweedfs-test:9333
I0122 10:49:10.730693 masterclient.go:210 master seaweedfs1-test-master-2.seaweedfs1-test-master-peer.seaweedfs-test:9333 redirected to leader seaweedfs1-test-master-0.seaweedfs1-test-master-peer.seaweedfs-test:9333
I0122 10:49:11.734903 masterclient.go:210 master seaweedfs1-test-master-1.seaweedfs1-test-master-peer.seaweedfs-test:9333 redirected to leader seaweedfs1-test-master-0.seaweedfs1-test-master-peer.seaweedfs-test:9333
I0122 10:49:11.736706 masterclient.go:210 master seaweedfs1-test-master-2.seaweedfs1-test-master-peer.seaweedfs-test:9333 redirected to leader seaweedfs1-test-master-0.seaweedfs1-test-master-peer.seaweedfs-test:9333
I0122 10:49:12.741466 masterclient.go:210 master seaweedfs1-test-master-1.seaweedfs1-test-master-peer.seaweedfs-test:9333 redirected to leader seaweedfs1-test-master-0.seaweedfs1-test-master-peer.seaweedfs-test:9333
I0122 10:49:12.743285 masterclient.go:210 master seaweedfs1-test-master-2.seaweedfs1-test-master-peer.seaweedfs-test:9333 redirected to leader seaweedfs1-test-master-0.seaweedfs1-test-master-peer.seaweedfs-test:9333
I0122 10:49:45.025515 masterclient.go:202 .master masterClient failed to receive from seaweedfs1-test-master-0.seaweedfs1-test-master-peer.seaweedfs-test:9333: rpc error: code = Unknown desc = raft.Server: Not current leader
I0122 10:49:45.029111 masterclient.go:210 master seaweedfs1-test-master-1.seaweedfs1-test-master-peer.seaweedfs-test:9333 redirected to leader seaweedfs1-test-master-0.seaweedfs1-test-master-peer.seaweedfs-test:9333
I0122 10:50:20.505485 masterclient.go:202 .master masterClient failed to receive from seaweedfs1-test-master-0.seaweedfs1-test-master-peer.seaweedfs-test:9333: rpc error: code = Unknown desc = raft.Server: Not current leader
I0122 10:50:20.507376 masterclient.go:210 master seaweedfs1-test-master-2.seaweedfs1-test-master-peer.seaweedfs-test:9333 redirected to leader seaweedfs1-test-master-0.seaweedfs1-test-master-peer.seaweedfs-test:9333
I0122 10:50:56.346676 masterclient.go:202 .master masterClient failed to receive from seaweedfs1-test-master-0.seaweedfs1-test-master-peer.seaweedfs-test:9333: rpc error: code = Unknown desc = raft.Server: Not current leader
I0122 10:50:59.335556 masterclient.go:202 .master masterClient failed to receive from seaweedfs1-test-master-0.seaweedfs1-test-master-peer.seaweedfs-test:9333: rpc error: code = Unavailable desc = error reading from server: EOF
I0122 10:50:59.342087 masterclient.go:210 master seaweedfs1-test-master-1.seaweedfs1-test-master-peer.seaweedfs-test:9333 redirected to leader seaweedfs1-test-master-0.seaweedfs1-test-master-peer.seaweedfs-test:9333
I0122 10:50:59.344114 masterclient.go:210 master seaweedfs1-test-master-2.seaweedfs1-test-master-peer.seaweedfs-test:9333 redirected to leader seaweedfs1-test-master-0.seaweedfs1-test-master-peer.seaweedfs-test:9333
I0122 10:51:00.349951 masterclient.go:210 master seaweedfs1-test-master-1.seaweedfs1-test-master-peer.seaweedfs-test:9333 redirected to leader seaweedfs1-test-master-0.seaweedfs1-test-master-peer.seaweedfs-test:9333
I0122 10:51:00.354801 masterclient.go:210 master seaweedfs1-test-master-2.seaweedfs1-test-master-peer.seaweedfs-test:9333 redirected to leader seaweedfs1-test-master-0.seaweedfs1-test-master-peer.seaweedfs-test:9333
I0122 10:51:01.358481 masterclient.go:210 master seaweedfs1-test-master-1.seaweedfs1-test-master-peer.seaweedfs-test:9333 redirected to leader seaweedfs1-test-master-0.seaweedfs1-test-master-peer.seaweedfs-test:9333
I0122 10:51:02.364186 masterclient.go:210 master seaweedfs1-test-master-1.seaweedfs1-test-master-peer.seaweedfs-test:9333 redirected to leader seaweedfs1-test-master-0.seaweedfs1-test-master-peer.seaweedfs-test:9333
I0122 10:51:03.369979 masterclient.go:210 master seaweedfs1-test-master-1.seaweedfs1-test-master-peer.seaweedfs-test:9333 redirected to leader seaweedfs1-test-master-0.seaweedfs1-test-master-peer.seaweedfs-test:9333
I0122 10:51:04.377422 masterclient.go:210 master seaweedfs1-test-master-1.seaweedfs1-test-master-peer.seaweedfs-test:9333 redirected to leader seaweedfs1-test-master-0.seaweedfs1-test-master-peer.seaweedfs-test:9333
I0122 10:51:05.383871 masterclient.go:210 master seaweedfs1-test-master-2.seaweedfs1-test-master-peer.seaweedfs-test:9333 redirected to leader seaweedfs1-test-master-0.seaweedfs1-test-master-peer.seaweedfs-test:9333
I0122 10:51:06.388613 masterclient.go:210 master seaweedfs1-test-master-2.seaweedfs1-test-master-peer.seaweedfs-test:9333 redirected to leader seaweedfs1-test-master-0.seaweedfs1-test-master-peer.seaweedfs-test:9333
I0122 10:51:07.394273 masterclient.go:210 master seaweedfs1-test-master-2.seaweedfs1-test-master-peer.seaweedfs-test:9333 redirected to leader seaweedfs1-test-master-0.seaweedfs1-test-master-peer.seaweedfs-test:9333
I0122 10:51:08.401567 masterclient.go:210 master seaweedfs1-test-master-2.seaweedfs1-test-master-peer.seaweedfs-test:9333 redirected to leader seaweedfs1-test-master-0.seaweedfs1-test-master-peer.seaweedfs-test:9333
I0122 10:51:09.407844 masterclient.go:210 master seaweedfs1-test-master-2.seaweedfs1-test-master-peer.seaweedfs-test:9333 redirected to leader seaweedfs1-test-master-0.seaweedfs1-test-master-peer.seaweedfs-test:9333
I0122 10:51:10.413575 masterclient.go:210 master seaweedfs1-test-master-2.seaweedfs1-test-master-peer.seaweedfs-test:9333 redirected to leader seaweedfs1-test-master-0.seaweedfs1-test-master-peer.seaweedfs-test:9333
I0122 10:51:11.418572 masterclient.go:210 master seaweedfs1-test-master-2.seaweedfs1-test-master-peer.seaweedfs-test:9333 redirected to leader seaweedfs1-test-master-0.seaweedfs1-test-master-peer.seaweedfs-test:9333
I0122 10:51:12.423481 masterclient.go:210 master seaweedfs1-test-master-2.seaweedfs1-test-master-peer.seaweedfs-test:9333 redirected to leader seaweedfs1-test-master-0.seaweedfs1-test-master-peer.seaweedfs-test:9333
I0122 10:51:13.430295 masterclient.go:210 master seaweedfs1-test-master-2.seaweedfs1-test-master-peer.seaweedfs-test:9333 redirected to leader seaweedfs1-test-master-0.seaweedfs1-test-master-peer.seaweedfs-test:9333
I0122 10:51:14.434177 masterclient.go:210 master seaweedfs1-test-master-1.seaweedfs1-test-master-peer.seaweedfs-test:9333 redirected to leader seaweedfs1-test-master-0.seaweedfs1-test-master-peer.seaweedfs-test:9333
I0122 10:51:15.442421 masterclient.go:210 master seaweedfs1-test-master-2.seaweedfs1-test-master-peer.seaweedfs-test:9333 redirected to leader seaweedfs1-test-master-0.seaweedfs1-test-master-peer.seaweedfs-test:9333
I0122 10:51:49.003968 masterclient.go:202 .master masterClient failed to receive from seaweedfs1-test-master-0.seaweedfs1-test-master-peer.seaweedfs-test:9333: rpc error: code = Unknown desc = raft.Server: Not current leader

Master-1

I0122 10:48:14.517132 file_util.go:27 Folder /tmp Permission: -rwxrwxrwx
I0122 10:48:14.517186 master.go:269 current: seaweedfs1-test-master-1.seaweedfs1-test-master-peer.seaweedfs-test:9333 peers:seaweedfs1-test-master-0.seaweedfs1-test-master-peer.seaweedfs-test:9333,seaweedfs1-test-master-1.seaweedfs1-test-master-peer.seaweedfs-test:9333,seaweedfs1-test-master-2.seaweedfs1-test-master-peer.seaweedfs-test:9333
I0122 10:48:14.517249 master_server.go:127 Volume Size Limit is 30720 MB
I0122 10:48:14.517381 master.go:150 Start Seaweed Master 8000GB 3.59 27b34f379 at seaweedfs1-test-master-1.seaweedfs1-test-master-peer.seaweedfs-test:9333
I0122 10:48:14.518944 raft_server.go:118 Starting RaftServer with seaweedfs1-test-master-1.seaweedfs1-test-master-peer.seaweedfs-test:9333
I0122 10:48:14.520382 raft_server.go:167 current cluster leader: 
I0122 10:48:30.627759 master_server.go:215 [seaweedfs1-test-master-1.seaweedfs1-test-master-peer.seaweedfs-test:9333]  - is the leader.
I0122 10:48:30.629767 master.go:201 Start Seaweed Master 8000GB 3.59 27b34f379 grpc server at seaweedfs1-test-master-1.seaweedfs1-test-master-peer.seaweedfs-test:19333
I0122 10:48:30.635656 masterclient.go:247 + master@seaweedfs1-test-master-1.seaweedfs1-test-master-peer.seaweedfs-test:9333 noticed .master seaweedfs1-test-master-1.seaweedfs1-test-master-peer.seaweedfs-test:9333
I0122 10:48:32.385529 master_server.go:174 leader change event:  => seaweedfs1-test-master-0.seaweedfs1-test-master-peer.seaweedfs-test:9333
I0122 10:48:32.385550 master_server.go:177 [seaweedfs1-test-master-1.seaweedfs1-test-master-peer.seaweedfs-test:9333] seaweedfs1-test-master-0.seaweedfs1-test-master-peer.seaweedfs-test:9333 becomes leader.
I0122 10:48:46.690417 masterclient.go:249 - master@seaweedfs1-test-master-1.seaweedfs1-test-master-peer.seaweedfs-test:9333 noticed .filer seaweedfs1-test-filer-0.seaweedfs1-test-filer-peer.seaweedfs-test:8888
I0122 10:48:46.690467 masterclient.go:249 - master@seaweedfs1-test-master-1.seaweedfs1-test-master-peer.seaweedfs-test:9333 noticed .master seaweedfs1-test-master-2.seaweedfs1-test-master-peer.seaweedfs-test:9333
I0122 10:48:46.690483 masterclient.go:249 - master@seaweedfs1-test-master-1.seaweedfs1-test-master-peer.seaweedfs-test:9333 noticed .filer seaweedfs1-test-filer-8.seaweedfs1-test-filer-peer.seaweedfs-test:8888
I0122 10:48:46.690788 masterclient.go:249 - master@seaweedfs1-test-master-1.seaweedfs1-test-master-peer.seaweedfs-test:9333 noticed .filer seaweedfs1-test-filer-6.seaweedfs1-test-filer-peer.seaweedfs-test:8888
I0122 10:48:46.690811 masterclient.go:249 - master@seaweedfs1-test-master-1.seaweedfs1-test-master-peer.seaweedfs-test:9333 noticed .filer seaweedfs1-test-filer-7.seaweedfs1-test-filer-peer.seaweedfs-test:8888
I0122 10:48:46.691865 masterclient.go:225 .master masterClient failed to receive from seaweedfs1-test-master-0.seaweedfs1-test-master-peer.seaweedfs-test:9333: rpc error: code = Unavailable desc = error reading from server: EOF
I0122 10:48:46.702370 masterclient.go:210 master seaweedfs1-test-master-1.seaweedfs1-test-master-peer.seaweedfs-test:9333 redirected to leader seaweedfs1-test-master-0.seaweedfs1-test-master-peer.seaweedfs-test:9333
I0122 10:48:46.704444 masterclient.go:210 master seaweedfs1-test-master-2.seaweedfs1-test-master-peer.seaweedfs-test:9333 redirected to leader seaweedfs1-test-master-0.seaweedfs1-test-master-peer.seaweedfs-test:9333
W0122 10:48:51.694796 master_grpc_server.go:102 SendHeartbeat.Recv: rpc error: code = Canceled desc = context canceled
W0122 10:48:51.694803 master_grpc_server.go:102 SendHeartbeat.Recv: rpc error: code = Canceled desc = context canceled
W0122 10:48:51.702929 master_grpc_server.go:102 SendHeartbeat.Recv: rpc error: code = Canceled desc = context canceled
I0122 10:49:07.708062 masterclient.go:210 master seaweedfs1-test-master-1.seaweedfs1-test-master-peer.seaweedfs-test:9333 redirected to leader seaweedfs1-test-master-0.seaweedfs1-test-master-peer.seaweedfs-test:9333
I0122 10:49:07.711115 masterclient.go:210 master seaweedfs1-test-master-2.seaweedfs1-test-master-peer.seaweedfs-test:9333 redirected to leader seaweedfs1-test-master-0.seaweedfs1-test-master-peer.seaweedfs-test:9333
I0122 10:49:08.714516 masterclient.go:210 master seaweedfs1-test-master-1.seaweedfs1-test-master-peer.seaweedfs-test:9333 redirected to leader seaweedfs1-test-master-0.seaweedfs1-test-master-peer.seaweedfs-test:9333
I0122 10:49:45.299657 masterclient.go:202 .master masterClient failed to receive from seaweedfs1-test-master-0.seaweedfs1-test-master-peer.seaweedfs-test:9333: rpc error: code = Unknown desc = raft.Server: Not current leader
I0122 10:49:45.301897 masterclient.go:210 master seaweedfs1-test-master-1.seaweedfs1-test-master-peer.seaweedfs-test:9333 redirected to leader seaweedfs1-test-master-0.seaweedfs1-test-master-peer.seaweedfs-test:9333
W0122 10:49:51.980234 master_grpc_server.go:102 SendHeartbeat.Recv: rpc error: code = Canceled desc = context canceled
W0122 10:49:52.212807 master_grpc_server.go:102 SendHeartbeat.Recv: rpc error: code = Canceled desc = context canceled
W0122 10:49:53.448388 master_grpc_server.go:102 SendHeartbeat.Recv: rpc error: code = Canceled desc = context canceled
I0122 10:50:21.271258 masterclient.go:202 .master masterClient failed to receive from seaweedfs1-test-master-0.seaweedfs1-test-master-peer.seaweedfs-test:9333: rpc error: code = Unknown desc = raft.Server: Not current leader
I0122 10:50:21.274840 masterclient.go:210 master seaweedfs1-test-master-2.seaweedfs1-test-master-peer.seaweedfs-test:9333 redirected to leader seaweedfs1-test-master-0.seaweedfs1-test-master-peer.seaweedfs-test:9333
I0122 10:50:55.509979 masterclient.go:202 .master masterClient failed to receive from seaweedfs1-test-master-0.seaweedfs1-test-master-peer.seaweedfs-test:9333: rpc error: code = Unknown desc = raft.Server: Not current leader
I0122 10:50:59.334319 masterclient.go:202 .master masterClient failed to receive from seaweedfs1-test-master-0.seaweedfs1-test-master-peer.seaweedfs-test:9333: rpc error: code = Unavailable desc = error reading from server: EOF
I0122 10:50:59.335682 masterclient.go:210 master seaweedfs1-test-master-1.seaweedfs1-test-master-peer.seaweedfs-test:9333 redirected to leader seaweedfs1-test-master-0.seaweedfs1-test-master-peer.seaweedfs-test:9333
I0122 10:50:59.343000 masterclient.go:210 master seaweedfs1-test-master-2.seaweedfs1-test-master-peer.seaweedfs-test:9333 redirected to leader seaweedfs1-test-master-0.seaweedfs1-test-master-peer.seaweedfs-test:9333
I0122 10:51:00.345846 masterclient.go:210 master seaweedfs1-test-master-1.seaweedfs1-test-master-peer.seaweedfs-test:9333 redirected to leader seaweedfs1-test-master-0.seaweedfs1-test-master-peer.seaweedfs-test:9333
W0122 10:51:04.092743 master_grpc_server.go:102 SendHeartbeat.Recv: rpc error: code = Canceled desc = context canceled
W0122 10:51:04.336710 master_grpc_server.go:102 SendHeartbeat.Recv: rpc error: code = Canceled desc = context canceled

Master-0

I0122 11:00:48.127972 file_util.go:27 Folder /tmp Permission: -rwxrwxrwx
I0122 11:00:48.128030 master.go:269 current: seaweedfs1-test-master-0.seaweedfs1-test-master-peer.seaweedfs-test:9333 peers:seaweedfs1-test-master-0.seaweedfs1-test-master-peer.seaweedfs-test:9333,seaweedfs1-test-master-1.seaweedfs1-test-master-peer.seaweedfs-test:9333,seaweedfs1-test-master-2.seaweedfs1-test-master-peer.seaweedfs-test:9333
I0122 11:00:48.128104 master_server.go:127 Volume Size Limit is 30720 MB
I0122 11:00:48.128248 master.go:150 Start Seaweed Master 8000GB 3.59 27b34f379 at seaweedfs1-test-master-0.seaweedfs1-test-master-peer.seaweedfs-test:9333
I0122 11:00:48.129681 raft_server.go:118 Starting RaftServer with seaweedfs1-test-master-0.seaweedfs1-test-master-peer.seaweedfs-test:9333
I0122 11:00:48.131081 raft_server.go:167 current cluster leader: 
I0122 11:01:04.881215 master_server.go:215 [seaweedfs1-test-master-0.seaweedfs1-test-master-peer.seaweedfs-test:9333]  - is the leader.
I0122 11:01:04.882598 master.go:201 Start Seaweed Master 8000GB 3.59 27b34f379 grpc server at seaweedfs1-test-master-0.seaweedfs1-test-master-peer.seaweedfs-test:19333
I0122 11:01:04.884655 masterclient.go:210 master seaweedfs1-test-master-1.seaweedfs1-test-master-peer.seaweedfs-test:9333 redirected to leader seaweedfs1-test-master-0.seaweedfs1-test-master-peer.seaweedfs-test:9333
I0122 11:01:06.386088 masterclient.go:152 existing leader is seaweedfs1-test-master-0.seaweedfs1-test-master-peer.seaweedfs-test:9333
E0122 11:01:30.666918 master_grpc_server.go:334 topo leader: leader not selected yet
E0122 11:01:31.623042 master_grpc_server.go:334 topo leader: leader not selected yet
E0122 11:01:32.185739 master_grpc_server.go:334 topo leader: leader not selected yet
E0122 11:01:32.812400 master_grpc_server.go:334 topo leader: leader not selected yet
E0122 11:01:33.528143 master_grpc_server.go:334 topo leader: leader not selected yet
W0122 11:01:33.677255 master_grpc_server.go:112 SendHeartbeat find leader: leader not selected yet
E0122 11:01:33.844042 master_grpc_server.go:334 topo leader: leader not selected yet
W0122 11:01:33.944083 master_grpc_server.go:112 SendHeartbeat find leader: leader not selected yet
E0122 11:01:34.172802 master_grpc_server.go:334 topo leader: leader not selected yet
E0122 11:01:34.711741 master_grpc_server.go:334 topo leader: leader not selected yet
E0122 11:01:35.431621 master_grpc_server.go:334 topo leader: leader not selected yet
E0122 11:01:36.079995 master_grpc_server.go:334 topo leader: leader not selected yet
W0122 11:01:36.144327 master_grpc_server.go:112 SendHeartbeat find leader: leader not selected yet
W0122 11:01:36.554208 master_grpc_server.go:112 SendHeartbeat find leader: leader not selected yet
W0122 11:01:36.573034 master_grpc_server.go:112 SendHeartbeat find leader: leader not selected yet
E0122 11:01:36.643826 master_grpc_server.go:334 topo leader: leader not selected yet
E0122 11:01:36.710947 master_grpc_server.go:334 topo leader: leader not selected yet
I0122 11:01:36.711168 masterclient.go:202 .master masterClient failed to receive from seaweedfs1-test-master-0.seaweedfs1-test-master-peer.seaweedfs-test:9333: rpc error: code = Unknown desc = raft.Server: Not current leader
I0122 11:01:36.714083 masterclient.go:210 master seaweedfs1-test-master-2.seaweedfs1-test-master-peer.seaweedfs-test:9333 redirected to leader seaweedfs1-test-master-0.seaweedfs1-test-master-peer.seaweedfs-test:9333
E0122 11:01:36.727741 master_grpc_server.go:334 topo leader: leader not selected yet
E0122 11:01:36.730446 master_grpc_server.go:334 topo leader: leader not selected yet
E0122 11:01:36.762296 master_grpc_server.go:334 topo leader: leader not selected yet
W0122 11:01:36.865193 master_grpc_server.go:112 SendHeartbeat find leader: leader not selected yet
W0122 11:01:36.870232 master_grpc_server.go:112 SendHeartbeat find leader: leader not selected yet
W0122 11:01:37.166880 master_grpc_server.go:112 SendHeartbeat find leader: leader not selected yet
E0122 11:01:37.368705 master_grpc_server.go:334 topo leader: leader not selected yet
Werberus commented 7 months ago

I managed to reproduce the situation and observe it in more detail. At the moment master-0 (the leader) restarts, one of the masters sees that the leader is no longer there, but the second master does not. The chronology of events is as follows:

ginkel commented 7 months ago

Is there at least a way to detect the error situation and react by restarting the master process? In Prometheus I can see that all master nodes return SeaweedFS_master_is_leader = 0 for an extended timeframe, but I'm basically looking for something that can be evaluated locally by a health check. Would SeaweedFS_wdclient_connect_updates{type="failed"} help?
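
One thing that might work as a purely local probe (a sketch, not an officially documented health check): the master's HTTP port serves a /cluster/status endpoint that reports the currently known leader, so a check could fail whenever no leader is reported. The exact JSON field name below is an assumption on my side, so please verify it against your version first:

# assumes /cluster/status returns JSON containing a "Leader" field when a
# leader is known - verify the field name on your version before relying on it
curl -sf http://127.0.0.1:9333/cluster/status | grep -q '"Leader"' || exit 1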

CodeRusher commented 3 months ago

We've also encountered this problem; it seems like goraft/raft has some issues. How about switching to hashicorp/raft? In SeaweedFS, is the support for it mature enough for production use? @chrislusf
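
As far as I can tell, newer weed master builds expose an opt-in flag for the hashicorp/raft implementation; a minimal sketch, assuming the flag is named -raftHashicorp (please confirm with weed master -h on your version):

# assumed flag name - confirm with "weed master -h" before relying on it
weed master -raftHashicorp -ip=master-0 -port=9333 \
  -peers=master-0:9333,master-1:9333,master-2:9333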

CodeRusher commented 3 months ago

What does the following panic mean, and under what circumstances is it triggered? It seems to cause the weed process to exit.

panic: raft: Index is beyond end of log: 3 1252