milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0
30.36k stars 2.91k forks source link

[Bug]: Too long time for recovering when ETCD pod failure or network partition #36393

Open chyezh opened 1 month ago

chyezh commented 1 month ago

Is there an existing issue for this?

Environment

- Milvus version: master-89397d1e66acc4547561b5f8beb7d43fc87e7c47
- Deployment mode(standalone or cluster):
- MQ type(rocksmq, pulsar or kafka):    
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

The etcd client in milvus accesses the etcd node namd etcd-0. The etcd-0 node is unavailable due to network partitioning and is in a state of repeated election, cannot apply write operations. The request timeout of the etcd client is too long (9 seconds in logs), so the process have been locked on the requesting node.

{"level":"warn","ts":"2024-09-19T06:06:59.497Z","caller":"v3rpc/interceptor.go:197","msg":"request stats","start time":"2024-09-19T06:06:50.496Z","time spent":"9.000383152s","remote":"10.15.1.24:34664","response type":"/etcdserverpb.KV/Txn","request count":0,"request size":0,"response count":0,"response size":0,"request content":""}
{"level":"info","ts":"2024-09-19T06:06:45.383Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"3d1b452b9ef8ed7f is starting a new election at term 4"}

Expected Behavior

shorter timeout for etcd operation and retry the operation to healthy etcd node.

Steps To Reproduce

No response

Milvus Log

No response

Anything else?

No response

chyezh commented 1 month ago

/assign chyezh

xiaofan-luan commented 1 month ago

@chyezh so we need to change to etcd timeout to shorter?

We need to retry on etcd fail to avoid etcd error and milvus panic. otherwise tuning the timeout to smaller could be dangerous. We can probably change to 3s with 3-5 retries

chyezh commented 1 month ago

@chyezh so we need to change to etcd timeout to shorter?

We need to retry on etcd fail to avoid etcd error and milvus panic. otherwise tuning the timeout to smaller could be dangerous. We can probably change to 3s with 3-5 retries

Yes, current default request timeout of etcdKV is 10s, it's too long. I need to confirm the implementation of the etcd client, then determine the optimization.

Meanwhile, Milvus' KV interface has a bad implementation, it does not receive RPC context.

func (kv *etcdKV) WalkWithPrefix(prefix string, paginationSize int, fn func([]byte, []byte) error) error {
    start := time.Now()
    prefix = path.Join(kv.rootPath, prefix)

    batch := int64(paginationSize)
    opts := []clientv3.OpOption{
        clientv3.WithSort(clientv3.SortByKey, clientv3.SortAscend),
        clientv3.WithLimit(batch),
        clientv3.WithRange(clientv3.GetPrefixRangeEnd(prefix)),
    }

    key := prefix
    for {
        ctx, cancel := context.WithTimeout(context.TODO(), kv.requestTimeout)
        defer cancel()
        resp, err := kv.getEtcdMeta(ctx, key, opts...)
chyezh commented 1 month ago

Another case found. It may be a bad case for etcd server 3.5.5. The write operation is sent to a follower etcd-1 at term 2, while the leader etcd-2 is lost. The etcd-0 and etcd-1 finish the election on term 3 at 02:44:44, etcd-0 became the new leader; so the write operation should be redirected to new leader as expected.

2024-09-19 02:44:44.347 {"level":"info","ts":"2024-09-19T02:44:44.347Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"3d1b452b9ef8ed7f is starting a new election at term 2"}
2024-09-19 02:44:44.347 {"level":"info","ts":"2024-09-19T02:44:44.347Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"3d1b452b9ef8ed7f became pre-candidate at term 2"}
2024-09-19 02:44:44.347 {"level":"info","ts":"2024-09-19T02:44:44.347Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"3d1b452b9ef8ed7f received MsgPreVoteResp from 3d1b452b9ef8ed7f at term 2"}
2024-09-19 02:44:44.347 {"level":"info","ts":"2024-09-19T02:44:44.347Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"3d1b452b9ef8ed7f [logterm: 2, index: 1524239] sent MsgPreVote request to 5390be8ecae377e2 at term 2"}
2024-09-19 02:44:44.347 {"level":"info","ts":"2024-09-19T02:44:44.347Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"3d1b452b9ef8ed7f [logterm: 2, index: 1524239] sent MsgPreVote request to d646976b154728f0 at term 2"}
2024-09-19 02:44:44.347 {"level":"info","ts":"2024-09-19T02:44:44.347Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"raft.node: 3d1b452b9ef8ed7f lost leader 5390be8ecae377e2 at term 2"}
2024-09-19 02:44:44.347 {"level":"info","ts":"2024-09-19T02:44:44.347Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"d646976b154728f0 [logterm: 2, index: 1524239, vote: 5390be8ecae377e2] cast MsgPreVote for 3d1b452b9ef8ed7f [logterm: 2, index: 1524239] at term 2"}
2024-09-19 02:44:44.348 {"level":"info","ts":"2024-09-19T02:44:44.348Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"3d1b452b9ef8ed7f received MsgPreVoteResp from d646976b154728f0 at term 2"}
2024-09-19 02:44:44.348 {"level":"info","ts":"2024-09-19T02:44:44.348Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"3d1b452b9ef8ed7f has received 2 MsgPreVoteResp votes and 0 vote rejections"}
2024-09-19 02:44:44.348 {"level":"info","ts":"2024-09-19T02:44:44.348Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"3d1b452b9ef8ed7f became candidate at term 3"}
2024-09-19 02:44:44.348 {"level":"info","ts":"2024-09-19T02:44:44.348Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"3d1b452b9ef8ed7f received MsgVoteResp from 3d1b452b9ef8ed7f at term 3"}
2024-09-19 02:44:44.348 {"level":"info","ts":"2024-09-19T02:44:44.348Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"3d1b452b9ef8ed7f [logterm: 2, index: 1524239] sent MsgVote request to 5390be8ecae377e2 at term 3"}
2024-09-19 02:44:44.348 {"level":"info","ts":"2024-09-19T02:44:44.348Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"3d1b452b9ef8ed7f [logterm: 2, index: 1524239] sent MsgVote request to d646976b154728f0 at term 3"}
2024-09-19 02:44:44.349 {"level":"info","ts":"2024-09-19T02:44:44.349Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"d646976b154728f0 [term: 2] received a MsgVote message with higher term from 3d1b452b9ef8ed7f [term: 3]"}
2024-09-19 02:44:44.349 {"level":"info","ts":"2024-09-19T02:44:44.349Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"d646976b154728f0 became follower at term 3"}
2024-09-19 02:44:44.349 {"level":"info","ts":"2024-09-19T02:44:44.349Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"d646976b154728f0 [logterm: 2, index: 1524239, vote: 0] cast MsgVote for 3d1b452b9ef8ed7f [logterm: 2, index: 1524239] at term 3"}
2024-09-19 02:44:44.349 {"level":"info","ts":"2024-09-19T02:44:44.349Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"raft.node: d646976b154728f0 lost leader 5390be8ecae377e2 at term 3"}
2024-09-19 02:44:44.351 {"level":"info","ts":"2024-09-19T02:44:44.350Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"3d1b452b9ef8ed7f received MsgVoteResp from d646976b154728f0 at term 3"}
2024-09-19 02:44:44.351 {"level":"info","ts":"2024-09-19T02:44:44.351Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"3d1b452b9ef8ed7f has received 2 MsgVoteResp votes and 0 vote rejections"}
2024-09-19 02:44:44.351 {"level":"info","ts":"2024-09-19T02:44:44.351Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"3d1b452b9ef8ed7f became leader at term 3"}
2024-09-19 02:44:44.351 {"level":"info","ts":"2024-09-19T02:44:44.351Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"raft.node: 3d1b452b9ef8ed7f elected leader 3d1b452b9ef8ed7f at term 3"}
2024-09-19 02:44:44.351 {"level":"info","ts":"2024-09-19T02:44:44.351Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"raft.node: d646976b154728f0 elected leader 3d1b452b9ef8ed7f at term 3"}
2024-09-19 02:44:50.780 {"level":"info","ts":"2024-09-19T02:44:50.780Z","caller":"etcdserver/raft.go:529","msg":"restarting local member","cluster-id":"d353a2e618515b93","local-member-id":"5390be8ecae377e2","commit-index":1524238}
2024-09-19 02:44:50.781 {"level":"info","ts":"2024-09-19T02:44:50.781Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"5390be8ecae377e2 switched to configuration voters=(4403189114337815935 6021522222292891618 15440194858660604144)"}
2024-09-19 02:44:50.781 {"level":"info","ts":"2024-09-19T02:44:50.781Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"5390be8ecae377e2 became follower at term 2"}
2024-09-19 02:44:50.781 {"level":"info","ts":"2024-09-19T02:44:50.781Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"newRaft 5390be8ecae377e2 [peers: [3d1b452b9ef8ed7f,5390be8ecae377e2,d646976b154728f0], term: 2, commit: 1524238, applied: 1520152, lastindex: 1524239, lastterm: 2]"}
2024-09-19 02:44:50.793 {"level":"info","ts":"2024-09-19T02:44:50.793Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"5390be8ecae377e2 [term: 2] received a MsgHeartbeat message with higher term from 3d1b452b9ef8ed7f [term: 3]"}
2024-09-19 02:44:50.793 {"level":"info","ts":"2024-09-19T02:44:50.793Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"5390be8ecae377e2 became follower at term 3"}
2024-09-19 02:44:50.793 {"level":"info","ts":"2024-09-19T02:44:50.793Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"raft.node: 5390be8ecae377e2 elected leader 3d1b452b9ef8ed7f at term 3"}
2024-09-19 02:44:52.729 {"level":"warn","ts":"2024-09-19T02:44:52.729Z","caller":"v3rpc/interceptor.go:197","msg":"request stats","start time":"2024-09-19T02:44:43.729Z","time spent":"9.000073827s","remote":"10.15.1.24:57410","response type":"/etcdserverpb.KV/Put","request count":1,"request size":44,"response count":0,"response size":0,"request content":"key:\"92461f86230a8d5/kv/gid/timestamp\" value_size:8 "}
chyezh commented 1 month ago

At etcd 3.5.5, there's no timeout control or redirect at server-side when leader changing.

Summary here:

  1. Sent to a follower lost leader will block until the etcd operation timeout at client side.
  2. Sent to a leader that network partitioned will block until the etcd operation timeout at client side.
xiaofan-luan commented 1 month ago

Seems that we need to seprate etcd timeout and etcdkv timeout(or operation timeout) by ourselves

stale[bot] commented 1 week ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.

chyezh commented 1 week ago

/reopen