milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0

[Enhancement]: Too long time for recovering when ETCD pod failure or network partition #36394

Closed chyezh closed 2 hours ago

chyezh commented 2 hours ago

Is there an existing issue for this?

What would you like to be added?

The etcd client in Milvus connects to the etcd node named etcd-0. Because of a network partition, etcd-0 is unavailable: it keeps starting new elections and cannot apply write operations. The etcd client's request timeout is too long (9 seconds in the logs below), so the process stays blocked on requests to that node.
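One general way to bound this kind of stall is to give each attempt a short deadline and move on to the next endpoint when it expires. The sketch below is only an illustration of that idea, not the Milvus code or the fix tracked in the linked issue. `call_with_failover`, the `request` callable, and the endpoint names other than etcd-0 are hypothetical, and it uses only the Python standard library rather than a real etcd client.

```python
import concurrent.futures as cf


def call_with_failover(endpoints, request, per_try_timeout):
    """Call request(endpoint) on each endpoint in order.

    An attempt is abandoned after per_try_timeout seconds, so one
    unresponsive node cannot hold the caller for longer than that.
    Returns (endpoint, result) for the first attempt that succeeds.
    NOTE: hypothetical helper for illustration, not an etcd client API.
    """
    errors = {}
    for endpoint in endpoints:
        pool = cf.ThreadPoolExecutor(max_workers=1)
        future = pool.submit(request, endpoint)
        try:
            return endpoint, future.result(timeout=per_try_timeout)
        except cf.TimeoutError:
            # The attempt is still running in its worker thread; we stop
            # waiting for it and try the next endpoint instead.
            errors[endpoint] = "timed out"
        except Exception as exc:  # the request itself failed
            errors[endpoint] = repr(exc)
        finally:
            # Do not block on a worker that may still be stuck.
            pool.shutdown(wait=False)
    raise RuntimeError(f"all endpoints failed: {errors}")
```

With a `request` that hangs on etcd-0 and answers on the other nodes, the caller waits at most `per_try_timeout` before trying the next endpoint. An abandoned attempt still occupies its worker thread until it returns, so a real client would also need to cancel the underlying RPC.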

{"level":"warn","ts":"2024-09-19T06:06:59.497Z","caller":"v3rpc/interceptor.go:197","msg":"request stats","start time":"2024-09-19T06:06:50.496Z","time spent":"9.000383152s","remote":"10.15.1.24:34664","response type":"/etcdserverpb.KV/Txn","request count":0,"request size":0,"response count":0,"response size":0,"request content":""}
{"level":"info","ts":"2024-09-19T06:06:45.383Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"3d1b452b9ef8ed7f is starting a new election at term 4"}
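To make the timing in these two entries easier to read, here is a small sketch that parses them and computes how long the Txn request was held and how long before the request the election started. The helper names (`parse_ts`, `request_window`) are made up for this example. The field names and values are copied from the log lines above.

```python
import json
from datetime import datetime

# The two log entries quoted above, verbatim.
REQUEST_LINE = '{"level":"warn","ts":"2024-09-19T06:06:59.497Z","caller":"v3rpc/interceptor.go:197","msg":"request stats","start time":"2024-09-19T06:06:50.496Z","time spent":"9.000383152s","remote":"10.15.1.24:34664","response type":"/etcdserverpb.KV/Txn","request count":0,"request size":0,"response count":0,"response size":0,"request content":""}'
ELECTION_LINE = '{"level":"info","ts":"2024-09-19T06:06:45.383Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"3d1b452b9ef8ed7f is starting a new election at term 4"}'

TS_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"


def parse_ts(value):
    """Parse the millisecond UTC timestamps used in these log lines."""
    return datetime.strptime(value, TS_FORMAT)


def request_window(request_line, election_line):
    """Return the request duration and the gap between election and request start."""
    request = json.loads(request_line)
    election = json.loads(election_line)
    start = parse_ts(request["start time"])
    end = parse_ts(request["ts"])
    election_at = parse_ts(election["ts"])
    return {
        "request_seconds": (end - start).total_seconds(),
        "election_to_request_seconds": (start - election_at).total_seconds(),
    }


if __name__ == "__main__":
    print(request_window(REQUEST_LINE, ELECTION_LINE))
```

For these entries the request was held for about 9 seconds, and it started about 5 seconds after the node began a new election at term 4.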

Why is this needed?

No response

Anything else?

No response

chyezh commented 2 hours ago

Not an enhancement, but a bug. See #36393.