milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0

[Bug]: After a period of no traffic, or only light traffic, the query performance of the Milvus cluster drops sharply. #30672

Closed · dushulin closed this issue 8 months ago

dushulin commented 8 months ago

Is there an existing issue for this?

Environment

- Milvus version:2.3.2
- Deployment mode(standalone or cluster):cluster
- MQ type(rocksmq, pulsar or kafka): pulsar    
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): centos
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

After a period of no traffic, or only very small query traffic, the query performance of the cluster suddenly dropped drastically. Checking the monitoring, we found that the performance problem was triggered after the segment num was greatly reduced (see the first screenshot). Looking at other querynode metrics, we found that after the segment num was reduced to 6, the related latency and runtime (memory, etc.) metrics also began to fluctuate much more strongly (see the remaining screenshots).

PS: the total number of vectors is about 1 million (100w).

Expected Behavior

No response

Steps To Reproduce

No response

Milvus Log

No response

Anything else?

No response

github-actions[bot] commented 8 months ago

The title and description of this issue contains Chinese. Please use English to describe your issue.

xiaofan-luan commented 8 months ago

@dushulin can you give more explanation about the performance drop?

This might be due to the cached connection being purged, but that should only add single-digit milliseconds.

dushulin commented 8 months ago

> @dushulin can you give more explanation about the performance drop?
>
> This might be due to the cached connection being purged, but that should only add single-digit milliseconds.

The traffic in our current scenario is also very low. The original query latency was more than ten milliseconds, but after the segment num was reduced, the query latency increased to more than ten seconds, i.e. performance dropped by a factor of roughly a thousand. This is not a normal phenomenon. During this period we did not change any cluster configuration, and no nodes were restarted.
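
For reference, a minimal pymilvus sketch for timing searches to quantify a drop like this; the connection address, the collection name "my_collection", the field "embedding", and the 128-dim random vectors are placeholders, not the actual schema from this report.

```python
import random
import time

from pymilvus import Collection, connections

# Placeholder connection and collection; adjust to the real deployment.
connections.connect(host="127.0.0.1", port="19530")
coll = Collection("my_collection")
coll.load()

query_vector = [[random.random() for _ in range(128)]]

latencies = []
for _ in range(10):
    start = time.perf_counter()
    coll.search(
        data=query_vector,
        anns_field="embedding",
        param={"metric_type": "L2", "params": {"nprobe": 16}},
        limit=10,
    )
    latencies.append((time.perf_counter() - start) * 1000)

# Report the median and worst-case latency over the sample.
print(f"p50={sorted(latencies)[len(latencies) // 2]:.1f} ms, max={max(latencies):.1f} ms")
```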

xiaofan-luan commented 8 months ago

> @dushulin can you give more explanation about the performance drop? This might be due to the cached connection being purged, but that should only add single-digit milliseconds.
>
> The traffic in our current scenario is also very low. The original query latency was more than ten milliseconds, but after the segment num was reduced, the query latency increased to more than ten seconds, i.e. performance dropped by a factor of roughly a thousand. This is not a normal phenomenon. During this period we did not change any cluster configuration, and no nodes were restarted.

This is probably due to a complicated expression; each segment runs the expression serially. What expression are you using?

xiaofan-luan commented 8 months ago

For some expressions, especially on the JSON data type or dynamic fields, the filtering can take up to seconds.
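
To illustrate the kind of expression being asked about, here is a hedged pymilvus sketch contrasting a filter on a plain scalar field with one on a JSON field; "my_collection", "embedding", "category_id", and "meta" are placeholder names, not from this report.

```python
from pymilvus import Collection, connections

connections.connect(host="127.0.0.1", port="19530")  # placeholder address
coll = Collection("my_collection")                    # placeholder collection
coll.load()

# Filtering on a plain scalar field is typically cheap.
scalar_filtered = coll.search(
    data=[[0.0] * 128],
    anns_field="embedding",
    param={"metric_type": "L2", "params": {"nprobe": 16}},
    limit=10,
    expr="category_id in [1, 2, 3]",
)

# Filtering on a JSON or dynamic field is evaluated segment by segment and,
# per the comment above, can take up to seconds for some expressions.
json_filtered = coll.search(
    data=[[0.0] * 128],
    anns_field="embedding",
    param={"metric_type": "L2", "params": {"nprobe": 16}},
    limit=10,
    expr='meta["tag"] == "foo"',
)
```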

dushulin commented 8 months ago

> For some expressions, especially on the JSON data type or dynamic fields, the filtering can take up to seconds.

Not a complicated expr, just the Attu data preview page listing collection data. And my schema doesn't have a JSON type. PS: why are the memory, CPU usage, and available loaded data num indicators so volatile?
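
Roughly speaking, listing collection data like the Attu preview page boils down to a paged query such as the sketch below; this is an approximation, not Attu's exact call, and "my_collection", "id", and "embedding" are placeholder names.

```python
from pymilvus import Collection, connections

connections.connect(host="127.0.0.1", port="19530")  # placeholder address
coll = Collection("my_collection")                    # placeholder collection
coll.load()

# Fetch a small page of rows with a trivial filter; "id" and "embedding"
# are placeholder field names.
rows = coll.query(expr="id >= 0", output_fields=["id", "embedding"], limit=10)
print(rows)
```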

yanliang567 commented 8 months ago

@dushulin some suspected things are:

  1. the cpu usage is very low even before 2/14 (before this issue occurred)
  2. the queryable entity number drops to 0 many times after 2/14

I think we need more info about this issue:

  3. how much cpu and memory resource is requested and limited for each pod in your config?
  4. could you please share the collection schema, index params and a search code sample?
  5. did you insert/delete any data around 2/14 (before this issue happened)?
  6. please refer to this doc to export the whole Milvus logs for investigation
  7. please attach the etcd backup for investigation. Check https://github.com/milvus-io/birdwatcher for details about how to back up etcd with birdwatcher
  8. if possible, we'd like to see the metrics zoomed in to around 2/14 16:00, when the segment number decreased

/assign @dushulin

dushulin commented 8 months ago

> @dushulin some suspected things are:
>
>   1. the cpu usage is very low even before 2/14 (before this issue occurred)
>   2. the queryable entity number drops to 0 many times after 2/14
>
> I think we need more info about this issue:
>
>   3. how much cpu and memory resource is requested and limited for each pod in your config?
>   4. could you please share the collection schema, index params and a search code sample?
>   5. did you insert/delete any data around 2/14 (before this issue happened)?
>   6. please refer to this doc to export the whole Milvus logs for investigation
>   7. please attach the etcd backup for investigation. Check https://github.com/milvus-io/birdwatcher for details about how to back up etcd with birdwatcher
>   8. if possible, we'd like to see the metrics zoomed in to around 2/14 16:00, when the segment number decreased
>
> /assign @dushulin

I found some warning logs from querynode and etcd around 2/14.

querynode logs:

[2024/02/14 07:57:13.833 +00:00] [WARN] [pipeline/pipeline.go:51] ["some node(s) haven't received input"] [list="[nodeCtxTtChecker-FilterNode-by-dev-rootcoord-dml_14_446126244104862883v0,nodeCtxTtChecker-InsertNode-by-dev-rootcoord-dml_14_446126244104862883v0,nodeCtxTtChecker-DeleteNode-by-dev-rootcoord-dml_14_446126244104862883v0]"] ["duration "=2m0s]
[2024/02/14 07:59:13.833 +00:00] [WARN] [pipeline/pipeline.go:51] ["some node(s) haven't received input"] [list="[nodeCtxTtChecker-FilterNode-by-dev-rootcoord-dml_14_446126244104862883v0,nodeCtxTtChecker-InsertNode-by-dev-rootcoord-dml_14_446126244104862883v0,nodeCtxTtChecker-DeleteNode-by-dev-rootcoord-dml_14_446126244104862883v0]"] ["duration "=2m0s]
[2024/02/14 08:01:13.834 +00:00] [WARN] [pipeline/pipeline.go:51] ["some node(s) haven't received input"] [list="[nodeCtxTtChecker-InsertNode-by-dev-rootcoord-dml_14_446126244104862883v0,nodeCtxTtChecker-DeleteNode-by-dev-rootcoord-dml_14_446126244104862883v0,nodeCtxTtChecker-FilterNode-by-dev-rootcoord-dml_14_446126244104862883v0]"] ["duration "=2m0s]
[2024/02/14 08:03:13.834 +00:00] [WARN] [pipeline/pipeline.go:51] ["some node(s) haven't received input"] [list="[nodeCtxTtChecker-InsertNode-by-dev-rootcoord-dml_14_446126244104862883v0,nodeCtxTtChecker-DeleteNode-by-dev-rootcoord-dml_14_446126244104862883v0,nodeCtxTtChecker-FilterNode-by-dev-rootcoord-dml_14_446126244104862883v0]"] ["duration "=2m0s]
[2024/02/14 08:05:13.833 +00:00] [WARN] [pipeline/pipeline.go:51] ["some node(s) haven't received input"] [list="[nodeCtxTtChecker-InsertNode-by-dev-rootcoord-dml_14_446126244104862883v0,nodeCtxTtChecker-DeleteNode-by-dev-rootcoord-dml_14_446126244104862883v0,nodeCtxTtChecker-FilterNode-by-dev-rootcoord-dml_14_446126244104862883v0]"] ["duration "=2m0s]
[2024/02/14 08:07:13.833 +00:00] [WARN] [pipeline/pipeline.go:51] ["some node(s) haven't received input"] [list="[nodeCtxTtChecker-InsertNode-by-dev-rootcoord-dml_14_446126244104862883v0,nodeCtxTtChecker-DeleteNode-by-dev-rootcoord-dml_14_446126244104862883v0,nodeCtxTtChecker-FilterNode-by-dev-rootcoord-dml_14_446126244104862883v0]"] ["duration "=2m0s]
[2024/02/14 08:09:13.833 +00:00] [WARN] [pipeline/pipeline.go:51] ["some node(s) haven't received input"] [list="[nodeCtxTtChecker-InsertNode-by-dev-rootcoord-dml_14_446126244104862883v0,nodeCtxTtChecker-DeleteNode-by-dev-rootcoord-dml_14_446126244104862883v0,nodeCtxTtChecker-FilterNode-by-dev-rootcoord-dml_14_446126244104862883v0]"] ["duration "=2m0s]
[2024/02/14 08:11:13.833 +00:00] [WARN] [pipeline/pipeline.go:51] ["some node(s) haven't received input"] [list="[nodeCtxTtChecker-InsertNode-by-dev-rootcoord-dml_14_446126244104862883v0,nodeCtxTtChecker-DeleteNode-by-dev-rootcoord-dml_14_446126244104862883v0,nodeCtxTtChecker-FilterNode-by-dev-rootcoord-dml_14_446126244104862883v0]"] ["duration "=2m0s]

etcd logs:

{"level":"warn","ts":"2023-12-04T03:04:50.051Z","caller":"rafthttp/peer_status.go:66","msg":"peer became inactive (message send to peer failed)","peer-id":"55235cbdd02cb8de","error":"failed to dial 55235cbdd02cb8de on stream Message (dial tcp 10.177.24.215:2380: i/o timeout)"}
{"level":"warn","ts":"2024-02-14T07:55:11.958Z","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"55235cbdd02cb8de","rtt":"246.771178ms","error":"dial tcp 10.177.24.207:2380: i/o timeout"}
{"level":"warn","ts":"2024-02-14T07:55:15.297Z","caller":"rafthttp/peer_status.go:66","msg":"peer became inactive (message send to peer failed)","peer-id":"b882082d5e2b8796","error":"failed to write b882082d5e2b8796 on stream Message (write tcp 10.177.32.137:2380->10.177.46.200:38658: i/o timeout)"}
{"level":"warn","ts":"2024-02-14T07:55:15.298Z","caller":"rafthttp/stream.go:223","msg":"lost TCP streaming connection with remote peer","stream-writer-type":"stream Message","local-member-id":"256368f3e6b9a206","remote-peer-id":"b882082d5e2b8796"}
{"level":"warn","ts":"2024-02-14T07:55:16.959Z","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"55235cbdd02cb8de","rtt":"246.771178ms","error":"dial tcp 10.177.24.207:2380: i/o timeout"}
{"level":"warn","ts":"2024-02-14T07:55:21.959Z","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"55235cbdd02cb8de","rtt":"246.771178ms","error":"dial tcp 10.177.24.207:2380: i/o timeout"}
{"level":"warn","ts":"2024-02-14T07:55:26.959Z","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"55235cbdd02cb8de","rtt":"246.771178ms","error":"dial tcp 10.177.24.207:2380: i/o timeout"}
{"level":"warn","ts":"2024-02-14T07:55:31.960Z","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"55235cbdd02cb8de","rtt":"246.771178ms","error":"dial tcp 10.177.24.207:2380: i/o timeout"}
{"level":"warn","ts":"2024-02-14T07:55:36.961Z","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"55235cbdd02cb8de","rtt":"246.771178ms","error":"dial tcp 10.177.24.207:2380: i/o timeout"}
{"level":"warn","ts":"2024-02-14T07:55:41.962Z","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"55235cbdd02cb8de","rtt":"246.771178ms","error":"dial tcp 10.177.24.207:2380: i/o timeout"}
{"level":"warn","ts":"2024-02-14T07:55:46.963Z","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"55235cbdd02cb8de","rtt":"246.771178ms","error":"dial tcp 10.177.24.207:2380: i/o timeout"}
{"level":"warn","ts":"2024-02-14T07:55:51.546Z","caller":"rafthttp/stream.go:421","msg":"lost TCP streaming connection with remote peer","stream-reader-type":"stream Message","local-member-id":"256368f3e6b9a206","remote-peer-id":"55235cbdd02cb8de","error":"EOF"}
{"level":"warn","ts":"2024-02-14T07:55:51.874Z","caller":"rafthttp/peer_status.go:66","msg":"peer became inactive (message send to peer failed)","peer-id":"55235cbdd02cb8de","error":"failed to write 55235cbdd02cb8de on stream Message (write tcp 10.177.32.137:2380->10.177.24.207:44512: write: broken pipe)"}
{"level":"warn","ts":"2024-02-14T07:55:51.875Z","caller":"rafthttp/stream.go:223","msg":"lost TCP streaming connection with remote peer","stream-writer-type":"stream Message","local-member-id":"256368f3e6b9a206","remote-peer-id":"55235cbdd02cb8de"}
{"level":"warn","ts":"2024-02-14T07:55:52.143Z","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"b882082d5e2b8796","rtt":"1.166785173s","error":"dial tcp 10.177.46.200:2380: i/o timeout"}
{"level":"warn","ts":"2024-02-14T07:55:57.143Z","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"b882082d5e2b8796","rtt":"1.166785173s","error":"dial tcp: i/o timeout"}
{"level":"warn","ts":"2024-02-14T07:56:02.144Z","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"b882082d5e2b8796","rtt":"1.166785173s","error":"dial tcp: lookup cloudml-milvus-etcd-1.cloudml-milvus-etcd-headless.milvusdb.svc.cluster.local on 10.254.0.2:53: read udp 10.177.32.137:46648->10.254.0.2:53: i/o timeout"}

another etcd node's logs:

{"level":"warn","ts":"2024-02-14T07:53:42.754Z","caller":"etcdserver/v3_server.go:840","msg":"waiting for ReadIndex response took too long, retrying","sent-request-id":9770150591355394698,"retry-timeout":"500ms"}
{"level":"warn","ts":"2024-02-14T07:53:43.254Z","caller":"etcdserver/v3_server.go:840","msg":"waiting for ReadIndex response took too long, retrying","sent-request-id":9770150591355394698,"retry-timeout":"500ms"}
{"level":"warn","ts":"2024-02-14T07:53:43.755Z","caller":"etcdserver/v3_server.go:840","msg":"waiting for ReadIndex response took too long, retrying","sent-request-id":9770150591355394698,"retry-timeout":"500ms"}
{"level":"warn","ts":"2024-02-14T07:53:44.256Z","caller":"etcdserver/v3_server.go:840","msg":"waiting for ReadIndex response took too long, retrying","sent-request-id":9770150591355394698,"retry-timeout":"500ms"}
{"level":"warn","ts":"2024-02-14T07:53:44.757Z","caller":"etcdserver/v3_server.go:840","msg":"waiting for ReadIndex response took too long, retrying","sent-request-id":9770150591355394698,"retry-timeout":"500ms"}
{"level":"warn","ts":"2024-02-14T07:53:45.257Z","caller":"etcdserver/v3_server.go:840","msg":"waiting for ReadIndex response took too long, retrying","sent-request-id":9770150591355394698,"retry-timeout":"500ms"}
{"level":"warn","ts":"2024-02-14T07:53:45.757Z","caller":"etcdserver/v3_server.go:840","msg":"waiting for ReadIndex response took too long, retrying","sent-request-id":9770150591355394698,"retry-timeout":"500ms"}
{"level":"warn","ts":"2024-02-14T07:53:46.257Z","caller":"etcdserver/v3_server.go:840","msg":"waiting for ReadIndex response took too long, retrying","sent-request-id":9770150591355394698,"retry-timeout":"500ms"}
{"level":"warn","ts":"2024-02-14T07:53:46.758Z","caller":"etcdserver/v3_server.go:840","msg":"waiting for ReadIndex response took too long, retrying","sent-request-id":9770150591355394698,"retry-timeout":"500ms"}
{"level":"warn","ts":"2024-02-14T07:53:47.252Z","caller":"etcdserver/util.go:166","msg":"apply request took too long","took":"4.999453798s","expected-duration":"100ms","prefix":"read-only range ","request":"key:\"health\" ","response":"","error":"context deadline exceeded"}

xiaofan-luan commented 8 months ago

I guess there is some problem inside the cluster. It might be network or disk related.

yanliang567 commented 8 months ago

@dushulin we still need the detailed info as requested above, thx.

dushulin commented 8 months ago

Can we use helm uninstall and helm install to recover the cluster, without data loss?


dushulin commented 8 months ago

> @dushulin some suspected things are:
>
>   1. the cpu usage is very low even before 2/14 (before this issue occurred)
>   2. the queryable entity number drops to 0 many times after 2/14
>
> I think we need more info about this issue:
>
>   3. how much cpu and memory resource is requested and limited for each pod in your config?
>   4. could you please share the collection schema, index params and a search code sample?
>   5. did you insert/delete any data around 2/14 (before this issue happened)?
>   6. please refer to this doc to export the whole Milvus logs for investigation
>   7. please attach the etcd backup for investigation. Check https://github.com/milvus-io/birdwatcher for details about how to back up etcd with birdwatcher
>   8. if possible, we'd like to see the metrics zoomed in to around 2/14 16:00, when the segment number decreased
>
> /assign @dushulin

  1. Yes, traffic is always low, so CPU usage is very low.
  2. Yes, the monitoring curve is inconsistent.
  3. The resource config is below:

rootCoordinator:
  replicas: 2
  activeStandby:
    enabled: true
  resources:
    limits:
      cpu: 1
      memory: 2Gi
indexCoordinator:
  replicas: 2
  activeStandby:
    enabled: true
  resources:
    limits:
      cpu: "1"
      memory: 1Gi
queryCoordinator:
  replicas: 2
  activeStandby:
    enabled: true
  resources:
    limits:
      cpu: "1"
      memory: 1Gi
dataCoordinator:
  replicas: 2
  activeStandby:
    enabled: true
  resources:
    limits:
      cpu: "1"
      memory: 1Gi
proxy:
  replicas: 2
  resources:
    limits:
      cpu: 2
      memory: 4Gi
queryNode:
  replicas: 3
  resources:
    limits:
      cpu: 6
      memory: 12Gi
dataNode:
  replicas: 3
  resources:
    limits:
      cpu: 2
      memory: 16Gi
indexNode:
  replicas: 3
  resources:
    limits:
      cpu: 4
      memory: 14Gi
etcd:
  autoCompactionMode: revision
  autoCompactionRetention: "1000"
  extraEnvVars:
    - name: ETCD_QUOTA_BACKEND_BYTES
      value: "4294967296"
    - name: ETCD_HEARTBEAT_INTERVAL
      value: "500"
    - name: ETCD_ELECTION_TIMEOUT
      value: "25000"
    - name: ETCD_SNAPSHOT_COUNT
      value: "10000"
    - name: ETCD_ENABLE_PPROF
      value: "true"
  persistence:
    accessMode: ReadWriteOnce
    enabled: true
    size: 30Gi  # SSD Required
    storageClass: starfs-sc-c5
  replicaCount: 3
  resources:
    limits:
      cpu: 4
      memory: 8Gi
    requests:
      cpu: 4
      memory: 8Gi

pulsar:
  enabled: true
  proxy:
    resources:
      requests:
        memory: 8Gi
        cpu: 2
    replicaCount: 2
    configData:
      PULSAR_MEM: >
        -Xms2048M -Xmx2048M -XX:MaxDirectMemorySize=2048M
      httpNumThreads: "100"
  zookeeper:
    volumes:
      persistence: true
      data:
        name: data
        size: 20Gi  # SSD Required
        storageClassName: starfs-sc-c5
    resources:
      requests:
        memory: 8Gi
        cpu: 2
    configData:
      PULSAR_MEM: >
        -Xms1024M -Xmx1024M
  bookkeeper:
    volumes:
      journal:
        name: journal
        size: 10Gi
        storageClassName: starfs-sc-c5
      ledgers:
        name: ledgers
        size: 30Gi  # SSD Required
        storageClassName: starfs-sc-c5
    resources:
      requests:
        memory: 16Gi
        cpu: 2
    configData:
      PULSAR_MEM: >
        -Xms4096M -Xmx4096M -XX:MaxDirectMemorySize=8192M
  broker:
    component: broker
    podMonitor:
      enabled: false
    replicaCount: 2
    resources:
      requests:
        memory: 18Gi
        cpu: 2
    configData:
      PULSAR_MEM: >
        -Xms4096M -Xmx4096M -XX:MaxDirectMemorySize=8192M

minio:
  resources:
    limits:
      cpu: 4
      memory: 16Gi
  persistence:
    storageClass: starfs-sc-c5
    accessMode: ReadWriteOnce
    size: 20Gi

  4. The image search and create collection functions follow the official website example, using pymilvus (see the sketch after this list).
  5. From 1/22 to 1/25 we inserted about 800k (80w) vectors, but after that the cluster ran normally.
  6. How should I provide the full logs to your team? I'm worried they contain sensitive company information.
  7. Do I need to back up all etcd data, or just use tools to check the status of etcd?
  8. [metrics screenshot around 2/14 attached]
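
For item 4, a hedged pymilvus sketch of the create-collection and search flow, patterned after the official image search example; the collection name "image_search_demo", the 512-dim "embedding" field, and the IVF_FLAT index params are placeholders, not the actual settings used in this cluster.

```python
from pymilvus import (
    Collection, CollectionSchema, DataType, FieldSchema, connections, utility,
)

connections.connect(host="127.0.0.1", port="19530")  # placeholder address

# Placeholder schema: an auto-id primary key plus a 512-dim image embedding.
fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=512),
]
schema = CollectionSchema(fields, description="image embeddings (placeholder)")

if not utility.has_collection("image_search_demo"):
    coll = Collection("image_search_demo", schema)
    coll.create_index(
        field_name="embedding",
        index_params={"index_type": "IVF_FLAT", "metric_type": "L2", "params": {"nlist": 1024}},
    )
else:
    coll = Collection("image_search_demo")

coll.load()
results = coll.search(
    data=[[0.0] * 512],  # placeholder query embedding
    anns_field="embedding",
    param={"metric_type": "L2", "params": {"nprobe": 16}},
    limit=5,
)
```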

BTW, we suspect there may be a problem with one of the physical machines in the cluster. There is a querynode, two proxies, and one etcd replica on this machine. How can we restore the service without downtime? Can I directly delete the problematic querynode pod on this physical machine and let it be automatically scheduled onto the other two physical machines?

dushulin commented 8 months ago

The cause of the problem has been found: the route to a certain physical machine could not be found, which caused the connections between the multiple etcd replicas to fail. After the repair, the cluster returned to normal.

yanliang567 commented 8 months ago

Great to hear that. One thing I want to point out is that you only set the resource Limits instead of setting Requests equal to Limits in the configs. This can lead to lower and unstable performance due to resource competition, but it is okay if you are not sensitive to performance.

dushulin commented 8 months ago

> dataNode: replicas: 3 resources: limits: cpu: 2 memory: 16Gi

You mean add request resource settings, like:

dataNode:
  replicas: 3
  resources:
    limits:
      cpu: 2
      memory: 16Gi
    requests:
      cpu: 2
      memory: 16Gi

yanliang567 commented 8 months ago

> dataNode: replicas: 3 resources: limits: cpu: 2 memory: 16Gi
>
> You mean add request resource settings, like: dataNode: replicas: 3 resources: limits: cpu: 2 memory: 16Gi requests: cpu: 2 memory: 16Gi

exactly

dushulin commented 8 months ago

thanks