milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0

[Bug]: [benchmark][multi-replicas-loadbalance] In the scenario of multiple clients, high concurrency query and search, that api raise error #25618

Closed by wangting0128 1 year ago

wangting0128 commented 1 year ago

Is there an existing issue for this?

Environment

- Milvus version: master-20230713-ed3e4b0b
- Deployment mode (standalone or cluster): cluster
- MQ type (rocksmq, pulsar or kafka): pulsar
- SDK version (e.g. pymilvus v2.0.0rc2):
- OS (Ubuntu or CentOS):
- CPU/Memory:
- GPU:
- Others:

Current Behavior

server argo task: fouramf-dltq7
upgrade image: fouramf-h9fwl
client argo task: fouramf-concurrent-pjf5f

server:

NAME                                                              READY   STATUS             RESTARTS        AGE     IP              NODE         NOMINATED NODE   READINESS GATES
lb-helm-multi-hnsw-etcd-0                                         1/1     Running            0               31h     10.104.1.87     4am-node10   <none>           <none>
lb-helm-multi-hnsw-etcd-1                                         1/1     Running            0               31h     10.104.18.130   4am-node25   <none>           <none>
lb-helm-multi-hnsw-etcd-2                                         1/1     Running            0               31h     10.104.23.225   4am-node27   <none>           <none>
lb-helm-multi-hnsw-milvus-datacoord-68c885846d-m4xxh              1/1     Running            0               31h     10.104.13.75    4am-node16   <none>           <none>
lb-helm-multi-hnsw-milvus-datanode-7df65cd9d9-h74j4               1/1     Running            0               31h     10.104.13.72    4am-node16   <none>           <none>
lb-helm-multi-hnsw-milvus-datanode-7df65cd9d9-qw77h               1/1     Running            0               31h     10.104.12.222   4am-node17   <none>           <none>
lb-helm-multi-hnsw-milvus-indexcoord-595b6fc7f9-mspl7             1/1     Running            0               31h     10.104.13.74    4am-node16   <none>           <none>
lb-helm-multi-hnsw-milvus-indexnode-bb4c5fcd8-hxzbx               1/1     Running            0               31h     10.104.13.77    4am-node16   <none>           <none>
lb-helm-multi-hnsw-milvus-indexnode-bb4c5fcd8-l6p8s               1/1     Running            0               31h     10.104.4.42     4am-node11   <none>           <none>
lb-helm-multi-hnsw-milvus-proxy-6c59b46c47-n986z                  1/1     Running            0               31h     10.104.24.191   4am-node29   <none>           <none>
lb-helm-multi-hnsw-milvus-querycoord-5685ccdbbd-6gwm8             1/1     Running            0               31h     10.104.13.73    4am-node16   <none>           <none>
lb-helm-multi-hnsw-milvus-querynode-798ffb99f8-g4gxg              1/1     Running            0               31h     10.104.24.192   4am-node29   <none>           <none>
lb-helm-multi-hnsw-milvus-querynode-798ffb99f8-wrpdj              1/1     Running            0               31h     10.104.12.223   4am-node17   <none>           <none>
lb-helm-multi-hnsw-milvus-rootcoord-759777467b-l62p7              1/1     Running            0               31h     10.104.13.76    4am-node16   <none>           <none>
lb-helm-multi-hnsw-minio-0                                        1/1     Running            0               2d21h   10.104.18.144   4am-node25   <none>           <none>
lb-helm-multi-hnsw-minio-1                                        1/1     Running            0               2d21h   10.104.1.104    4am-node10   <none>           <none>
lb-helm-multi-hnsw-minio-2                                        1/1     Running            0               2d21h   10.104.23.50    4am-node27   <none>           <none>
lb-helm-multi-hnsw-minio-3                                        1/1     Running            0               2d21h   10.104.20.29    4am-node22   <none>           <none>
lb-helm-multi-hnsw-pulsar-bookie-0                                1/1     Running            0               2d21h   10.104.18.145   4am-node25   <none>           <none>
lb-helm-multi-hnsw-pulsar-bookie-1                                1/1     Running            0               2d21h   10.104.1.105    4am-node10   <none>           <none>
lb-helm-multi-hnsw-pulsar-bookie-2                                1/1     Running            0               2d21h   10.104.23.53    4am-node27   <none>           <none>
lb-helm-multi-hnsw-pulsar-broker-0                                1/1     Running            0               2d21h   10.104.18.134   4am-node25   <none>           <none>
lb-helm-multi-hnsw-pulsar-proxy-0                                 1/1     Running            0               2d21h   10.104.18.133   4am-node25   <none>           <none>
lb-helm-multi-hnsw-pulsar-recovery-0                              1/1     Running            0               2d21h   10.104.16.165   4am-node21   <none>           <none>
lb-helm-multi-hnsw-pulsar-zookeeper-0                             1/1     Running            0               2d21h   10.104.16.168   4am-node21   <none>           <none>
lb-helm-multi-hnsw-pulsar-zookeeper-1                             1/1     Running            0               2d21h   10.104.1.107    4am-node10   <none>           <none>
lb-helm-multi-hnsw-pulsar-zookeeper-2                             1/1     Running            0               2d21h   10.104.21.135   4am-node24   <none>           <none>

client logs: clients.log

Screenshot 2023-07-14 18 55 11
Screenshot 2023-07-14 19 02 16

Expected Behavior

No response

Steps To Reproduce

1. Deploy a Milvus cluster with 2 queryNodes.
2. Run 10 concurrent clients of 2 types, replica=1 and replica=2, with 5 clients of each type. Each client:
   a. creates a collection with shard_num=2
   b. inserts 5m vectors and builds an HNSW index
   c. loads the collection with replica=1 or 2
   d. runs concurrent query and search via locust <- raises error
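
For readers who want to approximate these steps outside the benchmark framework, here is a minimal pymilvus sketch. It is not the fouramf client used in this issue: the collection names, the field names `id` and `float_vector`, the host and port, and the use of random vectors in place of the sift dataset are all assumptions made for illustration. The index and search parameters are taken from the client configs in this issue.

```python
import random

DIM = 128  # sift vector dimension, per the client configs in this issue


def build_client_plan(num_per_type=5, replica_numbers=(1, 2)):
    """One entry per client: 5 clients with replica=1 and 5 with replica=2."""
    return [
        # Collection names are hypothetical.
        {"collection_name": f"lb_repro_r{r}_{i}", "replica_number": r, "shards_num": 2}
        for r in replica_numbers
        for i in range(num_per_type)
    ]


def prepare_collection(plan, host="127.0.0.1", port="19530",
                       num_rows=5_000_000, batch=50_000):
    """Steps a-c for one client. Needs pymilvus and a reachable Milvus cluster."""
    from pymilvus import (Collection, CollectionSchema, DataType,
                          FieldSchema, connections)

    connections.connect(alias="default", host=host, port=port)
    schema = CollectionSchema([
        FieldSchema("id", DataType.INT64, is_primary=True),            # assumed name
        FieldSchema("float_vector", DataType.FLOAT_VECTOR, dim=DIM),   # assumed name
    ])
    # a. create a collection with 2 shards
    coll = Collection(plan["collection_name"], schema=schema,
                      shards_num=plan["shards_num"])
    # b. insert data in batches (random vectors stand in for sift), build HNSW
    for start in range(0, num_rows, batch):
        ids = list(range(start, min(start + batch, num_rows)))
        vectors = [[random.random() for _ in range(DIM)] for _ in ids]
        coll.insert([ids, vectors])
    coll.flush()
    coll.create_index("float_vector", {
        "index_type": "HNSW",
        "metric_type": "L2",
        "params": {"M": 8, "efConstruction": 200},
    })
    # c. load with 1 or 2 replicas
    coll.load(replica_number=plan["replica_number"])
    return coll


def search_and_query_once(coll, nq=10000):
    """Step d, a single iteration; the real test drives this concurrently via locust."""
    vectors = [[random.random() for _ in range(DIM)] for _ in range(nq)]
    coll.search(data=vectors, anns_field="float_vector",
                param={"metric_type": "L2", "params": {"ef": 64}},
                limit=10, timeout=60)
    coll.query(expr="id in [1, 100, 1000]")
```

Calling `build_client_plan()` yields the 10 client definitions; each one would then be passed to `prepare_collection` and driven by many concurrent callers of `search_and_query_once`.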

Milvus Log

No response

Anything else?

fouramf-server-lb-2qn-2dn-large:

queryNode:
  resources:
    limits:
      cpu: '50.0'
      memory: 100Gi
    requests:
      cpu: '25.0'
      memory: 50Gi
  replicas: 2
indexNode:
  resources:
    limits:
      cpu: '8.0'
      memory: 8Gi
    requests:
      cpu: '5.0'
      memory: 5Gi
  replicas: 2
dataNode:
  resources:
    limits:
      cpu: '2.0'
      memory: 16Gi
    requests:
      cpu: '2.0'
      memory: 2Gi
  replicas: 2
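
As a quick sanity check on cluster sizing, the values above can be summed per component. This is only arithmetic over the three components listed here; any component not shown in this config is left out, and the dictionary layout is just an illustration.

```python
# (cpu cores, memory in Gi) per pod, copied from the values above
COMPONENTS = {
    "queryNode": {"replicas": 2, "limits": (50.0, 100), "requests": (25.0, 50)},
    "indexNode": {"replicas": 2, "limits": (8.0, 8), "requests": (5.0, 5)},
    "dataNode": {"replicas": 2, "limits": (2.0, 16), "requests": (2.0, 2)},
}


def cluster_total(kind):
    """Sum cpu and memory across all pods for 'limits' or 'requests'."""
    cpu = sum(c["replicas"] * c[kind][0] for c in COMPONENTS.values())
    mem = sum(c["replicas"] * c[kind][1] for c in COMPONENTS.values())
    return cpu, mem


print(cluster_total("requests"))  # (64.0, 114)
print(cluster_total("limits"))    # (120.0, 248)
```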

fouramf-client-sift-hnsw-replica1-shard2-nq1-search-query-high:

    load_params:
      replica_number: 1
    collection_params:
      shards_num: 2
    dataset_params:
      dim: 128
      dataset_name: sift
      dataset_size: 5m
      ni_per: 50000
      metric_type: L2
    index_params:
      index_type: HNSW
      index_param:
        M: 8
        efConstruction: 200
    concurrent_params:
      concurrent_number: 100
      during_time: 2h
      interval: 20
    concurrent_tasks:
      - type: query
        weight: 1
        params:
          ids: [1, 100, 1000]
      - type: search
        weight: 1
        params:
          nq: 10000
          top_k: 10
          search_param:
            ef: 64
          timeout: 60
          random_data: true

fouramf-client-sift-hnsw-replica2-shard2-nq1-search-query-high:

    load_params:
      replica_number: 2
    collection_params:
      shards_num: 2
    dataset_params:
      dim: 128
      dataset_name: sift
      dataset_size: 5m
      ni_per: 50000
      metric_type: L2
    index_params:
      index_type: HNSW
      index_param:
        M: 8
        efConstruction: 200
    concurrent_params:
      concurrent_number: 100
      during_time: 2h
      interval: 20
    concurrent_tasks:
      - type: query
        weight: 1
        params:
          ids: [1, 100, 1000]
      - type: search
        weight: 1
        params:
          nq: 10000
          top_k: 10
          search_param:
            ef: 64
          timeout: 60
          random_data: true
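
To give a sense of the load these two configs generate, the sketch below estimates the size of one search request and shows one plausible reading of the `weight` fields. It assumes float32 vectors (4 bytes per component) and that weights are relative selection frequencies, as with Locust task weights; how the benchmark client actually interprets them is not shown in this issue. Note that the config names say `nq1` while the params set `nq: 10000`; the sketch uses the params value.

```python
import random

# (task type, weight), copied from concurrent_tasks above
TASKS = [("query", 1), ("search", 1)]


def pick_task(rng):
    """Pick the next task type in proportion to its weight (assumed semantics)."""
    names, weights = zip(*TASKS)
    return rng.choices(names, weights=weights, k=1)[0]


def search_payload_bytes(nq=10000, dim=128, bytes_per_float=4):
    """Raw vector bytes sent in one search request, assuming float32 data."""
    return nq * dim * bytes_per_float


print(search_payload_bytes())  # 5120000 bytes, about 5.12 MB per search
```

With equal weights, roughly half of the requests from each of the 100 concurrent users per client would be searches of this size, under the assumptions stated above.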

yanliang567 commented 1 year ago

/unassign

xiaofan-luan commented 1 year ago

/assign @bigsheeper

please help on investigating

weiliu1031 commented 1 year ago

Same root cause as https://github.com/milvus-io/milvus/issues/25558#issuecomment-1637326391

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.

wangting0128 commented 1 year ago

Reproduced again.

image: master-20230822-9131a0aa
argo task: fouramf-server-client-concurrent-7gdcb

server:

NAME                                                              READY   STATUS      RESTARTS        AGE     IP              NODE         NOMINATED NODE   READINESS GATES
lb-helm-multi-hnsw-etcd-0                                         1/1     Running     0               150m    10.104.21.144   4am-node24   <none>           <none>
lb-helm-multi-hnsw-etcd-1                                         1/1     Running     0               150m    10.104.19.168   4am-node28   <none>           <none>
lb-helm-multi-hnsw-etcd-2                                         1/1     Running     0               150m    10.104.17.91    4am-node23   <none>           <none>
lb-helm-multi-hnsw-milvus-datacoord-746d859f9c-fvnlm              1/1     Running     0               150m    10.104.23.126   4am-node27   <none>           <none>
lb-helm-multi-hnsw-milvus-datanode-5c5dd7c644-glx8b               1/1     Running     0               150m    10.104.21.139   4am-node24   <none>           <none>
lb-helm-multi-hnsw-milvus-datanode-5c5dd7c644-wcflf               1/1     Running     0               150m    10.104.23.129   4am-node27   <none>           <none>
lb-helm-multi-hnsw-milvus-indexcoord-765c679c5d-mwvps             1/1     Running     0               150m    10.104.23.125   4am-node27   <none>           <none>
lb-helm-multi-hnsw-milvus-indexnode-76f549d5bd-8fhgn              1/1     Running     0               150m    10.104.21.140   4am-node24   <none>           <none>
lb-helm-multi-hnsw-milvus-indexnode-76f549d5bd-bbdx2              1/1     Running     0               150m    10.104.9.182    4am-node14   <none>           <none>
lb-helm-multi-hnsw-milvus-proxy-868d597b5b-t5wf2                  1/1     Running     0               150m    10.104.9.180    4am-node14   <none>           <none>
lb-helm-multi-hnsw-milvus-querycoord-655789c645-vmkcv             1/1     Running     0               150m    10.104.23.127   4am-node27   <none>           <none>
lb-helm-multi-hnsw-milvus-querynode-64bf748765-thc5n              1/1     Running     0               150m    10.104.9.184    4am-node14   <none>           <none>
lb-helm-multi-hnsw-milvus-querynode-64bf748765-vwdm2              1/1     Running     0               150m    10.104.23.130   4am-node27   <none>           <none>
lb-helm-multi-hnsw-milvus-rootcoord-676cfd578f-n9t7j              1/1     Running     0               150m    10.104.23.128   4am-node27   <none>           <none>
lb-helm-multi-hnsw-minio-0                                        1/1     Running     0               150m    10.104.23.131   4am-node27   <none>           <none>
lb-helm-multi-hnsw-minio-1                                        1/1     Running     0               150m    10.104.21.143   4am-node24   <none>           <none>
lb-helm-multi-hnsw-minio-2                                        1/1     Running     0               150m    10.104.19.170   4am-node28   <none>           <none>
lb-helm-multi-hnsw-minio-3                                        1/1     Running     0               150m    10.104.17.86    4am-node23   <none>           <none>
lb-helm-multi-hnsw-pulsar-bookie-0                                1/1     Running     0               150m    10.104.21.147   4am-node24   <none>           <none>
lb-helm-multi-hnsw-pulsar-bookie-1                                1/1     Running     0               150m    10.104.19.171   4am-node28   <none>           <none>
lb-helm-multi-hnsw-pulsar-bookie-2                                1/1     Running     0               150m    10.104.17.92    4am-node23   <none>           <none>
lb-helm-multi-hnsw-pulsar-bookie-init-q44gb                       0/1     Completed   0               150m    10.104.9.181    4am-node14   <none>           <none>
lb-helm-multi-hnsw-pulsar-broker-0                                1/1     Running     0               150m    10.104.23.121   4am-node27   <none>           <none>
lb-helm-multi-hnsw-pulsar-proxy-0                                 1/1     Running     0               150m    10.104.23.122   4am-node27   <none>           <none>
lb-helm-multi-hnsw-pulsar-pulsar-init-8szkm                       0/1     Completed   0               150m    10.104.23.119   4am-node27   <none>           <none>
lb-helm-multi-hnsw-pulsar-recovery-0                              1/1     Running     0               150m    10.104.23.120   4am-node27   <none>           <none>
lb-helm-multi-hnsw-pulsar-zookeeper-0                             1/1     Running     0               150m    10.104.23.132   4am-node27   <none>           <none>
lb-helm-multi-hnsw-pulsar-zookeeper-1                             1/1     Running     0               149m    10.104.21.150   4am-node24   <none>           <none>
lb-helm-multi-hnsw-pulsar-zookeeper-2                             1/1     Running     0               148m    10.104.17.94    4am-node23   <none>           <none>

Screenshot 2023-08-23 10 51 52

client logs: clients.log

Screenshot 2023-08-23 10 47 55

weiliu1031 commented 1 year ago

Same as https://github.com/milvus-io/milvus/issues/25558#issuecomment-1689205279

wangting0128 commented 1 year ago

Verification passed

image tag: master-20230830-a8e5dc35
argo task: fouramf-server-client-concurrent-xxk6k

Screenshot 2023-08-31 10 44 26