milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0

[Bug]: [benchmark][multi-replicas-loadbalance] The number of segments is lower than before after transferring a replica #25518

Closed: wangting0128 closed this issue 1 year ago

wangting0128 commented 1 year ago

Is there an existing issue for this?

Environment

- Milvus version: master-20230711-96c987ed
- Deployment mode (standalone or cluster): cluster
- MQ type (rocksmq, pulsar or kafka): pulsar
- SDK version (e.g. pymilvus v2.0.0rc2): 2.4.0.dev73
- OS (Ubuntu or CentOS):
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

- server argo task: fouramf-8tm52
- concurrent 2 clients argo task: fouramf-concurrent-2qhj6
- transfer replica argo task: fouramf-pfbbf

server:

NAME                                                              READY   STATUS             RESTARTS        AGE     IP              NODE         NOMINATED NODE   READINESS GATES
lb-helm-go-rg3-transfer-etcd-0                                    1/1     Running            0               18h     10.104.17.46    4am-node23   <none>           <none>
lb-helm-go-rg3-transfer-etcd-1                                    1/1     Running            0               18h     10.104.1.61     4am-node10   <none>           <none>
lb-helm-go-rg3-transfer-etcd-2                                    1/1     Running            0               18h     10.104.24.70    4am-node29   <none>           <none>
lb-helm-go-rg3-transfer-milvus-datacoord-75475fb96d-4fcnk         1/1     Running            0               18h     10.104.22.254   4am-node26   <none>           <none>
lb-helm-go-rg3-transfer-milvus-datanode-787f57fd7d-7fgpj          1/1     Running            0               18h     10.104.22.246   4am-node26   <none>           <none>
lb-helm-go-rg3-transfer-milvus-indexcoord-845756c796-khwzr        1/1     Running            0               18h     10.104.22.253   4am-node26   <none>           <none>
lb-helm-go-rg3-transfer-milvus-indexnode-dc7c879b9-pslrs          1/1     Running            0               18h     10.104.20.89    4am-node22   <none>           <none>
lb-helm-go-rg3-transfer-milvus-proxy-dc5c84888-b8wcm              1/1     Running            0               18h     10.104.17.40    4am-node23   <none>           <none>
lb-helm-go-rg3-transfer-milvus-querycoord-8f4bfd447-x2mh8         1/1     Running            0               18h     10.104.22.248   4am-node26   <none>           <none>
lb-helm-go-rg3-transfer-milvus-querynode-59c74696b-9cc54          1/1     Running            0               18h     10.104.24.73    4am-node29   <none>           <none>
lb-helm-go-rg3-transfer-milvus-querynode-59c74696b-gppck          1/1     Running            0               18h     10.104.15.37    4am-node20   <none>           <none>
lb-helm-go-rg3-transfer-milvus-querynode-59c74696b-pzxv7          1/1     Running            0               18h     10.104.6.83     4am-node13   <none>           <none>
lb-helm-go-rg3-transfer-milvus-rootcoord-58c7df765f-zltht         1/1     Running            0               18h     10.104.22.5     4am-node26   <none>           <none>
lb-helm-go-rg3-transfer-minio-0                                   1/1     Running            0               18h     10.104.23.91    4am-node27   <none>           <none>
lb-helm-go-rg3-transfer-minio-1                                   1/1     Running            0               18h     10.104.24.68    4am-node29   <none>           <none>
lb-helm-go-rg3-transfer-minio-2                                   1/1     Running            0               18h     10.104.17.47    4am-node23   <none>           <none>
lb-helm-go-rg3-transfer-minio-3                                   1/1     Running            0               18h     10.104.21.194   4am-node24   <none>           <none>
lb-helm-go-rg3-transfer-pulsar-bookie-0                           1/1     Running            0               18h     10.104.23.94    4am-node27   <none>           <none>
lb-helm-go-rg3-transfer-pulsar-bookie-1                           1/1     Running            0               18h     10.104.5.64     4am-node12   <none>           <none>
lb-helm-go-rg3-transfer-pulsar-bookie-2                           1/1     Running            0               18h     10.104.22.10    4am-node26   <none>           <none>
lb-helm-go-rg3-transfer-pulsar-bookie-init-rhwls                  0/1     Completed          0               18h     10.104.22.247   4am-node26   <none>           <none>
lb-helm-go-rg3-transfer-pulsar-broker-0                           1/1     Running            0               18h     10.104.15.33    4am-node20   <none>           <none>
lb-helm-go-rg3-transfer-pulsar-proxy-0                            1/1     Running            0               18h     10.104.21.192   4am-node24   <none>           <none>
lb-helm-go-rg3-transfer-pulsar-pulsar-init-g2wt2                  0/1     Completed          0               18h     10.104.22.251   4am-node26   <none>           <none>
lb-helm-go-rg3-transfer-pulsar-recovery-0                         1/1     Running            0               18h     10.104.4.25     4am-node11   <none>           <none>
lb-helm-go-rg3-transfer-pulsar-zookeeper-0                        1/1     Running            0               18h     10.104.17.43    4am-node23   <none>           <none>
lb-helm-go-rg3-transfer-pulsar-zookeeper-1                        1/1     Running            0               18h     10.104.24.72    4am-node29   <none>           <none>
lb-helm-go-rg3-transfer-pulsar-zookeeper-2                        1/1     Running            0               18h     10.104.19.205   4am-node28   <none>           <none>
[Screenshot 2023-07-12 12:07:20]

client:

[Screenshot 2023-07-12 12:07:04] [Screenshot 2023-07-12 12:07:52]


Expected Behavior

No response

Steps To Reproduce

1. Deploy a Milvus cluster with 3 queryNodes.
2. Run 2 clients concurrently:

    config: fouramf-client-sift-rg-replica1-shard1-ivfsq8-go-search-nq1000
    a. Create collection RG_1 with shards=1, insert 10M vectors, build an IVF_SQ8 index, load with RG=['RG_1'] and replica=1, then run concurrent search for 10h.

    config: fouramf-client-sift-rg2-replica1-shard1-ivfsq8-go-search-nq1
    a. Create collection RG_2 with shards=1, insert 10M vectors, build an IVF_SQ8 index, load with RG=['RG_2', 'RG_3'] and replica=2, then run concurrent search for 2h.

3. Transfer the replica of collection RG_2 from resource group RG_2 to resource group RG_1, then run concurrent search for 2h  <- segments are lost here
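
The symptom in step 3 can be checked mechanically by comparing what each replica serves before and after the transfer. This is a minimal sketch, not the benchmark's actual check: it assumes a hypothetical snapshot mapping each replica label to the set of sealed segment IDs it serves, and it assumes every replica of a fully loaded collection should serve every sealed segment. The helper name `missing_segments` is illustrative only.

```python
from typing import Dict, Set

# Hypothetical snapshot shape: replica label -> set of sealed segment IDs it serves.
Snapshot = Dict[str, Set[int]]


def missing_segments(before: Snapshot, after: Snapshot) -> Dict[str, Set[int]]:
    """Report, per replica, segment IDs that should be served after the transfer but are not.

    Assumption: every replica of a fully loaded collection serves every sealed
    segment, so the expected set is the union of all segments seen before.
    """
    expected: Set[int] = set().union(*before.values())
    report: Dict[str, Set[int]] = {}
    # Check replicas present on either side, so a replica that vanished is reported too.
    for replica in sorted(before.keys() | after.keys()):
        lost = expected - after.get(replica, set())
        if lost:
            report[replica] = lost
    return report


if __name__ == "__main__":
    before = {"replica-a": {101, 102, 103}, "replica-b": {101, 102, 103}}
    after = {"replica-a": {101, 102, 103}, "replica-b": {101, 103}}
    print(missing_segments(before, after))  # {'replica-b': {102}}
```

An empty result means no replica lost segments during the transfer.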

Milvus Log

No response

Anything else?

fouramf-server-lb-3qn:

    queryNode:
      resources:
        limits:
          cpu: '16.0'
          memory: 16Gi
        requests:
          cpu: '8.0'
          memory: 4Gi
      replicas: 3
    indexNode:
      resources:
        limits:
          cpu: '8.0'
          memory: 8Gi
        requests:
          cpu: '5.0'
          memory: 5Gi
      replicas: 1
    dataNode:
      resources:
        limits:
          cpu: '2.0'
          memory: 2Gi
        requests:
          cpu: '2.0'
          memory: 2Gi

fouramf-client-sift-rg-replica1-shard1-ivfsq8-go-search-nq1000:

    resource_groups_params:
      reset: false
      groups:
        transfer_nodes:
          - source: '__default_resource_group'
            target: 'RG_1'
            num_node: 1
    load_params:
      replica_number: 1
      _resource_groups: ['RG_1']
    collection_params:
      collection_name: 'RG_1'
      shards_num: 1
    dataset_params:
      dim: 128
      dataset_name: sift
      dataset_size: 10m
      ni_per: 50000
      metric_type: L2
    search_params:
      top_k:
        - 10
      nq:
        - 1000
      search_param:
        nprobe:
          - 64
    index_params:
      index_type: IVF_SQ8
      index_param:
        nlist: 2048
    go_search_params:
      concurrent_number: 100
      during_time: 10h
      interval: 20

fouramf-client-sift-rg2-replica1-shard1-ivfsq8-go-search-nq1:

    resource_groups_params:
      reset: false
      groups:
        transfer_nodes:
          - source: '__default_resource_group'
            target: 'RG_2'
            num_node: 1
          - source: '__default_resource_group'
            target: 'RG_3'
            num_node: 1
    load_params:
      replica_number: 2
      _resource_groups: ['RG_2', 'RG_3']
    collection_params:
      collection_name: 'RG_2'
      shards_num: 1
    dataset_params:
      dim: 128
      dataset_name: sift
      dataset_size: 10m
      ni_per: 50000
      metric_type: L2
    search_params:
      top_k:
        - 1
      nq:
        - 1
      search_param:
        nprobe:
          - 64
    index_params:
      index_type: IVF_SQ8
      index_param:
        nlist: 2048
    go_search_params:
      concurrent_number: 3
      during_time: 2h
      interval: 20
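
With 3 queryNodes, the two `transfer_nodes` sections above each move one node out of `__default_resource_group`, so every query node ends up in a named resource group (RG_1, RG_2, RG_3) and the default group is left empty. A small sketch that tallies this, assuming only node counts matter; `apply_transfer_nodes` is a hypothetical helper, not part of Milvus or pymilvus.

```python
from typing import Dict, List

DEFAULT_RG = "__default_resource_group"


def apply_transfer_nodes(groups: Dict[str, int], transfers: List[dict]) -> Dict[str, int]:
    """Return new node counts per resource group after applying transfer_nodes entries.

    Only counts are modelled; which physical query node moves is ignored.
    """
    groups = dict(groups)  # do not mutate the caller's mapping
    for t in transfers:
        src, dst, n = t["source"], t["target"], t["num_node"]
        available = groups.get(src, 0)
        if available < n:
            raise ValueError(f"{src} has {available} node(s), cannot move {n}")
        groups[src] = available - n
        groups[dst] = groups.get(dst, 0) + n
    return groups


if __name__ == "__main__":
    cluster = {DEFAULT_RG: 3}  # 3 query nodes, all in the default group at start
    # transfer_nodes from the RG_1 client config
    cluster = apply_transfer_nodes(cluster, [{"source": DEFAULT_RG, "target": "RG_1", "num_node": 1}])
    # transfer_nodes from the RG_2 client config
    cluster = apply_transfer_nodes(cluster, [
        {"source": DEFAULT_RG, "target": "RG_2", "num_node": 1},
        {"source": DEFAULT_RG, "target": "RG_3", "num_node": 1},
    ])
    print(cluster)  # {'__default_resource_group': 0, 'RG_1': 1, 'RG_2': 1, 'RG_3': 1}
```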

transfer replica fouramf-client-sift-replica-rg2-transfer-rg1:

    resource_groups_params:
      reset: false
      groups:
        transfer_replicas:
          - source: 'RG_2'
            target: 'RG_1'
            collection_name: 'RG_2'
            num_replica: 1
    load_params:
      replica_number: 2
      _resource_groups: ['RG_1', 'RG_3']
    collection_params:
      collection_name: 'RG_2'
      shards_num: 1
    dataset_params:
      dim: 128
      dataset_name: sift
      dataset_size: 10m
      ni_per: 50000
      metric_type: L2
    search_params:
      top_k:
        - 1
      nq:
        - 1
      search_param:
        nprobe:
          - 64
    index_params:
      index_type: IVF_SQ8
      index_param:
        nlist: 2048
    go_search_params:
      concurrent_number: 3
      during_time: 2h
      interval: 20
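
The `transfer_replicas` entry moves one replica of collection RG_2 from resource group RG_2 to RG_1, which is why `_resource_groups` becomes ['RG_1', 'RG_3'] in this config; afterwards the RG_1 node hosts the single replica of collection RG_1 as well as one replica of collection RG_2. The sketch below tracks only replica counts per resource group; `apply_transfer_replicas` and the placement dictionary are hypothetical illustrations, not the Milvus implementation.

```python
from typing import Dict, List

# Hypothetical placement shape: collection -> {resource group: number of replicas}.
Placement = Dict[str, Dict[str, int]]


def apply_transfer_replicas(placement: Placement, transfers: List[dict]) -> Placement:
    """Return a new placement after applying transfer_replicas entries."""
    placement = {coll: dict(rgs) for coll, rgs in placement.items()}  # copy each inner dict
    for t in transfers:
        rgs = placement[t["collection_name"]]
        src, dst, n = t["source"], t["target"], t["num_replica"]
        if n < 1 or rgs.get(src, 0) < n:
            raise ValueError(f"cannot move {n} replica(s) of {t['collection_name']} out of {src}")
        rgs[src] -= n
        if rgs[src] == 0:
            del rgs[src]  # the source group no longer hosts this collection
        rgs[dst] = rgs.get(dst, 0) + n
    return placement


if __name__ == "__main__":
    placement = {"RG_1": {"RG_1": 1}, "RG_2": {"RG_2": 1, "RG_3": 1}}
    moved = apply_transfer_replicas(placement, [
        {"source": "RG_2", "target": "RG_1", "collection_name": "RG_2", "num_replica": 1},
    ])
    print(moved)                  # {'RG_1': {'RG_1': 1}, 'RG_2': {'RG_3': 1, 'RG_1': 1}}
    print(sorted(moved["RG_2"]))  # ['RG_1', 'RG_3']
```

This does not reproduce the bug; it only makes explicit which resource group should serve each replica, so a segment-count comparison can be scoped per replica.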
yanliang567 commented 1 year ago

/unassign

weiliu1031 commented 1 year ago

Same as https://github.com/milvus-io/milvus/issues/25679#issuecomment-1640000287

wangting0128 commented 1 year ago

Verified with image master-20230719-e418ab2f:
- server argo task: fouramf-pt6ls
- clients argo task: fouramf-concurrent-xkbjg
- RG_2 transfer to RG_1 argo task: fouramf-mtvpg
- RG_1 transfer to RG_2 argo task: fouramf-5sxmm

[Screenshot 2023-07-21 12:08:04] [Screenshot 2023-07-21 12:07:56] [Screenshot 2023-07-21 12:07:47]
wangting0128 commented 1 year ago

Verified as above.