milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0

[Bug]: [ResourceGroup] Search fails even though there are replica available #22231

ThreadDao closed this issue 1 year ago

ThreadDao commented 1 year ago

Is there an existing issue for this?

Environment

- Milvus version: master-20230215-9346f175
- Deployment mode(standalone or cluster): cluster
- MQ type(rocksmq, pulsar or kafka): pulsar
- SDK version(e.g. pymilvus v2.0.0rc2): pymilvus-2.3.0.dev34 
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

There are two replicas of this collection across two resource groups. One replica is unavailable because a querynode was scaled in and the remaining node has insufficient memory; the other replica is available. Nevertheless, the search returns an exception:

[2023-02-16 09:06:53,206 - ERROR - fouram]: RPC error: [search], <MilvusException: (code=1, message=fail to search on all shard leaders, err=fail to Search, QueryNode ID=4, reason=Search 8 failed, reason QueryNode 42 can't serve, recovering: target node id not match target id = 8, node id = 42 err %!w(<nil>))>, <Time:{'RPC start': '2023-02-16 09:06:53.173397', 'RPC error': '2023-02-16 09:06:53.206773'}> (decorators.py:108)
  1. deploy a cluster with 4 querynodes
  2. create collection and insert 2m data, create index
  3. create a resource group RG_0 and transfer 2 querynodes into RG_0 from default RG
  4. load collection with 2 replicas into 2 RGs
    collection.load(collName, replica_number=2, _resource_groups=['RG_0', '__default_resource_group'])
  5. search successfully
  6. Scale in 1 querynode (only 3 querynodes left)
  7. The RG with the offline querynode cannot balance its segments to the online querynode due to insufficient memory, so one of the replicas is unavailable.
  8. Search fails (a pymilvus sketch of these steps appears below)
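For reference, a minimal pymilvus sketch of steps 3-5 and 8 (the connection endpoint, the collection name `collName`, the vector field name and the query vector are placeholders; exact signatures may differ across pymilvus 2.3 dev builds):

from pymilvus import Collection, connections, utility

connections.connect(host="127.0.0.1", port="19530")  # assumed endpoint

# step 3: create RG_0 and move 2 querynodes into it from the default resource group
utility.create_resource_group("RG_0")
utility.transfer_node("__default_resource_group", "RG_0", 2)

# step 4: load 2 replicas, one per resource group
collection = Collection("collName")
collection.load(replica_number=2, _resource_groups=["RG_0", "__default_resource_group"])

# steps 5/8: the same search succeeds before the scale-in and fails after it
collection.search(
    data=[[0.0] * 128],                                      # placeholder query vector, dim=128
    anns_field="float_vector",                               # assumed vector field name
    param={"metric_type": "L2", "params": {"nprobe": 16}},
    limit=10,
)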

Expected Behavior

The search is expected to succeed because one of the replicas is available.

Steps To Reproduce

Argo workflow name: fouram-scale-9svxt
Run the case test_scale_in_qns_memory in the fouram repo; contact me for specific questions.

Milvus Log

You can search for logs in Loki. If you want the raw log file for some components, tell me. Cluster: devops, namespace: chaos-testing. Server pods:

fouram-scale-9svxt-op-63-4055-milvus-datacoord-5fb48b7cd7-qmbwj     Running     0            2m      10.102.7.213      devops-node11     
fouram-scale-9svxt-op-63-4055-milvus-datanode-9b689864-pktwt        Running     0            2m      10.102.10.15      devops-node20     
fouram-scale-9svxt-op-63-4055-milvus-indexcoord-97cd4888f-7tb5x     Running     0            2m      10.102.7.215      devops-node11     
fouram-scale-9svxt-op-63-4055-milvus-indexnode-5f5f694dd4-dv89k     Running     0            2m      10.102.7.214      devops-node11     
fouram-scale-9svxt-op-63-4055-milvus-indexnode-5f5f694dd4-vwlkc     Running     0            2m      10.102.9.13       devops-node13     
fouram-scale-9svxt-op-63-4055-milvus-proxy-758b57df58-9vbz2         Running     0            2m      10.102.9.12       devops-node13     
fouram-scale-9svxt-op-63-4055-milvus-querycoord-7c7b75765ccvnr9     Running     0            2m      10.102.5.153      devops-node21     
fouram-scale-9svxt-op-63-4055-milvus-querynode-77c44d5ff6-2qlwx     Running     0            2m      10.102.5.156      devops-node21     
fouram-scale-9svxt-op-63-4055-milvus-querynode-77c44d5ff6-888g8     Running     0            2m      10.102.7.216      devops-node11     
fouram-scale-9svxt-op-63-4055-milvus-querynode-77c44d5ff6-bzrkq     Running     0            2m      10.102.6.204      devops-node10     
fouram-scale-9svxt-op-63-4055-milvus-querynode-77c44d5ff6-jrklj     Running     0            2m      10.102.10.16      devops-node20     
fouram-scale-9svxt-op-63-4055-milvus-rootcoord-5c766bb9dc-krj7r     Running     0            2m      10.102.7.212      devops-node11     
fouram-scale-9svxt-op-63-4055-etcd-0                                Running     0            6m      10.102.5.138      devops-node21     
fouram-scale-9svxt-op-63-4055-etcd-1                                Running     0            6m      10.102.7.198      devops-node11     
fouram-scale-9svxt-op-63-4055-etcd-2                                Running     0            6m      10.102.10.3       devops-node20     
fouram-scale-9svxt-op-63-4055-pulsar-bookie-0                       Running     0            6m      10.102.7.203      devops-node11     
fouram-scale-9svxt-op-63-4055-pulsar-bookie-1                       Running     0            6m      10.102.6.200      devops-node10     
fouram-scale-9svxt-op-63-4055-pulsar-bookie-2                       Running     0            6m      10.102.5.147      devops-node21     
fouram-scale-9svxt-op-63-4055-pulsar-broker-0                       Running     0            6m      10.102.5.131      devops-node21     
fouram-scale-9svxt-op-63-4055-pulsar-proxy-0                        Running     0            6m      10.102.10.254     devops-node20     
fouram-scale-9svxt-op-63-4055-pulsar-recovery-0                     Running     0            6m      10.102.5.130      devops-node21     
fouram-scale-9svxt-op-63-4055-pulsar-zookeeper-0                    Running     0            6m      10.102.5.136      devops-node21     
fouram-scale-9svxt-op-63-4055-pulsar-zookeeper-1                    Running     0            6m      10.102.7.209      devops-node11     
fouram-scale-9svxt-op-63-4055-pulsar-zookeeper-2                    Running     0            5m      10.102.9.11       devops-node13     
fouram-scale-9svxt-op-63-4055-minio-0                               Running     0            6m      10.102.5.137      devops-node21     
fouram-scale-9svxt-op-63-4055-minio-1                               Running     0            6m      10.102.10.5       devops-node20     
fouram-scale-9svxt-op-63-4055-minio-2                               Running     0            6m      10.102.7.207      devops-node11     
fouram-scale-9svxt-op-63-4055-minio-3                               Running     0            6m      10.102.9.8        devops-node13     

Anything else?

No response

yanliang567 commented 1 year ago

/assign @weiliu1031
/unassign

ThreadDao commented 1 year ago

Update: image: master-20230218-57f8de95, pymilvus: 2.3.0.dev38

argo workflow: fouram-scale-55qfb-wkltr

Test steps:

  1. Deploy a cluster with 4 querynodes and the following resource config:
      resources:
        limits:
          cpu: "1"
          memory: 3Gi
        requests:
          cpu: "1"
          memory: 2Gi
  2. Create resource group RG_0 and transfer 2 nodes into it, then create RG_1 and transfer 2 nodes into it
  3. Create a collection and build index {'index_type': 'IVF_FLAT', 'metric_type': 'L2', 'params': {'nlist': 512}}
  4. Insert 2m entities
  5. Load 2 replicas into [RG_0, RG_1]
  6. Search 100 times successfully
  7. Scale in one querynode so that the querynode deployment has 3 replicas
  8. Load and search after Milvus is healthy (see the sketch after this list)
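A minimal sketch of how the replica placement shown below can be dumped with pymilvus (the collection name is the one from the get_query_segment_info call further down; the printed form is simply the library's repr of the replica info):

from pymilvus import Collection

collection = Collection("fouram_l9mj3l2F")
replicas = collection.get_replicas()
print(replicas)  # prints the replica groups, shard leaders and shard nodes as dumped below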

According to the above steps, one replica should remain available. The other replica's querynode will be OOMKilled due to the scale-in and insufficient memory.

Before the scale-in, one replica is on qn-6 and qn-5, and the other replica is on qn-2 and qn-10.

- Group: <group_id:439581689488343041>, <group_nodes:(6, 5)>, <shards:[Shard: <channel_name:fouram-scale-55qfb-wkltr-op-97-2917-rootcoord-dml_0_439581687858069512v0>, <shard_leader:6>, <shard_nodes:[6, 6, 6, 5, 5]>, Shard: <channel_name:fouram-scale-55qfb-wkltr-op-97-2917-rootcoord-dml_1_439581687858069512v1>, <shard_leader:5>, <shard_nodes:[5, 6, 6, 5, 5]>]>
- Group: <group_id:439581689488343042>, <group_nodes:(2, 10)>, <shards:[Shard: <channel_name:fouram-scale-55qfb-wkltr-op-97-2917-rootcoord-dml_0_439581687858069512v0>, <shard_leader:2>, <shard_nodes:[2, 2, 2, 10, 10]>, Shard: <channel_name:fouram-scale-55qfb-wkltr-op-97-2917-rootcoord-dml_1_439581687858069512v1>, <shard_leader:10>, <shard_nodes:[10, 2, 2, 10, 10]>]> (base.py:448)
[2023-02-20 05:40:00,394 -  INFO - fouram]: [PerfTemplate] Actual parameters used: {'dataset_params': {'dim': 128, 'dataset_name': 'sift', 'dataset_size': 2000000, 'ni_per': 50000, 'metric_type': 'L2', 'req_run_counts': 100}, 'collection_params': {'other_fields': ['int64_1', 'int64_2', 'float_1', 'double_1', 'varchar_1']}, 'load_params': {'replica_number': 2, '_resource_groups': ['RG_0', 'RG_1']}, 'search_params': {'nq': 1, 'param': {'metric_type': 'L2', 'params': {'nprobe': 16}}, 'top_k': 10, 'expr': None}, 'resource_groups_params': {'groups': [2, 2], 'reset': True}, 'index_params': {'index_type': 'IVF_FLAT', 'index_param': {'nlist': 512}}} (performance_template.py:57)

After scaling in qn-10, one replica is on qn-2 and the other is on qn-6.

- Group: <group_id:439581689488343041>, <group_nodes:(6,)>, <shards:[Shard: <channel_name:fouram-scale-55qfb-wkltr-op-97-2917-rootcoord-dml_0_439581687858069512v0>, <shard_leader:6>, <shard_nodes:[6, 6, 6, 6]>, Shard: <channel_name:fouram-scale-55qfb-wkltr-op-97-2917-rootcoord-dml_1_439581687858069512v1>, <shard_leader:6>, <shard_nodes:[6, 6, 6, 6]>]>
- Group: <group_id:439581689488343042>, <group_nodes:(2,)>, <shards:[Shard: <channel_name:fouram-scale-55qfb-wkltr-op-97-2917-rootcoord-dml_1_439581687858069512v1>, <shard_leader:2>, <shard_nodes:[2, 2, 2]>, Shard: <channel_name:fouram-scale-55qfb-wkltr-op-97-2917-rootcoord-dml_0_439581687858069512v0>, <shard_leader:2>, <shard_nodes:[2, 2, 2]>]>

However, neither replica is fully loaded with all 4 segments and 2m entities:

c.get_query_segment_info('fouram_l9mj3l2F')
[segmentID: 439581687860870461
collectionID: 439581687858069512
partitionID: 439581687858069513
num_rows: 524736
state: Sealed
nodeIds: 2
nodeIds: 6
, segmentID: 439581687860870620
collectionID: 439581687858069512
partitionID: 439581687858069513
num_rows: 475258
state: Sealed
nodeIds: 2
nodeIds: 6
, segmentID: 439581687860870619
collectionID: 439581687858069512
partitionID: 439581687858069513
num_rows: 475316
state: Sealed
nodeIds: 6
]
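A hedged, hypothetical helper (not part of the fouram test code) that tallies the loaded rows per querynode from the get_query_segment_info output above, making the imbalance explicit:

from collections import defaultdict
from pymilvus import utility

def rows_per_node(collection_name):
    # sum num_rows of every sealed segment for each querynode it is loaded on
    rows = defaultdict(int)
    for seg in utility.get_query_segment_info(collection_name):
        for node_id in seg.nodeIds:
            rows[node_id] += seg.num_rows
    return dict(rows)

print(rows_per_node("fouram_l9mj3l2F"))
# with the output above: qn-6 holds all 3 listed segments (~1.48m rows),
# qn-2 holds only 2 of them (~1.0m rows), so neither replica has the full 2m entities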

Pod names:

fouram-scale-55qfb-wkltr-op-97-2917-milvus-datacoord-695bbsgbwf   1/1     Running     0               6h47m
fouram-scale-55qfb-wkltr-op-97-2917-milvus-datanode-7cc557ch7wj   1/1     Running     0               6h47m
fouram-scale-55qfb-wkltr-op-97-2917-milvus-indexcoord-7bb6d9bmh   1/1     Running     0               6h47m
fouram-scale-55qfb-wkltr-op-97-2917-milvus-indexnode-5699f992l7   1/1     Running     0               6h47m
fouram-scale-55qfb-wkltr-op-97-2917-milvus-proxy-6ffb54456whjts   1/1     Running     0               6h47m
fouram-scale-55qfb-wkltr-op-97-2917-milvus-querycoord-6489n6jt7   1/1     Running     0               6h47m
fouram-scale-55qfb-wkltr-op-97-2917-milvus-querynode-849c9cbqgd   1/1     Running     1 (6h38m ago)   6h47m
fouram-scale-55qfb-wkltr-op-97-2917-milvus-querynode-849c9llxcd   1/1     Running     0               6h47m
fouram-scale-55qfb-wkltr-op-97-2917-milvus-querynode-849c9pr55s   1/1     Running     0               6h47m
fouram-scale-55qfb-wkltr-op-97-2917-milvus-rootcoord-776df6pbpt   1/1     Running     0               6h47m
weiliu1031 commented 1 year ago

Please verify this.

weiliu1031 commented 1 year ago

/assign @ThreadDao

ThreadDao commented 1 year ago

Verified: after the nodes were transferred, the segments were rebalanced. Image: master-20230227-b758c305, argo workflow: fouram-scale-outt. Replica and segment info before the scale:

- Group: <group_id:439740487332528129>, <group_nodes:(9,)>, <shards:[Shard: <channel_name:by-dev-rootcoord-dml_0_439740485395808261v0>, <shard_leader:9>, <shard_nodes:[9, 9, 9, 9, 9, 9, 9, 9, 9]>, Shard: <channel_name:by-dev-rootcoord-dml_1_439740485395808261v1>, <shard_leader:9>, <shard_nodes:[9, 9, 9, 9, 9, 9, 9, 9, 9]>]>
- Group: <group_id:439740487332528130>, <group_nodes:(1,)>, <shards:[Shard: <channel_name:by-dev-rootcoord-dml_0_439740485395808261v0>, <shard_leader:1>, <shard_nodes:[1, 1, 1, 1, 1, 1, 1, 1, 1]>, Shard: <channel_name:by-dev-rootcoord-dml_1_439740485395808261v1>, <shard_leader:1>, <shard_nodes:[1, 1, 1, 1, 1, 1, 1, 1, 1]>]> (base.py:448)

# segments
[segmentID: 439740485397808858
collectionID: 439740485395808261
partitionID: 439740485395808262
num_rows: 175008
state: Sealed
nodeIds: 9
nodeIds: 1
, segmentID: 439740485398209064
collectionID: 439740485395808261
partitionID: 439740485395808262
num_rows: 124956
state: Sealed
nodeIds: 9
nodeIds: 1
, segmentID: 439740485398609213
collectionID: 439740485395808261
partitionID: 439740485395808262
num_rows: 349574
state: Sealed
nodeIds: 9
nodeIds: 1
, segmentID: 439740485397408734
collectionID: 439740485395808261
partitionID: 439740485395808262
num_rows: 375280
state: Sealed
nodeIds: 9
nodeIds: 1
, segmentID: 439740485398409088
collectionID: 439740485395808261
partitionID: 439740485395808262
num_rows: 99972
state: Sealed
nodeIds: 9
nodeIds: 1
, segmentID: 439740485397808879
collectionID: 439740485395808261
partitionID: 439740485395808262
num_rows: 175180
state: Sealed
nodeIds: 9
nodeIds: 1
, segmentID: 439740485398609214
collectionID: 439740485395808261
partitionID: 439740485395808262
num_rows: 350302
state: Sealed
nodeIds: 9
nodeIds: 1
, segmentID: 439740485397408731
collectionID: 439740485395808261
partitionID: 439740485395808262
num_rows: 349728
state: Sealed
nodeIds: 9
nodeIds: 1
]
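Between the two snapshots the querynodes were scaled out and a node was transferred out of the default resource group; a minimal sketch of that transfer (the target group name and the node count are assumptions for this scenario):

from pymilvus import utility

# the scale-out itself happens outside pymilvus (e.g. by raising the querynode replica count);
# afterwards the new node sits in the default resource group and can be moved:
utility.transfer_node("__default_resource_group", "RG_0", 1)
print(utility.describe_resource_group("RG_0"))  # confirm the node count of the target group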

Replica and segment info after the scale-out and the transfer from the default RG to RG_0:

- Group: <group_id:439740487332528129>, <group_nodes:(9, 10)>, <shards:[Shard: <channel_name:by-dev-rootcoord-dml_0_439740485395808261v0>, <shard_leader:9>, <shard_nodes:[9, 9, 9, 9, 10, 10, 10, 10]>, Shard: <channel_name:by-dev-rootcoord-dml_1_439740485395808261v1>, <shard_leader:9>, <shard_nodes:[9, 9, 9, 9, 10, 10, 10, 10]>]>
- Group: <group_id:439740487332528130>, <group_nodes:(4, 1)>, <shards:[Shard: <channel_name:by-dev-rootcoord-dml_0_439740485395808261v0>, <shard_leader:1>, <shard_nodes:[1, 1, 1, 4, 4, 4, 4]>, Shard: <channel_name:by-dev-rootcoord-dml_1_439740485395808261v1>, <shard_leader:1>, <shard_nodes:[1, 1, 1, 4, 4, 4, 4]>]> (base.py:448)
[2023-02-27 05:53:43,237 -  INFO - fouram]: [Base] collection fouram_0jJpzetZ query segment info: [segmentID: 439740485398609326
collectionID: 439740485395808261
partitionID: 439740485395808262
num_rows: 649692
state: Sealed
nodeIds: 4
nodeIds: 9
, segmentID: 439740485398609214
collectionID: 439740485395808261
partitionID: 439740485395808262
num_rows: 350302
state: Sealed
nodeIds: 4
nodeIds: 10
, segmentID: 439740485397408734
collectionID: 439740485395808261
partitionID: 439740485395808262
num_rows: 375280
state: Sealed
nodeIds: 4
nodeIds: 10
, segmentID: 439740485397808879
collectionID: 439740485395808261
partitionID: 439740485395808262
num_rows: 175180
state: Sealed
nodeIds: 4
nodeIds: 10
, segmentID: 439740485398609213
collectionID: 439740485395808261
partitionID: 439740485395808262
num_rows: 349574
state: Sealed
nodeIds: 9
nodeIds: 1
, segmentID: 439740485398409088
collectionID: 439740485395808261
partitionID: 439740485395808262
num_rows: 99972
state: Sealed
nodeIds: 9
nodeIds: 1
]