Closed: ThreadDao closed this issue 1 year ago
/assign @weiliu1031 /unassign
Update:
image: master-20230218-57f8de95
pymilvus: 2.3.0.dev38
argo workflow: fouram-scale-55qfb-wkltr
Test steps:
resources:
limits:
cpu: "1"
memory: 3Gi
requests:
cpu: "1"
memory: 2Gi
1. Create resource group RG_0 and transfer 2 query nodes into it; create RG_1 and transfer 2 query nodes into it.
2. Build the index: {'index_type': 'IVF_FLAT', 'metric_type': 'L2', 'params': {'nlist': 512}}
3. Load the collection with replica_number=2 into resource groups [RG_0, RG_1].
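A minimal pymilvus sketch of these steps, assuming a local endpoint and a vector field named `float_vector` (the collection name, index params, and load params are taken from the logs below):

```python
from pymilvus import Collection, connections, utility

connections.connect(host="127.0.0.1", port="19530")  # assumed endpoint

# Create two resource groups and move 2 query nodes into each from the default RG.
for rg in ("RG_0", "RG_1"):
    utility.create_resource_group(rg)
    utility.transfer_node("__default", rg, 2)

collection = Collection("fouram_l9mj3l2F")  # collection name from the logs below
collection.create_index(
    field_name="float_vector",  # assumed vector field name
    index_params={"index_type": "IVF_FLAT", "metric_type": "L2", "params": {"nlist": 512}},
)

# Load one replica into each resource group.
collection.load(replica_number=2, _resource_groups=["RG_0", "RG_1"])
```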
According to the above steps, one replica should remain available, while the other replica's querynode gets OOMKilled during scale-in due to insufficient memory.
Before scale-in, one replica is on qn-6 and qn-5, and the other replica is on qn-2 and qn-10:
- Group: <group_id:439581689488343041>, <group_nodes:(6, 5)>, <shards:[Shard: <channel_name:fouram-scale-55qfb-wkltr-op-97-2917-rootcoord-dml_0_439581687858069512v0>, <shard_leader:6>, <shard_nodes:[6, 6, 6, 5, 5]>, Shard: <channel_name:fouram-scale-55qfb-wkltr-op-97-2917-rootcoord-dml_1_439581687858069512v1>, <shard_leader:5>, <shard_nodes:[5, 6, 6, 5, 5]>]>
- Group: <group_id:439581689488343042>, <group_nodes:(2, 10)>, <shards:[Shard: <channel_name:fouram-scale-55qfb-wkltr-op-97-2917-rootcoord-dml_0_439581687858069512v0>, <shard_leader:2>, <shard_nodes:[2, 2, 2, 10, 10]>, Shard: <channel_name:fouram-scale-55qfb-wkltr-op-97-2917-rootcoord-dml_1_439581687858069512v1>, <shard_leader:10>, <shard_nodes:[10, 2, 2, 10, 10]>]> (base.py:448)
[2023-02-20 05:40:00,394 - INFO - fouram]: [PerfTemplate] Actual parameters used: {'dataset_params': {'dim': 128, 'dataset_name': 'sift', 'dataset_size': 2000000, 'ni_per': 50000, 'metric_type': 'L2', 'req_run_counts': 100}, 'collection_params': {'other_fields': ['int64_1', 'int64_2', 'float_1', 'double_1', 'varchar_1']}, 'load_params': {'replica_number': 2, '_resource_groups': ['RG_0', 'RG_1']}, 'search_params': {'nq': 1, 'param': {'metric_type': 'L2', 'params': {'nprobe': 16}}, 'top_k': 10, 'expr': None}, 'resource_groups_params': {'groups': [2, 2], 'reset': True}, 'index_params': {'index_type': 'IVF_FLAT', 'index_param': {'nlist': 512}}} (performance_template.py:57)
After scaling in qn-10, one replica is on qn-2 and the other replica is on qn-6:
- Group: <group_id:439581689488343041>, <group_nodes:(6,)>, <shards:[Shard: <channel_name:fouram-scale-55qfb-wkltr-op-97-2917-rootcoord-dml_0_439581687858069512v0>, <shard_leader:6>, <shard_nodes:[6, 6, 6, 6]>, Shard: <channel_name:fouram-scale-55qfb-wkltr-op-97-2917-rootcoord-dml_1_439581687858069512v1>, <shard_leader:6>, <shard_nodes:[6, 6, 6, 6]>]>
- Group: <group_id:439581689488343042>, <group_nodes:(2,)>, <shards:[Shard: <channel_name:fouram-scale-55qfb-wkltr-op-97-2917-rootcoord-dml_1_439581687858069512v1>, <shard_leader:2>, <shard_nodes:[2, 2, 2]>, Shard: <channel_name:fouram-scale-55qfb-wkltr-op-97-2917-rootcoord-dml_0_439581687858069512v0>, <shard_leader:2>, <shard_nodes:[2, 2, 2]>]>
However, neither replica is fully loaded with the expected 4 segments and 2M entities:
c.get_query_segment_info('fouram_l9mj3l2F')
[segmentID: 439581687860870461
collectionID: 439581687858069512
partitionID: 439581687858069513
num_rows: 524736
state: Sealed
nodeIds: 2
nodeIds: 6
, segmentID: 439581687860870620
collectionID: 439581687858069512
partitionID: 439581687858069513
num_rows: 475258
state: Sealed
nodeIds: 2
nodeIds: 6
, segmentID: 439581687860870619
collectionID: 439581687858069512
partitionID: 439581687858069513
num_rows: 475316
state: Sealed
nodeIds: 6
]
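A small sketch, equivalent to the `c.get_query_segment_info` call above but via the `utility` wrapper, that sums the loaded rows per query node and makes the missing copy of segment 439581687860870619 visible at a glance:

```python
from collections import defaultdict

from pymilvus import utility

# Sum loaded rows per query node; a node missing a segment shows fewer rows.
rows_per_node = defaultdict(int)
for seg in utility.get_query_segment_info("fouram_l9mj3l2F"):
    for node_id in seg.nodeIds:
        rows_per_node[node_id] += seg.num_rows

for node_id, rows in sorted(rows_per_node.items()):
    print(f"qn-{node_id}: {rows} rows loaded")
```

With the output above, this prints roughly 1.48M rows for qn-6 but only about 1.0M for qn-2, since qn-2 is missing segment 439581687860870619.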
pod names:
fouram-scale-55qfb-wkltr-op-97-2917-milvus-datacoord-695bbsgbwf 1/1 Running 0 6h47m
fouram-scale-55qfb-wkltr-op-97-2917-milvus-datanode-7cc557ch7wj 1/1 Running 0 6h47m
fouram-scale-55qfb-wkltr-op-97-2917-milvus-indexcoord-7bb6d9bmh 1/1 Running 0 6h47m
fouram-scale-55qfb-wkltr-op-97-2917-milvus-indexnode-5699f992l7 1/1 Running 0 6h47m
fouram-scale-55qfb-wkltr-op-97-2917-milvus-proxy-6ffb54456whjts 1/1 Running 0 6h47m
fouram-scale-55qfb-wkltr-op-97-2917-milvus-querycoord-6489n6jt7 1/1 Running 0 6h47m
fouram-scale-55qfb-wkltr-op-97-2917-milvus-querynode-849c9cbqgd 1/1 Running 1 (6h38m ago) 6h47m
fouram-scale-55qfb-wkltr-op-97-2917-milvus-querynode-849c9llxcd 1/1 Running 0 6h47m
fouram-scale-55qfb-wkltr-op-97-2917-milvus-querynode-849c9pr55s 1/1 Running 0 6h47m
fouram-scale-55qfb-wkltr-op-97-2917-milvus-rootcoord-776df6pbpt 1/1 Running 0 6h47m
Please verify this.
/assign @ThreadDao
Verified: after the nodes were transferred, the segments were balanced.
image: master-20230227-b758c305
argo workflow: fouram-scale-outt
Replica and segment info before scale:
- Group: <group_id:439740487332528129>, <group_nodes:(9,)>, <shards:[Shard: <channel_name:by-dev-rootcoord-dml_0_439740485395808261v0>, <shard_leader:9>, <shard_nodes:[9, 9, 9, 9, 9, 9, 9, 9, 9]>, Shard: <channel_name:by-dev-rootcoord-dml_1_439740485395808261v1>, <shard_leader:9>, <shard_nodes:[9, 9, 9, 9, 9, 9, 9, 9, 9]>]>
- Group: <group_id:439740487332528130>, <group_nodes:(1,)>, <shards:[Shard: <channel_name:by-dev-rootcoord-dml_0_439740485395808261v0>, <shard_leader:1>, <shard_nodes:[1, 1, 1, 1, 1, 1, 1, 1, 1]>, Shard: <channel_name:by-dev-rootcoord-dml_1_439740485395808261v1>, <shard_leader:1>, <shard_nodes:[1, 1, 1, 1, 1, 1, 1, 1, 1]>]> (base.py:448)
# segments
[segmentID: 439740485397808858
collectionID: 439740485395808261
partitionID: 439740485395808262
num_rows: 175008
state: Sealed
nodeIds: 9
nodeIds: 1
, segmentID: 439740485398209064
collectionID: 439740485395808261
partitionID: 439740485395808262
num_rows: 124956
state: Sealed
nodeIds: 9
nodeIds: 1
, segmentID: 439740485398609213
collectionID: 439740485395808261
partitionID: 439740485395808262
num_rows: 349574
state: Sealed
nodeIds: 9
nodeIds: 1
, segmentID: 439740485397408734
collectionID: 439740485395808261
partitionID: 439740485395808262
num_rows: 375280
state: Sealed
nodeIds: 9
nodeIds: 1
, segmentID: 439740485398409088
collectionID: 439740485395808261
partitionID: 439740485395808262
num_rows: 99972
state: Sealed
nodeIds: 9
nodeIds: 1
, segmentID: 439740485397808879
collectionID: 439740485395808261
partitionID: 439740485395808262
num_rows: 175180
state: Sealed
nodeIds: 9
nodeIds: 1
, segmentID: 439740485398609214
collectionID: 439740485395808261
partitionID: 439740485395808262
num_rows: 350302
state: Sealed
nodeIds: 9
nodeIds: 1
, segmentID: 439740485397408731
collectionID: 439740485395808261
partitionID: 439740485395808262
num_rows: 349728
state: Sealed
nodeIds: 9
nodeIds: 1
]
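For reference, a sketch of the transfer step used here, assuming the scaled-out query nodes first land in the default resource group (the num_node value of 1 is an assumption for illustration):

```python
from pymilvus import utility

# Newly scaled-out query nodes join the default RG; move one into RG_0.
utility.transfer_node("__default", "RG_0", 1)  # num_node assumed for illustration

# Inspect the resource groups to confirm where the nodes landed.
for name in utility.list_resource_groups():
    print(utility.describe_resource_group(name))
```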
Replica and segment info after scale-out and transferring nodes from the default RG to RG_0:
- Group: <group_id:439740487332528129>, <group_nodes:(9, 10)>, <shards:[Shard: <channel_name:by-dev-rootcoord-dml_0_439740485395808261v0>, <shard_leader:9>, <shard_nodes:[9, 9, 9, 9, 10, 10, 10, 10]>, Shard: <channel_name:by-dev-rootcoord-dml_1_439740485395808261v1>, <shard_leader:9>, <shard_nodes:[9, 9, 9, 9, 10, 10, 10, 10]>]>
- Group: <group_id:439740487332528130>, <group_nodes:(4, 1)>, <shards:[Shard: <channel_name:by-dev-rootcoord-dml_0_439740485395808261v0>, <shard_leader:1>, <shard_nodes:[1, 1, 1, 4, 4, 4, 4]>, Shard: <channel_name:by-dev-rootcoord-dml_1_439740485395808261v1>, <shard_leader:1>, <shard_nodes:[1, 1, 1, 4, 4, 4, 4]>]> (base.py:448)
[2023-02-27 05:53:43,237 - INFO - fouram]: [Base] collection fouram_0jJpzetZ query segment info: [segmentID: 439740485398609326
collectionID: 439740485395808261
partitionID: 439740485395808262
num_rows: 649692
state: Sealed
nodeIds: 4
nodeIds: 9
, segmentID: 439740485398609214
collectionID: 439740485395808261
partitionID: 439740485395808262
num_rows: 350302
state: Sealed
nodeIds: 4
nodeIds: 10
, segmentID: 439740485397408734
collectionID: 439740485395808261
partitionID: 439740485395808262
num_rows: 375280
state: Sealed
nodeIds: 4
nodeIds: 10
, segmentID: 439740485397808879
collectionID: 439740485395808261
partitionID: 439740485395808262
num_rows: 175180
state: Sealed
nodeIds: 4
nodeIds: 10
, segmentID: 439740485398609213
collectionID: 439740485395808261
partitionID: 439740485395808262
num_rows: 349574
state: Sealed
nodeIds: 9
nodeIds: 1
, segmentID: 439740485398409088
collectionID: 439740485395808261
partitionID: 439740485395808262
num_rows: 99972
state: Sealed
nodeIds: 9
nodeIds: 1
]
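A quick way to confirm the balance, cross-checking the replica groups from `get_replicas()` against the segment info above (a sketch using the pymilvus utility API):

```python
from collections import defaultdict

from pymilvus import Collection, utility

collection = Collection("fouram_0jJpzetZ")  # collection name from the log above

# Map every loaded segment to the query nodes serving it.
segments_per_node = defaultdict(set)
for seg in utility.get_query_segment_info(collection.name):
    for node_id in seg.nodeIds:
        segments_per_node[node_id].add(seg.segmentID)

# Each replica group's nodes should jointly hold every segment.
all_segments = set().union(*segments_per_node.values())
for group in collection.get_replicas().groups:
    held = set().union(*(segments_per_node[n] for n in group.group_nodes))
    print(f"replica {group.id}: {len(held)}/{len(all_segments)} segments loaded")
```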
Is there an existing issue for this?
Environment
Current Behavior
There are two replicas of this collection in two resource groups (set up by creating resource group RG_0 and transferring 2 querynodes into RG_0 from the default RG). One replica is unavailable due to querynode scale-in and insufficient memory; the other is available. But the search returns an exception.
Expected Behavior
The search is expected to succeed because one of the replicas is still available.
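For reference, the search that should keep succeeding while one replica is healthy; a sketch assuming a vector field named `float_vector`, with nq, nprobe, and top_k taken from the test parameters above:

```python
from pymilvus import Collection

collection = Collection("fouram_l9mj3l2F")  # collection name from the logs above

results = collection.search(
    data=[[0.0] * 128],  # one 128-dim query vector (nq=1); values assumed
    anns_field="float_vector",  # assumed vector field name
    param={"metric_type": "L2", "params": {"nprobe": 16}},
    limit=10,  # top_k
)
print(results)
```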
Steps To Reproduce
Milvus Log
You can search for logs in Loki. If you want the raw log files for some components, tell me. cluster: devops, namespace: chaos-testing, server pods:
Anything else?
No response