Open wangting0128 opened 1 month ago
It seems that the difference in flush policy makes the final segment sizes different. The workload is also very high; the root cause may be the scheduling policy of the querynode, since most of the cost is queue time rather than execution time.
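To make the "queue time vs. execution time" distinction concrete, here is a minimal, hypothetical Go sketch of instrumenting both durations for a queued task (made-up names, not the querynode's actual scheduler):

```go
package main

import (
	"fmt"
	"time"
)

// task carries the timestamps needed to separate queueing delay from execution cost.
type task struct {
	enqueuedAt time.Time
	run        func()
}

// worker drains the queue and reports queue time vs. execution time for each task.
func worker(queue <-chan task) {
	for t := range queue {
		queueTime := time.Since(t.enqueuedAt) // time spent waiting in the queue
		start := time.Now()
		t.run()
		execTime := time.Since(start) // time spent actually executing
		fmt.Printf("queue=%v exec=%v\n", queueTime, execTime)
	}
}

func main() {
	queue := make(chan task, 128)
	go worker(queue)
	queue <- task{enqueuedAt: time.Now(), run: func() { time.Sleep(10 * time.Millisecond) }}
	time.Sleep(100 * time.Millisecond) // crude wait so the demo worker can finish
}
```

A high queue-to-execution ratio in such a metric points at scheduling or concurrency limits rather than the cost of the work itself.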
may be related to #36761
/unassign
@wangting0128 please help rerun these tests with commit f0f5147aefe581b87e30b7b144dc801d7926322e. Thanks!
#36761 is merged, but I am not sure it fixes this issue. @wangting0128 please help rerun these tests with commit f0f5147aefe581b87e30b7b144dc801d7926322e. Thanks!
argo task: fouramf-kmshk image: master-20241014-d566b0ce-amd64
server:
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
verify-36804-rt-etcd-0 1/1 Running 0 9h 10.104.18.119 4am-node25 <none> <none>
verify-36804-rt-etcd-1 1/1 Running 0 9h 10.104.34.235 4am-node37 <none> <none>
verify-36804-rt-etcd-2 1/1 Running 0 9h 10.104.19.59 4am-node28 <none> <none>
verify-36804-rt-milvus-datanode-7b9b88b76b-2xctx 1/1 Running 2 (9h ago) 9h 10.104.32.154 4am-node39 <none> <none>
verify-36804-rt-milvus-datanode-7b9b88b76b-6mxxn 1/1 Running 2 (9h ago) 9h 10.104.4.149 4am-node11 <none> <none>
verify-36804-rt-milvus-datanode-7b9b88b76b-7x59m 1/1 Running 1 (9h ago) 9h 10.104.14.14 4am-node18 <none> <none>
verify-36804-rt-milvus-datanode-7b9b88b76b-7xz8j 1/1 Running 2 (9h ago) 9h 10.104.17.25 4am-node23 <none> <none>
verify-36804-rt-milvus-datanode-7b9b88b76b-c5gt4 1/1 Running 2 (9h ago) 9h 10.104.15.61 4am-node20 <none> <none>
verify-36804-rt-milvus-datanode-7b9b88b76b-mz7p6 1/1 Running 2 (9h ago) 9h 10.104.18.115 4am-node25 <none> <none>
verify-36804-rt-milvus-datanode-7b9b88b76b-r8nzw 1/1 Running 2 (9h ago) 9h 10.104.25.133 4am-node30 <none> <none>
verify-36804-rt-milvus-datanode-7b9b88b76b-vdsww 1/1 Running 2 (9h ago) 9h 10.104.9.182 4am-node14 <none> <none>
verify-36804-rt-milvus-datanode-7b9b88b76b-xwdv4 1/1 Running 1 (9h ago) 9h 10.104.13.126 4am-node16 <none> <none>
verify-36804-rt-milvus-datanode-7b9b88b76b-z6k7m 1/1 Running 2 (9h ago) 9h 10.104.19.52 4am-node28 <none> <none>
verify-36804-rt-milvus-indexnode-c646989df-msbkh 1/1 Running 2 (9h ago) 9h 10.104.4.150 4am-node11 <none> <none>
verify-36804-rt-milvus-indexnode-c646989df-qm5xp 1/1 Running 2 (9h ago) 9h 10.104.34.230 4am-node37 <none> <none>
verify-36804-rt-milvus-indexnode-c646989df-rwrl6 1/1 Running 0 9h 10.104.14.16 4am-node18 <none> <none>
verify-36804-rt-milvus-indexnode-c646989df-sbp24 1/1 Running 2 (9h ago) 9h 10.104.5.14 4am-node12 <none> <none>
verify-36804-rt-milvus-mixcoord-c7cc55b48-xfq49 1/1 Running 1 (9h ago) 9h 10.104.14.17 4am-node18 <none> <none>
verify-36804-rt-milvus-proxy-5cb97c6d46-rdb62 1/1 Running 3 (9h ago) 9h 10.104.4.151 4am-node11 <none> <none>
verify-36804-rt-milvus-querynode-7c464946df-86wj5 1/1 Running 2 (9h ago) 9h 10.104.9.183 4am-node14 <none> <none>
verify-36804-rt-milvus-querynode-7c464946df-hqdm5 1/1 Running 2 (9h ago) 9h 10.104.4.152 4am-node11 <none> <none>
verify-36804-rt-milvus-querynode-7c464946df-krb4q 1/1 Running 0 9h 10.104.14.18 4am-node18 <none> <none>
verify-36804-rt-milvus-querynode-7c464946df-s5277 1/1 Running 2 (9h ago) 9h 10.104.15.62 4am-node20 <none> <none>
verify-36804-rt-milvus-querynode-7c464946df-wz7nb 1/1 Running 1 (9h ago) 9h 10.104.20.147 4am-node22 <none> <none>
verify-36804-rt-milvus-streamingnode-59dbdd5fc8-brg85 1/1 Running 3 (9h ago) 9h 10.104.4.147 4am-node11 <none> <none>
verify-36804-rt-minio-0 1/1 Running 0 9h 10.104.20.149 4am-node22 <none> <none>
verify-36804-rt-minio-1 1/1 Running 0 9h 10.104.34.236 4am-node37 <none> <none>
verify-36804-rt-minio-2 1/1 Running 0 9h 10.104.17.27 4am-node23 <none> <none>
verify-36804-rt-minio-3 1/1 Running 0 9h 10.104.19.60 4am-node28 <none> <none>
verify-36804-rt-pulsar-bookie-0 1/1 Running 0 9h 10.104.30.6 4am-node38 <none> <none>
verify-36804-rt-pulsar-bookie-1 1/1 Running 0 9h 10.104.34.239 4am-node37 <none> <none>
verify-36804-rt-pulsar-bookie-2 1/1 Running 0 9h 10.104.17.33 4am-node23 <none> <none>
verify-36804-rt-pulsar-bookie-init-jtskk 0/1 Completed 0 9h 10.104.4.148 4am-node11 <none> <none>
verify-36804-rt-pulsar-broker-0 1/1 Running 0 9h 10.104.14.15 4am-node18 <none> <none>
verify-36804-rt-pulsar-proxy-0 1/1 Running 0 9h 10.104.5.15 4am-node12 <none> <none>
verify-36804-rt-pulsar-pulsar-init-hnnh5 0/1 Completed 0 9h 10.104.14.12 4am-node18 <none> <none>
verify-36804-rt-pulsar-recovery-0 1/1 Running 0 9h 10.104.14.13 4am-node18 <none> <none>
verify-36804-rt-pulsar-zookeeper-0 1/1 Running 0 9h 10.104.19.55 4am-node28 <none> <none>
verify-36804-rt-pulsar-zookeeper-1 1/1 Running 0 9h 10.104.24.66 4am-node29 <none> <none>
verify-36804-rt-pulsar-zookeeper-2 1/1 Running 0 9h 10.104.21.206 4am-node24 <none> <none>
client log: hybrid_search request timeout
@chyezh
After the flushing policy fix, Milvus with the streaming service finally generates 1.11k sealed segments, while Milvus without the streaming service finally generates 2k sealed segments. So the segments in Milvus with the streaming service are about twice the size of those in Milvus without it. This is the major difference between the two test cases.
Streaming:
No Streaming:
The message consumer of Milvus with the streaming service works correctly, so the problem is not introduced by the streaming service.
[2024/10/14 19:14:50.024 +00:00] [DEBUG] [pipeline/insert_node.go:80] ["pipeline fetch insert msg"] [collectionID=453223040984548132] [segmentID=453223041097817164] [insertRowNum=1] [timestampMin=453229488314253321] [timestampMax=453229488314253321]
The timestamp `453229488314253321` is `2024-10-14 19:14:49.773`
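For reference, a minimal Go sketch of decoding such a hybrid timestamp, assuming the common TSO layout of an 18-bit logical counter below the physical milliseconds (the decoded value matches the time quoted above):

```go
package main

import (
	"fmt"
	"time"
)

// logicalBits is the number of low bits reserved for the logical counter in a
// hybrid TSO timestamp (an assumption based on the common 18-bit TSO layout).
const logicalBits = 18

// parseTS splits a hybrid timestamp into its physical wall-clock part and logical counter.
func parseTS(ts uint64) (time.Time, uint64) {
	logical := ts & ((1 << logicalBits) - 1)
	physicalMs := ts >> logicalBits
	return time.UnixMilli(int64(physicalMs)).UTC(), logical
}

func main() {
	physical, logical := parseTS(453229488314253321)
	fmt.Println(physical, logical) // 2024-10-14 19:14:49.773 +0000 UTC 9
}
```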
Found that some requests still wait for tsafe for a long time whether streaming is used or not, and the ProcessInsert delay increases periodically:
No Streaming:
Streaming:
Found that inserting a new message took 19 minutes when a new segment was created:
[2024/10/14 19:14:50.024 +00:00] [DEBUG] [pipeline/insert_node.go:80] ["pipeline fetch insert msg"] [collectionID=453223040984548132] [segmentID=453223041097817164] [insertRowNum=1] [timestampMin=453229488314253321] [timestampMax=453229488314253321]
...
[2024/10/14 19:35:09.020 +00:00] [INFO] [delegator/delegator_data.go:341] ["add growing segments to delegator"] [collectionID=453223040984548132] [channel=by-dev-rootcoord-dml_3_453223040984548132v3] [replicaID=453223041145765889] [segmentIDs="[453223041097817164]"]
Found that the insert operation is blocked on acquiring the mutex growingSegmentLock. This mutex is also acquired by ReleaseSegments. The release operation of segment 453223041097816587 took 1h3m. And the release operation is blocked waiting for distribution expiration.
[2024/10/14 18:32:37.675 +00:00] [INFO] [querynodev2/services.go:545] ["received release segment request"] [traceID=be09552d1e1e59448c7da62ffb1a9f5f] [collectionID=453223040984548132] [shard=by-dev-rootcoord-dml_3_453223040984548132v3] [segmentIDs="[453223041097816587]"] [currentNodeID=6] [scope=Streaming] [needTransfer=true]
...
[2024/10/14 19:35:09.020 +00:00] [INFO] [segments/segment.go:1467] ["delete segment from memory"] [traceID=be09552d1e1e59448c7da62ffb1a9f5f] [collectionID=453223040984548132] [partitionID=453223040984548984] [segmentID=453223041097816587] [segmentType=Growing] [insertCount=1]
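The contention pattern described above can be illustrated with a simplified, hypothetical Go sketch (made-up type and method names, not the actual delegator code): a long-running release holds the same mutex that the insert path needs, so inserts stall for the full duration of the release.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// delegator is a hypothetical stand-in for the component guarding growing segments.
type delegator struct {
	growingSegmentLock sync.Mutex
}

// releaseSegments simulates a release that holds the lock for a long time
// (e.g. while waiting for the old distribution to expire).
func (d *delegator) releaseSegments(wait time.Duration) {
	d.growingSegmentLock.Lock()
	defer d.growingSegmentLock.Unlock()
	time.Sleep(wait)
}

// addGrowingSegment simulates the insert path that needs the same lock.
func (d *delegator) addGrowingSegment() {
	start := time.Now()
	d.growingSegmentLock.Lock()
	defer d.growingSegmentLock.Unlock()
	fmt.Printf("insert blocked for %v before acquiring the lock\n", time.Since(start))
}

func main() {
	d := &delegator{}
	go d.releaseSegments(2 * time.Second) // stands in for the 1h3m release in the report
	time.Sleep(10 * time.Millisecond)     // ensure the release grabs the lock first
	d.addGrowingSegment()                 // stalls until the release finishes
}
```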
The tsafe timeout should be fixed by PR #36997. Another difference found:
After the flushing policy fix, Milvus with the streaming service continuously generates flushed segments, and compaction executes more frequently and smoothly.
The flush operation is triggered by the binlog-file-number policy.
So the streaming service performs more compaction and handoff operations than Milvus without the streaming service, and reaches a lower final count of about 1000 L1 sealed segments.
Meanwhile, Milvus without the streaming service does not generate flushed segments steadily.
It performs less compaction and fewer handoff operations, and ends up with about 1750 L1 sealed segments.
As a result, Milvus without the streaming service encounters less race contention during handoff than Milvus with the streaming service, and shows better RT.
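For illustration, a hedged sketch of a binlog-file-number flush trigger of the kind described above (field names and threshold are made up, not the actual datacoord policy):

```go
package main

import "fmt"

// segmentStats is a hypothetical view of a growing segment's binlog state.
type segmentStats struct {
	ID        int64
	BinlogNum int // number of binlog files already written for this segment
}

// shouldFlushByBinlogNum seals/flushes a segment once its binlog file count
// reaches a configured threshold, regardless of how many rows it holds.
func shouldFlushByBinlogNum(s segmentStats, maxBinlogNum int) bool {
	return s.BinlogNum >= maxBinlogNum
}

func main() {
	seg := segmentStats{ID: 453223041097817164, BinlogNum: 32}
	// With a threshold of 32 binlog files, this segment would be flushed now, which
	// is why flushed segments appear continuously under a heavy insert workload.
	fmt.Println(shouldFlushByBinlogNum(seg, 32))
}
```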
@wangting0128 please retry the test at commit ac178eeea569cb5c1f86e57ebe448ac4e15f4cb4. Thanks.
At the latest commit, the tsafe problem is fixed, but the search latency is still high.
Found that commit f43527e increases the RT, while commit 03a78ec keeps the RT unchanged.
Found that the scalar search latency increased:
What master version is this compared with? Could this be impacted by null?
This is a comparison of the deployment of instances with and without streamingNode on the same case.
Nope, both tests ran on Milvus at different commits, with streaming disabled in both.
Is there a special reason why so many binlogs are actually generated?
There's a binlog-number-based flush policy in Milvus. In the previous implementation:
It has been fixed by #37037; Milvus with streaming now behaves consistently with Milvus without streaming.
Is there an existing issue for this?
Environment
Current Behavior
argo task: fouramf-streaming-node-corn-1728615600 test case name: test_bitmap_locust_dql_dml_partition_key_cluster
server:
enabled streamingNode👇
disabled streamingNode👇 release name: fouramf-bitmap-scenes-q27w2-7
client log:
Expected Behavior
No response
Steps To Reproduce
Milvus Log
No response
Anything else?
test result: