bigsheeper opened this issue 2 weeks ago
Stats tasks are executed sequentially, and a large number of queued stats tasks caused the test to time out.
/assign @xiaocai2333 /unassign
Maybe we need to increase the concurrency of task execution in the indexNode? @xiaocai2333
@bigsheeper
Does the stats task only take 1 core?
If we increase concurrency, what would a reasonable concurrency be?
In CI testing, since the amount of test data is relatively small, we can safely increase the task concurrency of the index node. The default is 1; for example, we could raise it to 4:
```yaml
indexNode:
  scheduler:
    buildParallel: 4
```
I have doubts about this. How would we put this into production? Should the default be 1 or 4? If one index task can fully utilize the CPU for only a short time, can we simply refine the scheduler to issue index build tasks faster?
The fundamental reason is that we introduced stats into the process, which lengthens the overall process time; I think this is reasonable. Increasing the concurrency level to 4 only eases the pressure on CI; a more reasonable approach might be to increase the timeout duration. Also, I don't think increasing concurrency is suitable for production: while a stats task may not fully utilize the CPU, it does occupy a significant amount of memory (the full segment size).
In a production environment, we should instead increase the number of index nodes.
> I have doubts about this. How would we put this into production? Should the default be 1 or 4? If one index task can fully utilize the CPU for only a short time, can we simply refine the scheduler to issue index build tasks faster?
@xiaofan-luan I believe we need a more effective scheduling strategy: estimating the CPU and memory requirements of each task and then determining task concurrency accordingly (a rough sketch of the idea is below). I will implement this while developing the bulkinsert pooling.
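Something along these lines, as a minimal Go sketch; all names here (`taskSlot`, `scheduler`, the 8-core/16 GiB budget, the per-task estimates) are hypothetical and not Milvus APIs. Each task declares its estimated cost, and the scheduler admits it only while the node-level budget allows:

```go
package main

import (
	"fmt"
	"sync"
)

// taskSlot is a hypothetical estimate of what one task consumes.
// For a stats task the memory estimate would be roughly the full
// segment size, per the discussion above.
type taskSlot struct {
	cpuCores int
	memBytes int64
}

// scheduler admits tasks while the remaining node budget allows it.
type scheduler struct {
	mu      sync.Mutex
	cond    *sync.Cond
	freeCPU int
	freeMem int64
}

func newScheduler(totalCPU int, totalMem int64) *scheduler {
	s := &scheduler{freeCPU: totalCPU, freeMem: totalMem}
	s.cond = sync.NewCond(&s.mu)
	return s
}

// acquire blocks until the task's estimated resources fit the budget.
func (s *scheduler) acquire(t taskSlot) {
	s.mu.Lock()
	defer s.mu.Unlock()
	for s.freeCPU < t.cpuCores || s.freeMem < t.memBytes {
		s.cond.Wait()
	}
	s.freeCPU -= t.cpuCores
	s.freeMem -= t.memBytes
}

// release returns the task's resources and wakes up waiting tasks.
func (s *scheduler) release(t taskSlot) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.freeCPU += t.cpuCores
	s.freeMem += t.memBytes
	s.cond.Broadcast()
}

func main() {
	// Example budget: 8 cores, 16 GiB.
	sched := newScheduler(8, 16<<30)

	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			// A stats task: 1 core, ~4 GiB (assumed segment size).
			t := taskSlot{cpuCores: 1, memBytes: 4 << 30}
			sched.acquire(t)
			defer sched.release(t)
			fmt.Printf("task %d running\n", id)
		}(i)
	}
	wg.Wait()
}
```

With this, a memory-heavy stats task naturally limits how many run at once, while lighter tasks could still run at higher concurrency.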
BTW, I think this is not a 2.5 blocking issue @xiaocai2333
- Noob Q, can we make the stats task concurrent?
- If not, I guess it won't take longer than the index building does. So we could simply start the next index build right after the previous one, without waiting for the stats task (see the sketch after this list).
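A minimal Go sketch of that pipelining idea, with placeholder functions (`buildIndex`, `computeStats`) rather than actual Milvus code: builds stay sequential, but each finished build hands its stats work to a separate worker so the next build starts immediately:

```go
package main

import (
	"fmt"
	"sync"
)

func buildIndex(seg int)   { fmt.Printf("built index for segment %d\n", seg) }
func computeStats(seg int) { fmt.Printf("computed stats for segment %d\n", seg) }

func main() {
	segments := []int{1, 2, 3}

	statsCh := make(chan int, len(segments))
	var wg sync.WaitGroup

	// Dedicated stats worker: drains the queue independently of builds.
	wg.Add(1)
	go func() {
		defer wg.Done()
		for seg := range statsCh {
			computeStats(seg)
		}
	}()

	// Index builds stay sequential, but never block on stats.
	for _, seg := range segments {
		buildIndex(seg)
		statsCh <- seg // enqueue stats; start the next build immediately
	}
	close(statsCh)
	wg.Wait()
}
```

The buffered channel keeps stats work from ever blocking the build loop; back-pressure would only be needed if stats consistently ran slower than builds.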
> BTW, I think this is not a 2.5 blocking issue @xiaocai2333
Yes, this is not a blocking issue for version 2.5, but it might affect the CI pass rate. We've already increased the timeout, and it rarely occurs now. Additionally, @bigsheeper has optimized the stats task, so once that PR (#37373) is merged, this should no longer be an issue.
Is there an existing issue for this?
Environment
Current Behavior
test_bulk_insert_all_field_with_numpy test timed out due to the slow execution of the stats task:
https://jenkins.milvus.io:18080/blue/organizations/jenkins/Milvus%20HA%20CI/detail/PR-37061/6/pipeline/94/
Key log: https://grafana-4am.zilliz.cc/explore?panes=%7B%22SLL%22:%7B%22datasource%22:%22vhI6Vw67k%22,%22queries%22:%5B%7B%22refId%22:%22B%22,%22expr%22:%22%7Bpod%3D~%5C%22md-37061-6-py-pr-milvus-.%2A%5C%22%7D%20%7C~%20%5C%22453426998487862690%5C%22%20%21~%20%5C%22sync%20import%5C%22%22,%22queryType%22:%22range%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22vhI6Vw67k%22%7D,%22editorMode%22:%22code%22%7D%5D,%22range%22:%7B%22from%22:%221729685610331%22,%22to%22:%221729688603589%22%7D%7D%7D&schemaVersion=1&orgId=1
Expected Behavior
No response
Steps To Reproduce
No response
Milvus Log
No response
Anything else?
No response