[Bug]: [CI] test_bulk_insert_all_field_with_numpy timeout due to slow stats task

bigsheeper commented 2 weeks ago

Is there an existing issue for this?

[X] I have searched the existing issues

Environment

- Milvus version:master
- Deployment mode(standalone or cluster):
- MQ type(rocksmq, pulsar or kafka):    
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

test_bulk_insert_all_field_with_numpy test timed out due to the slow execution of the stats task:

https://jenkins.milvus.io:18080/blue/organizations/jenkins/Milvus%20HA%20CI/detail/PR-37061/6/pipeline/94/

Key log: https://grafana-4am.zilliz.cc/explore?panes=%7B%22SLL%22:%7B%22datasource%22:%22vhI6Vw67k%22,%22queries%22:%5B%7B%22refId%22:%22B%22,%22expr%22:%22%7Bpod%3D~%5C%22md-37061-6-py-pr-milvus-.%2A%5C%22%7D%20%7C~%20%5C%22453426998487862690%5C%22%20%21~%20%5C%22sync%20import%5C%22%22,%22queryType%22:%22range%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22vhI6Vw67k%22%7D,%22editorMode%22:%22code%22%7D%5D,%22range%22:%7B%22from%22:%221729685610331%22,%22to%22:%221729688603589%22%7D%7D%7D&schemaVersion=1&orgId=1

Expected Behavior

No response

Steps To Reproduce

No response

Milvus Log

No response

Anything else?

No response

bigsheeper commented 2 weeks ago

Stats tasks are executed sequentially, and a large number of queued stats tasks caused the test to timeout: qjp6U5xW4L

bigsheeper commented 2 weeks ago

/assign @xiaocai2333 /unassign

bigsheeper commented 2 weeks ago

maybe we need to increase the concurrency of task execution in the indexNode? @xiaocai2333

xiaofan-luan commented 2 weeks ago

@bigsheeper

stats task only takes 1 core?

xiaofan-luan commented 2 weeks ago

if we increase concurrency, what is the reasonable concurrency?

xiaocai2333 commented 1 week ago

In CI testing, due to the relatively small amount of test data, we can appropriately increase the task concurrency of the index node. For example, default it's 1.

indexNode:
      scheduler:
        buildParallel: 4

xiaofan-luan commented 1 week ago

I'm doubting on this. how to put this into production? by default it should be 1 or 4? If one index task can fully utilize the cpu in a short time, can we simply refine the scheduler to issue more index build task faster?

xiaocai2333 commented 1 week ago

The fundamental reason is that we introduced stats into the process, which lengthens the overall process time. I think this is reasonable. And increasing the concurrency level to 4 is only to ease the pressure on CI; a more reasonable approach might be to increase the timeout duration. Also, I don’t think increasing concurrency is suitable for production—while the stats task may not fully utilize the CPU, it does occupy a significant amount of memory(full segment size).

xiaocai2333 commented 1 week ago

In a production environment, we should increase the number of indexnodes.

bigsheeper commented 1 week ago

I'm doubting on this. how to put this into production? by default it should be 1 or 4? If one index task can fully utilize the cpu in a short time, can we simply refine the scheduler to issue more index build task faster?

@xiaofan-luan I believe we need a more effective scheduling strategy. Estimating the CPU and memory requirements for each task and then determine task concurrency accordingly. I will implement this while developing the bulkinsert pooling.

liliu-z commented 1 week ago

Noob Q, can we make the stats task concurrent?
If not, I guess it will not be longer than the index building. So we can simply start the next index building task right after the previous one w/o waiting for the stats task.

BTW, I think this is not a 2.5 blocking issue @xiaocai2333

xiaocai2333 commented 1 week ago

Noob Q, can we make the stats task concurrent?

If not, I guess it will not be longer than the index building. So we can simply start the next index building task right after the previous one w/o waiting for the stats task.

BTW, I think this is not a 2.5 blocking issue @xiaocai2333

Yes, this is not a blocking issue for version 2.5, but it might affect the CI pass rate. We've already increased the timeout, and it rarely occurs now. Additionally, @bigsheeper has optimized stats task, so once that PR(#37373 ) is merged, this should no longer be an issue.

milvus-io / milvus