milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0
30.95k stars 2.95k forks

[Bug]: The flush time of Milvus is very long, ranging from 70s to 110s. #37493

Closed: zhuwenxing closed this issue 4 days ago

zhuwenxing commented 2 weeks ago

Is there an existing issue for this?

Environment

- Milvus version: master-20241106-b3de4b0e-amd64
- Deployment mode (standalone or cluster): cluster
- MQ type (rocksmq, pulsar or kafka):
- SDK version (e.g. pymilvus v2.0.0rc2):
- OS (Ubuntu or CentOS):
- CPU/Memory:
- GPU:
- Others:

Current Behavior

[2024-11-06T10:13:02.687Z] FAILED testcases/test_concurrent_operation.py::TestOperations::test_operations[Checker__KRqSt70I] - pytest_assume.plugin.FailedAssumption: 

[2024-11-06T10:13:02.687Z] 1 Failed Assumptions:

[2024-11-06T10:13:02.687Z] 

[2024-11-06T10:13:02.687Z] chaos_commons.py:124: AssumptionFailure

[2024-11-06T10:13:02.687Z] >>   pytest.assume(

[2024-11-06T10:13:02.687Z] AssertionError: Expect Succ: Op.flush succ rate 1.0, total: 2, average time: 75.4042

[2024-11-06T10:13:02.687Z] assert False

[2024-11-06T10:13:02.687Z] FAILED testcases/test_concurrent_operation.py::TestOperations::test_operations[Checker__NWHr1eht] - pytest_assume.plugin.FailedAssumption: 

[2024-11-06T10:13:02.687Z] 1 Failed Assumptions:

[2024-11-06T10:13:02.687Z] 

[2024-11-06T10:13:02.687Z] chaos_commons.py:124: AssumptionFailure

[2024-11-06T10:13:02.687Z] >>   pytest.assume(

[2024-11-06T10:13:02.687Z] AssertionError: Expect Succ: Op.flush succ rate 1.0, total: 1, average time: 68.9599

[2024-11-06T10:13:02.687Z] assert False

[2024-11-06T10:13:02.687Z] FAILED testcases/test_concurrent_operation.py::TestOperations::test_operations[Checker__7iI3MVio] - pytest_assume.plugin.FailedAssumption: 

[2024-11-06T10:13:02.687Z] 1 Failed Assumptions:

[2024-11-06T10:13:02.687Z] 

[2024-11-06T10:13:02.687Z] chaos_commons.py:124: AssumptionFailure

[2024-11-06T10:13:02.687Z] >>   pytest.assume(

[2024-11-06T10:13:02.687Z] AssertionError: Expect Succ: Op.flush succ rate 1.0, total: 1, average time: 101.9092

[2024-11-06T10:13:02.687Z] assert False

[2024-11-06T10:13:02.687Z] FAILED testcases/test_concurrent_operation.py::TestOperations::test_operations[Checker__fV1KEsWq] - pytest_assume.plugin.FailedAssumption: 

[2024-11-06T10:13:02.687Z] 1 Failed Assumptions:

[2024-11-06T10:13:02.687Z] 

[2024-11-06T10:13:02.687Z] chaos_commons.py:124: AssumptionFailure

[2024-11-06T10:13:02.687Z] >>   pytest.assume(

[2024-11-06T10:13:02.687Z] AssertionError: Expect Succ: Op.flush succ rate 1.0, total: 2, average time: 87.3646

[2024-11-06T10:13:02.687Z] assert False

[2024-11-06T10:13:02.687Z] FAILED testcases/test_concurrent_operation.py::TestOperations::test_operations[Checker__vImac9iA] - pytest_assume.plugin.FailedAssumption: 

[2024-11-06T10:13:02.687Z] 1 Failed Assumptions:

[2024-11-06T10:13:02.687Z] 

[2024-11-06T10:13:02.687Z] chaos_commons.py:124: AssumptionFailure

[2024-11-06T10:13:02.687Z] >>   pytest.assume(

[2024-11-06T10:13:02.687Z] AssertionError: Expect Succ: Op.flush succ rate 1.0, total: 2, average time: 110.0239

Expected Behavior

Flush should finish within 10 seconds.

Steps To Reproduce

No response

Milvus Log

failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/chaos-test-cron/detail/chaos-test-cron/19388/pipeline
log: artifacts-etcd-followers-pod-failure-19388-server-logs.tar.gz

Anything else?

This issue occurred after the etcd-follower pod-failure chaos, but that chaos did not affect the system, so I think the slow flush is caused by the system itself rather than by the injected chaos.

zhuwenxing commented 2 weeks ago

/assign @XuanYang-cn

xiaofan-luan commented 2 weeks ago

mark this as urgent

XuanYang-cn commented 2 weeks ago

Flush took 1m57s. FlushTasks are queuing in the proxy's scheduler, and some DataNodes have long-running FlushChannels RPCs (36s), which might be the cause of the queuing. [screenshot]

The in-queue time is too long. [screenshot]
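For reference, a minimal sketch of what the in-queue time measures (flushTask and taskScheduler here are hypothetical, not the actual proxy scheduler): the wait between enqueueing a task and the scheduler starting to run it.

```go
// Hypothetical sketch of measuring how long a flush task waits in a
// scheduler queue before it starts executing. Not Milvus's actual types.
package main

import (
	"fmt"
	"time"
)

type flushTask struct {
	enqueueTime time.Time
	run         func()
}

type taskScheduler struct {
	queue chan *flushTask
}

func newTaskScheduler(size int) *taskScheduler {
	return &taskScheduler{queue: make(chan *flushTask, size)}
}

func (s *taskScheduler) enqueue(t *flushTask) {
	t.enqueueTime = time.Now()
	s.queue <- t
}

func (s *taskScheduler) loop() {
	for t := range s.queue {
		// in-queue time: from enqueue until the scheduler picks the task up
		if inQueue := time.Since(t.enqueueTime); inQueue > time.Second {
			fmt.Printf("task waited %v in queue\n", inQueue)
		}
		t.run()
	}
}

func main() {
	s := newTaskScheduler(16)
	go s.loop()
	s.enqueue(&flushTask{run: func() { fmt.Println("flush executed") }})
	time.Sleep(100 * time.Millisecond)
	close(s.queue)
}
```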

xiaofan-luan commented 2 weeks ago

> Flush took 1m57s. FlushTasks are queuing in the proxy's scheduler, and some DataNodes have long-running FlushChannels RPCs (36s), which might be the cause of the queuing. [screenshot]
>
> The in-queue time is too long. [screenshot]

Is there a lock there?

xiaofan-luan commented 2 weeks ago

this seems to be a critical problem

xiaofan-luan commented 2 weeks ago

The only place that blocks flush channels is here. Is this the key place where the flush channel gets blocked?

[screenshot]
XuanYang-cn commented 2 weeks ago

I'm running tests, but they are blocked by #37568.

XuanYang-cn commented 2 weeks ago

Not reproduced at https://qa-jenkins.milvus.io/blue/organizations/jenkins/chaos-test-cron/detail/chaos-test-cron/19421/pipeline/355, but there are several long-running FlushChannel RPCs; looking into it. [screenshot]

XuanYang-cn commented 2 weeks ago

https://qa-jenkins.milvus.io/blue/organizations/jenkins/chaos-test-cron/detail/chaos-test-cron/19437/pipeline/

XuanYang-cn commented 2 weeks ago

I'll merge the logging enhancements and see if this is reproducible.

XuanYang-cn commented 2 weeks ago

Replacing the locks with metric-logging wrappers to give it a go.

XuanYang-cn commented 2 weeks ago

Caught a long-held lock in memoryCheck when the DataNode memory water level is very low.

[screenshots]
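For reference, a minimal sketch of the kind of instrumentation this refers to (a hypothetical timedMutex wrapper, not Milvus's actual code): wrap the lock so that long waits and long hold times inside memoryCheck show up in the logs.

```go
// Sketch of instrumenting a lock to catch long waits and long hold times.
// timedMutex and the "memoryCheck" site name are hypothetical.
package main

import (
	"log"
	"sync"
	"time"
)

type timedMutex struct {
	mu       sync.Mutex
	acquired time.Time
}

func (m *timedMutex) Lock(site string) {
	start := time.Now()
	m.mu.Lock()
	if wait := time.Since(start); wait > time.Second {
		log.Printf("%s waited %v for the lock", site, wait)
	}
	m.acquired = time.Now()
}

func (m *timedMutex) Unlock(site string) {
	if held := time.Since(m.acquired); held > time.Second {
		log.Printf("%s held the lock for %v", site, held)
	}
	m.mu.Unlock()
}

func main() {
	var m timedMutex
	// Simulate memoryCheck holding the lock for a long time.
	m.Lock("memoryCheck")
	time.Sleep(1500 * time.Millisecond)
	m.Unlock("memoryCheck")
}
```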

xiaofan-luan commented 1 week ago

@XuanYang-cn 1s doesn't seem long enough? Maybe some queue is blocked?

XuanYang-cn commented 1 week ago

@xiaofan-luan The metric_mutex has some concurrency issues and most metric results are zero; I'm working on fixing that.

Tracing shows that the failed and long-running flushes are mostly blocked at FlushChannels; this is a recent test result. There is no queue in the DataNode, so I believe it's waiting to acquire a lock, and I'm trying to catch it. [screenshot]

XuanYang-cn commented 1 week ago

The above tests involving metric_mutex are all wrong; there are concurrency issues with RLock that may cause resource leaks and locks that are never released. After 2 days of fixes, it seems the concurrency problem can only be fixed by introducing another lock :(

So I discarded the metric_mutex fixes, added the metrics directly in FlushChannel and memCheck, and will use those metrics instead. I'm giving it a last shot today.

Since this hasn't reproduced lately, I suggest marking it non-critical. cc @zhuwenxing, correct me if I'm wrong.
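For context on the RLock issue, a generic Go sketch of two well-known ways RLock misuse can hang or leak (illustrative only, not the actual metric_mutex code):

```go
// Generic sketch of two RLock pitfalls; not the actual metric_mutex code.
package main

import (
	"fmt"
	"sync"
)

var mu sync.RWMutex

// Pitfall 1: re-acquiring a read lock while a writer may be waiting.
// If another goroutine calls mu.Lock() between the two RLock calls, the
// second RLock queues behind the writer and the writer waits for the first
// read lock to be released: deadlock.
func recursiveRead() {
	mu.RLock()
	defer mu.RUnlock()
	mu.RLock() // unsafe whenever writers can be waiting
	defer mu.RUnlock()
}

// Pitfall 2: an early return without RUnlock leaves the read lock held
// forever, so every later Lock() blocks. Always pair RLock with a deferred
// RUnlock.
func leakyRead(fail bool) {
	mu.RLock()
	if fail {
		return // leaked read lock: mu.Lock() would now block forever
	}
	mu.RUnlock()
}

func main() {
	leakyRead(false) // safe path; the pitfalls above are for illustration
	recursiveRead()  // only safe here because no writer is competing
	fmt.Println("done")
}
```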

liliu-z commented 1 week ago

> The above tests involving metric_mutex are all wrong; there are concurrency issues with RLock that may cause resource leaks and locks that are never released. After 2 days of fixes, it seems the concurrency problem can only be fixed by introducing another lock :(
>
> So I discarded the metric_mutex fixes, added the metrics directly in FlushChannel and memCheck, and will use those metrics instead. I'm giving it a last shot today.
>
> Since this hasn't reproduced lately, I suggest marking it non-critical. cc @zhuwenxing, correct me if I'm wrong.

We can reevaluate the urgency tomorrow. Let's get more clues first.

XuanYang-cn commented 1 week ago

Long-running FlushChannels and Flush not reproduced:
https://qa-jenkins.milvus.io/blue/organizations/jenkins/chaos-test-cron/detail/chaos-test-cron/19478/pipeline
https://qa-jenkins.milvus.io/blue/organizations/jenkins/chaos-test-cron/detail/chaos-test-cron/19479/pipeline

The locks of MemCheck and FlushChannel do have some contention. [screenshot]

XuanYang-cn commented 1 week ago

Reproduced! [screenshot]

https://qa-jenkins.milvus.io/blue/organizations/jenkins/chaos-test-cron/detail/chaos-test-cron/19480/pipeline/ [screenshots]

XuanYang-cn commented 1 week ago

Detailed reason: when the DataNode meets a heavy workload (TimeTickLag reaches >= 3s), 100 channels

[screenshots]

I'll make a patch to change the memoryCheck Lock to an RLock, so it won't influence Flush any more. However, it might still be the same for flushing a newly created collection, since RegisterChannel takes the write Lock, which will be blocked by a long-running memoryCheck.
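Roughly the shape of that change, as a sketch with hypothetical names (channelManager, memoryCheck, RegisterChannel here are illustrative, not the actual Milvus code): memoryCheck only reads, so it can take the read lock, while RegisterChannel keeps the write lock and can therefore still be delayed by a long-running memoryCheck.

```go
// Sketch of the Lock -> RLock change described above; names are hypothetical.
package main

import (
	"sync"
	"time"
)

type channelManager struct {
	mu       sync.RWMutex
	channels map[string]int64 // channel name -> buffered memory size
}

// memoryCheck only reads, so a read lock lets Flush-related readers proceed
// concurrently instead of serializing behind it.
func (m *channelManager) memoryCheck() {
	m.mu.RLock()
	defer m.mu.RUnlock()
	for range m.channels {
		time.Sleep(time.Millisecond) // simulate slow per-channel work
	}
}

// RegisterChannel mutates the map, so it still needs the write lock; a
// long-running memoryCheck holding the read lock will still delay it.
func (m *channelManager) RegisterChannel(name string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.channels[name] = 0
}

func main() {
	m := &channelManager{channels: map[string]int64{"ch-1": 0}}
	m.memoryCheck()
	m.RegisterChannel("ch-2")
}
```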

yanliang567 commented 1 week ago

@XuanYang-cn is this ready for the issue to be verified now?

XuanYang-cn commented 1 week ago

/assign @zhuwenxing /unassign
Please help verify. It's not that easy to reproduce, so please try multiple times. Thanks!

congqixia commented 1 week ago

This should be fixed by #37525. Could you please verify? /assign @zhuwenxing

xiaofan-luan commented 1 week ago

The key here is that we need to remove many locks. For example, the MemorySize of the writeBuffer is locked by BufferData.

This needs to become a set of atomics or concurrent data structures.
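A minimal sketch of that suggestion, with hypothetical names modelled on the discussion (not Milvus's actual implementation): track the buffer's memory size with an atomic counter so readers don't contend on the buffer's mutex.

```go
// Sketch of tracking a buffer's memory size with an atomic instead of a lock.
// writeBuffer, BufferData, and MemorySize follow the discussion above, but
// the code itself is illustrative, not Milvus's implementation.
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

type writeBuffer struct {
	mu         sync.Mutex
	rows       [][]byte
	memorySize atomic.Int64 // readable without taking mu
}

func (b *writeBuffer) BufferData(row []byte) {
	b.mu.Lock()
	b.rows = append(b.rows, row)
	b.mu.Unlock()
	b.memorySize.Add(int64(len(row)))
}

// MemorySize can now be called from memory checks or metrics collection
// without contending on the buffer's mutex.
func (b *writeBuffer) MemorySize() int64 {
	return b.memorySize.Load()
}

func main() {
	b := &writeBuffer{}
	b.BufferData(make([]byte, 1024))
	fmt.Println("memory size:", b.MemorySize())
}
```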

xiaofan-luan commented 1 week ago

@XuanYang-cn could you help with this?

It's always good to use a concurrent data structure unless a lock is truly required. And remember: it is always dangerous to acquire a lock while holding another lock, or to hold a lock across an RPC.
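A small generic sketch of the second rule (not Milvus's code): copy what you need under the lock, release it, and only then make the RPC.

```go
// Sketch of avoiding a lock held across an RPC; generic code, not Milvus's.
package main

import (
	"context"
	"sync"
	"time"
)

type state struct {
	mu       sync.Mutex
	channels []string
}

// flushChannels copies the channel list under the lock, then releases the
// lock before the (potentially slow) RPC, so other lock users are not
// blocked for the duration of the network call.
func (s *state) flushChannels(ctx context.Context, rpc func(ctx context.Context, chs []string) error) error {
	s.mu.Lock()
	chs := append([]string(nil), s.channels...)
	s.mu.Unlock()
	return rpc(ctx, chs) // no lock held here
}

func main() {
	s := &state{channels: []string{"ch-1", "ch-2"}}
	slowRPC := func(ctx context.Context, chs []string) error {
		time.Sleep(100 * time.Millisecond) // simulated network latency
		return nil
	}
	_ = s.flushChannels(context.Background(), slowRPC)
}
```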

zhuwenxing commented 1 week ago

@XuanYang-cn

For further improvements, you can open an enhancement issue. I would like to close this issue since it has not reproduced recently.

zhuwenxing commented 1 week ago

Oops, it still reproduced.

etcd pod-kill chaos test
image tag: master-20241118-f2a2fd68-amd64

failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/chaos-test-cron/detail/chaos-test-cron/19513/pipeline
log: artifacts-etcd-pod-kill-19513-server-logs.tar.gz


[2024-11-18T08:23:29.435Z] FAILED testcases/test_concurrent_operation.py::TestOperations::test_operations[Checker__i43xi44m] - pytest_assume.plugin.FailedAssumption: 

[2024-11-18T08:23:29.435Z] 1 Failed Assumptions:

[2024-11-18T08:23:29.435Z] 

[2024-11-18T08:23:29.435Z] chaos_commons.py:124: AssumptionFailure

[2024-11-18T08:23:29.435Z] >>   pytest.assume(

[2024-11-18T08:23:29.435Z] AssertionError: Expect Succ: Op.flush succ rate 1.0, total: 2, average time: 76.7162

[2024-11-18T08:23:29.435Z] assert False

[2024-11-18T08:23:29.435Z] FAILED testcases/test_concurrent_operation.py::TestOperations::test_operations[Checker__S8VFOPMg] - pytest_assume.plugin.FailedAssumption: 

[2024-11-18T08:23:29.435Z] 1 Failed Assumptions:

[2024-11-18T08:23:29.435Z] 

[2024-11-18T08:23:29.435Z] chaos_commons.py:124: AssumptionFailure

[2024-11-18T08:23:29.435Z] >>   pytest.assume(

[2024-11-18T08:23:29.435Z] AssertionError: Expect Succ: Op.flush succ rate 1.0, total: 2, average time: 108.7892

[2024-11-18T08:23:29.435Z] assert False

[2024-11-18T08:23:29.435Z] FAILED testcases/test_concurrent_operation.py::TestOperations::test_operations[Checker__rntGoS3n] - pytest_assume.plugin.FailedAssumption: 

[2024-11-18T08:23:29.435Z] 1 Failed Assumptions:

[2024-11-18T08:23:29.435Z] 

[2024-11-18T08:23:29.435Z] chaos_commons.py:124: AssumptionFailure

[2024-11-18T08:23:29.435Z] >>   pytest.assume(

[2024-11-18T08:23:29.435Z] AssertionError: Expect Succ: Op.flush succ rate 1.0, total: 2, average time: 100.2917

[2024-11-18T08:23:29.435Z] assert False

[2024-11-18T08:23:29.435Z] FAILED testcases/test_concurrent_operation.py::TestOperations::test_operations[Checker__IGV8i564] - pytest_assume.plugin.FailedAssumption: 

[2024-11-18T08:23:29.435Z] 1 Failed Assumptions:

[2024-11-18T08:23:29.435Z] 

[2024-11-18T08:23:29.435Z] chaos_commons.py:124: AssumptionFailure

[2024-11-18T08:23:29.435Z] >>   pytest.assume(

[2024-11-18T08:23:29.435Z] AssertionError: Expect Succ: Op.flush succ rate 1.0, total: 2, average time: 97.3172

[2024-11-18T08:23:29.435Z] assert False

[2024-11-18T08:23:29.435Z] FAILED testcases/test_concurrent_operation.py::TestOperations::test_operations[Checker__Nh520hRa] - pytest_assume.plugin.FailedAssumption: 

[2024-11-18T08:23:29.435Z] 1 Failed Assumptions:

[2024-11-18T08:23:29.435Z] 

[2024-11-18T08:23:29.435Z] chaos_commons.py:124: AssumptionFailure

[2024-11-18T08:23:29.435Z] >>   pytest.assume(

[2024-11-18T08:23:29.435Z] AssertionError: Expect Succ: Op.flush succ rate 1.0, total: 2, average time: 99.7081

[2024-11-18T08:23:29.435Z] assert False

[2024-11-18T08:23:29.435Z] ================= 5 failed, 15 warnings in 1077.66s (0:17:57) ==================
XuanYang-cn commented 1 week ago

It's blocked by something else; the previous fix cannot solve the lock problem. Working on it. [screenshot]

zhuwenxing commented 6 days ago

/assign @XuanYang-cn /unassign

zhuwenxing commented 6 days ago

Seems like a consistently reproducible issue.
image tag: master-20241119-fceff6ed-amd64
failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/chaos-test-kafka-cron/detail/chaos-test-kafka-cron/18575/pipeline
log: artifacts-standalone-pod-kill-18575-server-logs.tar.gz

XuanYang-cn commented 5 days ago

/assign @zhuwenxing /unassign
Please help verify.

zhuwenxing commented 4 days ago

Not reproduced in master-20241122-cfa1f1f1-amd64.
success job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/chaos-test-cron/detail/chaos-test-cron/19557/pipeline/356

[screenshot]