Closed: zhuwenxing closed this issue 4 days ago
/assign @XuanYang-cn
mark this as urgent
-> Flush took 1m57s. FlushTasks are queuing in the proxy's scheduler, and some DataNodes have long-running FlushChannels RPCs (36s), which might be the cause of the queuing.
inQueue time too long
is there a lock there?
this seems to be a critical problem
The only place that can block flush channels is here; is this the key place where the flush channel gets blocked?
I'm running tests, and tests are blocked by #37568
https://qa-jenkins.milvus.io/blue/organizations/jenkins/chaos-test-cron/detail/chaos-test-cron/19421/pipeline/355 did not reproduce it, but there are several long-running FlushChannel RPCs; looking into it.
I'll merge the log enhancements and see if this is reproducible.
Replacing the locks with metric- and log-instrumented ones to give it a go.
Caught a long-held lock in memoryCheck when the DN memory water level is very low.
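For reference, a minimal sketch of what such instrumentation might look like (hypothetical names, not the actual DataNode code): time both the wait for the lock and how long memoryCheck holds it, and log when either crosses a threshold.

```go
// Hypothetical sketch, not the actual DataNode code: time both the wait for
// the lock and how long memoryCheck holds it, and log when either crosses a
// threshold (1s here, purely as an example).
package main

import (
	"log"
	"sync"
	"time"
)

type bufferManager struct {
	mu sync.Mutex
}

func (m *bufferManager) memoryCheck() {
	waitStart := time.Now()
	m.mu.Lock()
	waited := time.Since(waitStart)

	holdStart := time.Now()
	defer func() {
		held := time.Since(holdStart)
		if waited > time.Second || held > time.Second {
			log.Printf("memoryCheck lock slow: waited=%v held=%v", waited, held)
		}
		m.mu.Unlock()
	}()

	// Stand-in for scanning/evicting buffers once the memory water level is hit.
	time.Sleep(50 * time.Millisecond)
}

func main() {
	m := &bufferManager{}
	m.memoryCheck()
}
```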
@XuanYang-cn 1s doesn't seem long enough? Maybe some queue is blocked?
@xiaofan-luan The metric_mutex has some concurrency issues and most metric results are zero; I'm working on fixing that.
Tracing shows the failed and long-running flushes are mostly blocked at FlushChannels; this is a recent test result. And there is no queuing in the DN. I believe it's waiting to acquire a lock, and I'm trying to catch it.
The above tests involving metric_mutex are all wrong: there are concurrency issues around RLock that may cause resource leaks and locks that are never released. After 2 days of fixes, the concurrency problem seems fixable only by introducing another lock :(
So I discarded the metric_mutex fixes, added the metrics in FlushChannel and memCheck, and will try to use the metrics directly; I'm giving it a last shot today.
Since this isn't reproducing lately, I suggest marking it non-critical. cc @zhuwenxing, correct me if I'm wrong.
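To illustrate the kind of bookkeeping race described above, a hypothetical sketch (meteredRWMutex is not the real metric_mutex): counting concurrent readers around RLock/RUnlock with a plain counter would race, while an atomic keeps the bookkeeping consistent without introducing a second lock.

```go
// Hypothetical sketch, not the real metric_mutex: counting concurrent readers
// around RLock/RUnlock with a plain int would race once readers run in
// parallel; an atomic counter keeps the bookkeeping consistent without
// introducing a second lock.
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

type meteredRWMutex struct {
	mu      sync.RWMutex
	readers atomic.Int64 // safe to update from many readers at once
}

func (m *meteredRWMutex) RLock() {
	m.mu.RLock()
	m.readers.Add(1)
}

func (m *meteredRWMutex) RUnlock() {
	m.readers.Add(-1)
	m.mu.RUnlock()
}

func main() {
	var m meteredRWMutex
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			m.RLock()
			defer m.RUnlock()
		}()
	}
	wg.Wait()
	fmt.Println("readers now:", m.readers.Load()) // 0 once every reader has exited
}
```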
We can reevaluate the urgency tomorrow. Let's get more clues first.
Not reproducing the long-running FlushChannels and Flush: https://qa-jenkins.milvus.io/blue/organizations/jenkins/chaos-test-cron/detail/chaos-test-cron/19478/pipeline https://qa-jenkins.milvus.io/blue/organizations/jenkins/chaos-test-cron/detail/chaos-test-cron/19479/pipeline
The locks of MemCheck and FlushChannel do have some contention.
Detailed reason: when the DN is under heavy workload (TimeTickLag reaches >= 3s), 100 channels
I'll make a patch to change the memoryCheck Lock to RLock, so it won't influence Flush any more. However, it might be the same for flushing a newly created collection, since RegisterChannel
takes the write Lock, which will be blocked by a long-running memoryCheck.
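A simplified sketch of the intended effect, under the assumption that memoryCheck only reads shared state (channelManager and the method names below are illustrative, not the actual DataNode types): with memoryCheck on RLock, a concurrent flush reader is no longer blocked, while RegisterChannel, which needs the write Lock, still has to wait.

```go
// Simplified sketch; channelManager and the method names are illustrative,
// not the actual DataNode types. Assuming memoryCheck only reads shared
// state, holding RLock lets a concurrent flush (another reader) proceed,
// while RegisterChannel still needs the write Lock and has to wait for a
// long-running memoryCheck.
package main

import (
	"fmt"
	"sync"
	"time"
)

type channelManager struct {
	mu       sync.RWMutex
	channels map[string]struct{}
}

func (c *channelManager) memoryCheck() {
	c.mu.RLock() // was a write Lock before the patch
	defer c.mu.RUnlock()
	time.Sleep(200 * time.Millisecond) // stand-in for a long scan over buffers
}

func (c *channelManager) flushChannels() {
	c.mu.RLock() // reader: runs concurrently with memoryCheck
	defer c.mu.RUnlock()
	fmt.Println("flush proceeded")
}

func (c *channelManager) registerChannel(name string) {
	c.mu.Lock() // writer: still blocked while memoryCheck holds RLock
	defer c.mu.Unlock()
	c.channels[name] = struct{}{}
	fmt.Println("channel registered")
}

func main() {
	c := &channelManager{channels: map[string]struct{}{}}
	go c.memoryCheck()
	time.Sleep(10 * time.Millisecond) // let memoryCheck acquire its RLock
	go c.flushChannels()
	time.Sleep(10 * time.Millisecond) // let flush acquire its RLock too
	c.registerChannel("ch1") // returns only after memoryCheck releases the RLock
}
```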
@XuanYang-cn is this ready to verify the issue now?
/assign @zhuwenxing /unassign Please help verify; it's not that easy to reproduce, please try multiple times. THX
Should be fixed by #37525, could you please verify? /assign @zhuwenxing
The key here is that we need to remove many locks. For example, the MemorySize of writeBuffer is locked by BufferData;
this needs to become a set of atomics or concurrent data structures.
@XuanYang-cn could you help on this?
It's always good to use a concurrent data structure unless a lock is truly necessary. And remember it is always dangerous to hold a lock while acquiring another lock or while making an RPC.
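As an illustration of that suggestion (the writeBuffer below is a hypothetical stand-in, not the actual Milvus type), the memory-size counter can be kept as an atomic so readers such as the memory-level check never need the buffer's mutex.

```go
// Illustration only; this writeBuffer is a hypothetical stand-in, not the
// actual Milvus type. Keeping the memory size in an atomic lets readers such
// as the memory-level check avoid the buffer's mutex entirely.
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

type writeBuffer struct {
	mu         sync.Mutex
	rows       [][]byte
	memorySize atomic.Int64 // updated alongside mu, readable without it
}

func (b *writeBuffer) BufferData(data []byte) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.rows = append(b.rows, data)
	b.memorySize.Add(int64(len(data)))
}

// MemorySize needs no lock, so callers like memoryCheck never contend with BufferData.
func (b *writeBuffer) MemorySize() int64 {
	return b.memorySize.Load()
}

func main() {
	b := &writeBuffer{}
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			b.BufferData(make([]byte, 128))
		}()
	}
	wg.Wait()
	fmt.Println("buffered bytes:", b.MemorySize()) // 512
}
```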
@XuanYang-cn
For further improvement, you can open an enhancement issue. I would like to close this issue since it has not been reproduced recently.
Oops, it still reproduced.
etcd pod kill chaos test
image tag master-20241118-f2a2fd68-amd64
failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/chaos-test-cron/detail/chaos-test-cron/19513/pipeline log: artifacts-etcd-pod-kill-19513-server-logs.tar.gz
[2024-11-18T08:23:29.435Z] FAILED testcases/test_concurrent_operation.py::TestOperations::test_operations[Checker__i43xi44m] - pytest_assume.plugin.FailedAssumption:
[2024-11-18T08:23:29.435Z] 1 Failed Assumptions:
[2024-11-18T08:23:29.435Z]
[2024-11-18T08:23:29.435Z] chaos_commons.py:124: AssumptionFailure
[2024-11-18T08:23:29.435Z] >> pytest.assume(
[2024-11-18T08:23:29.435Z] AssertionError: Expect Succ: Op.flush succ rate 1.0, total: 2, average time: 76.7162
[2024-11-18T08:23:29.435Z] assert False
[2024-11-18T08:23:29.435Z] FAILED testcases/test_concurrent_operation.py::TestOperations::test_operations[Checker__S8VFOPMg] - pytest_assume.plugin.FailedAssumption:
[2024-11-18T08:23:29.435Z] 1 Failed Assumptions:
[2024-11-18T08:23:29.435Z]
[2024-11-18T08:23:29.435Z] chaos_commons.py:124: AssumptionFailure
[2024-11-18T08:23:29.435Z] >> pytest.assume(
[2024-11-18T08:23:29.435Z] AssertionError: Expect Succ: Op.flush succ rate 1.0, total: 2, average time: 108.7892
[2024-11-18T08:23:29.435Z] assert False
[2024-11-18T08:23:29.435Z] FAILED testcases/test_concurrent_operation.py::TestOperations::test_operations[Checker__rntGoS3n] - pytest_assume.plugin.FailedAssumption:
[2024-11-18T08:23:29.435Z] 1 Failed Assumptions:
[2024-11-18T08:23:29.435Z]
[2024-11-18T08:23:29.435Z] chaos_commons.py:124: AssumptionFailure
[2024-11-18T08:23:29.435Z] >> pytest.assume(
[2024-11-18T08:23:29.435Z] AssertionError: Expect Succ: Op.flush succ rate 1.0, total: 2, average time: 100.2917
[2024-11-18T08:23:29.435Z] assert False
[2024-11-18T08:23:29.435Z] FAILED testcases/test_concurrent_operation.py::TestOperations::test_operations[Checker__IGV8i564] - pytest_assume.plugin.FailedAssumption:
[2024-11-18T08:23:29.435Z] 1 Failed Assumptions:
[2024-11-18T08:23:29.435Z]
[2024-11-18T08:23:29.435Z] chaos_commons.py:124: AssumptionFailure
[2024-11-18T08:23:29.435Z] >> pytest.assume(
[2024-11-18T08:23:29.435Z] AssertionError: Expect Succ: Op.flush succ rate 1.0, total: 2, average time: 97.3172
[2024-11-18T08:23:29.435Z] assert False
[2024-11-18T08:23:29.435Z] FAILED testcases/test_concurrent_operation.py::TestOperations::test_operations[Checker__Nh520hRa] - pytest_assume.plugin.FailedAssumption:
[2024-11-18T08:23:29.435Z] 1 Failed Assumptions:
[2024-11-18T08:23:29.435Z]
[2024-11-18T08:23:29.435Z] chaos_commons.py:124: AssumptionFailure
[2024-11-18T08:23:29.435Z] >> pytest.assume(
[2024-11-18T08:23:29.435Z] AssertionError: Expect Succ: Op.flush succ rate 1.0, total: 2, average time: 99.7081
[2024-11-18T08:23:29.435Z] assert False
[2024-11-18T08:23:29.435Z] ================= 5 failed, 15 warnings in 1077.66s (0:17:57) ==================
It's blocked by other locks; the previous fix doesn't solve the lock problem. Working on it.
/assign @XuanYang-cn /unassign
seems like a consistently reproducible issue
image tag master-20241119-fceff6ed-amd64
failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/chaos-test-kafka-cron/detail/chaos-test-kafka-cron/18575/pipeline
log:
artifacts-standalone-pod-kill-18575-server-logs.tar.gz
/assign @zhuwenxing /unassign Please help verify
not reproduced in master-20241122-cfa1f1f1-amd64 success job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/chaos-test-cron/detail/chaos-test-cron/19557/pipeline/356
Is there an existing issue for this?
Environment
Current Behavior
Expected Behavior
flush should be finished in 10 seconds
Steps To Reproduce
No response
Milvus Log
failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/chaos-test-cron/detail/chaos-test-cron/19388/pipeline log: artifacts-etcd-followers-pod-failure-19388-server-logs.tar.gz
Anything else?
This issue occurred after the etcd-follower pod failure, but that chaos did not affect the system, so I think the slow flush is caused by the system itself rather than by the chaos that occurred.