/assign @chyezh PTAL
[2024/09/05 09:00:55.774 +00:00] [INFO] [datacoord/services.go:641] ["DropVChannel plan to remove"] [traceID=0254935119240ad2ea375eecacf7dad4] [channel=by-dev-rootcoord-dml_0_452336160083542224v0]
[2024/09/05 09:00:55.775 +00:00] [WARN] [datacoord/services.go:644] ["DropVChannel failed to ReleaseAndRemove"] [traceID=0254935119240ad2ea375eecacf7dad4] [channel=by-dev-rootcoord-dml_0_452336160083542224v0] [error="fail to find matching nodeID: 13 with channelName: by-dev-rootcoord-dml_0_452336160083542224v0"]
/unassign
@zhuwenxing please verify it with the latest master image.
@chyezh
All pods become healthy again after the pod-kill chaos test, but many test cases fail because load and flush operations fail after recovery.
failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/chaos-test-straming-node-cron/detail/chaos-test-straming-node-cron/8/pipeline
log: artifacts-streamingnode-pod-kill-8-server-logs.tar.gz
There's an error-handling bug in the Txn handling of the streaming service, which causes a permanent deadlock:
	defer func() {
		if err != nil {
			txnSession.AddNewMessageFail()
		}
		// perform keepalive for the transaction session if append success.
		txnSession.AddNewMessageDoneAndKeepalive(msg.TimeTick())
	}()
verified and fixed in master-20241008-ca9842e4-amd64
Is there an existing issue for this?
Environment
Current Behavior
Expected Behavior
No response
Steps To Reproduce
No response
Milvus Log
failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/chaos-test-straming-node-cron/detail/chaos-test-straming-node-cron/3/pipeline
log: artifacts-streamingnode-pod-kill-3-server-logs.tar.gz
pod info
Anything else?
No response