zhuwenxing opened this issue 3 weeks ago
This issue also occurs with text match queries.
This looks like a general issue: I tried querying with an ID range filter, and it also returned empty results for a short period of time.
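For reference, a minimal sketch of the kind of ID-range query that hits this (pymilvus; the collection and field names below are assumptions for illustration, not taken from the actual chaos test):

```python
# Sketch only; "chaos_test_collection" and the "id" field are hypothetical names.
from pymilvus import connections, Collection

connections.connect(host="127.0.0.1", port="19530")
collection = Collection("chaos_test_collection")  # hypothetical collection

# Query with an ID range filter; right after a querynode pod is killed and
# restarted, this can briefly return an empty list even though the data exists.
results = collection.query(
    expr="id >= 0 and id < 1000",
    output_fields=["id"],
)
print(len(results))  # expected > 0; observed 0 for a short window during recovery
```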
@aoiasd
are we loading stats asynchronously? I think we need to wait for all stats to be loaded before we can serve
> @aoiasd
> are we loading stats asynchronously? I think we need to wait for all stats to be loaded before we can serve
No, loading stats is part of loading the segment. If a segment is in the target, its stats should be fully loaded and it should be able to serve.
Image tag: master-20241107-f813fb45-amd64. Using the 11/07 image tag, the test was run in the same way, and the behavior changed to:
- The entire process did not fail
- However, during the time period when the pod was killed, the search latency increased from 300ms to 1 minute.
Neither the 11-01 test nor this one set a timeout parameter on the search interface, so it is unclear why search failed with an error immediately in the 11-01 test while this time it only became slow.
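For context, pymilvus does accept an optional `timeout` argument on search; a hedged sketch of setting it explicitly (collection and field names are assumptions, not from the test code):

```python
# Sketch of a search call with an explicit client-side timeout.
from pymilvus import connections, Collection

connections.connect(host="127.0.0.1", port="19530")
collection = Collection("chaos_test_collection")  # hypothetical

res = collection.search(
    data=[[0.1] * 128],                  # hypothetical 128-dim query vector
    anns_field="embedding",              # hypothetical vector field name
    param={"metric_type": "L2", "params": {"nprobe": 10}},
    limit=10,
    timeout=10,  # seconds; without this, the call blocks while the server retries
)
```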
I think this is reasonable behaviour: Milvus retries and waits for the segments to be reloaded.
The only question is why the service recovers more slowly than expected.
/assign @zhuwenxing
/unassign @czs007
@liliu-z I think we still need a fix for the slow recovery issue. /assign @aoiasd, could you please take a look?
/unassign @zhuwenxing
QueryNode reloads all segments in just 15s, but the search task and pipeline wait another 30s before they start running.
This is not a problem unique to BM25 or text match; it affects all QueryNode recovery.
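A rough way to measure that gap from the client side is to poll search after killing the pod until it succeeds again; a sketch (all names are assumptions for illustration):

```python
# Measure how long search stays unavailable/slow after the querynode pod is killed.
import time
from pymilvus import connections, Collection

connections.connect(host="127.0.0.1", port="19530")
collection = Collection("chaos_test_collection")  # hypothetical

start = time.time()
while True:
    try:
        collection.search(
            data=[[0.1] * 128],              # hypothetical query vector
            anns_field="embedding",          # hypothetical vector field
            param={"metric_type": "L2", "params": {"nprobe": 10}},
            limit=1,
            timeout=5,
        )
        break  # first successful search after recovery
    except Exception:
        time.sleep(1)
print(f"search recovered after {time.time() - start:.1f}s")
```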
Knowhere latency increased, which indicates high CPU usage during the loading process, similar to https://github.com/milvus-io/milvus/issues/37796
Search latency on the worker increased while the delegator works fine; this may be related to loading L0 segments on the worker.
remoteload was reverted; please rerun the test and update the results. @zhuwenxing
/assign @zhuwenxing
Is there an existing issue for this?
Environment
Current Behavior
- An error was reported at the beginning of the recovery process
- An error occurred in the intermediate phase
- Empty results were returned
- Normal results were returned
Expected Behavior
No response
Steps To Reproduce
No response
Milvus Log
cluster: 4am, ns: chaos-testing, pod info
Anything else?
No response