milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0

[Bug]: After the query node is killed and restarted, a situation would occur where full-text search calls are successful but return empty results for a period of time. #37364

Open zhuwenxing opened 3 weeks ago

zhuwenxing commented 3 weeks ago

Is there an existing issue for this?

Environment

- Milvus version:master-20241031-4bac2eb1-amd64
- Deployment mode(standalone or cluster):cluster
- MQ type(rocksmq, pulsar or kafka):    
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

An error was reported at the beginning of the recovery process:

[2024-11-01 14:47:50,420] wxzhu-macbook-pro/ERROR/pymilvus.decorators: RPC error: [search], <MilvusException: (code=901, message=stack trace: /workspace/source/pkg/tracer/stack_trace.go:51 github.com/milvus-io/milvus/pkg/tracer.StackTrace
/workspace/source/internal/util/grpcclient/client.go:555 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).Call
/workspace/source/internal/util/grpcclient/client.go:569 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).ReCall
/workspace/source/internal/distributed/querynode/client/client.go:90 github.com/milvus-io/milvus/internal/distributed/querynode/client.wrapGrpcCall[...]
/workspace/source/internal/distributed/querynode/client/client.go:202 github.com/milvus-io/milvus/internal/distributed/querynode/client.(*Client).Search
/workspace/source/internal/proxy/task_search.go:777 github.com/milvus-io/milvus/internal/proxy.(*searchTask).searchShard
/workspace/source/internal/proxy/lb_policy.go:178 github.com/milvus-io/milvus/internal/proxy.(*LBPolicyImpl).ExecuteWithRetry.func1
/workspace/source/pkg/util/retry/retry.go:44 github.com/milvus-io/milvus/pkg/util/retry.Do
/workspace/source/internal/proxy/lb_policy.go:147 github.com/milvus-io/milvus/internal/proxy.(*LBPolicyImpl).ExecuteWithRetry
/workspace/source/internal/proxy/lb_policy.go:215 github.com/milvus-io/milvus/internal/proxy.(*LBPolicyImpl).Execute.func2: node not found)>, <Time:{'RPC start': '2024-11-01 14:47:26.117866', 'RPC error': '2024-11-01 14:47:50.418723'}>

An error occurred in the intermediate phase:

[2024-11-01 14:47:50,485] wxzhu-macbook-pro/ERROR/pymilvus.decorators: RPC error: [search], <MilvusException: (code=503, message=failed to search: segment lacks[segment=453561308713263304]: channel not available[channel=full-text-search-benchmark-rootcoord-dml_15_453561308693329552v0])>, <Time:{'RPC start': '2024-11-01 14:47:50.443547', 'RPC error': '2024-11-01 14:47:50.485016'}>

Empty results are returned:

[2024-11-01 14:48:08,610] wxzhu-macbook-pro/INFO/milvus_locust: Search result: []
[2024-11-01 14:48:08,612] wxzhu-macbook-pro/ERROR/milvus_locust: Search error: Search result is empty
[2024-11-01 14:48:08,614] wxzhu-macbook-pro/INFO/milvus_locust: Search result: []
[2024-11-01 14:48:08,615] wxzhu-macbook-pro/ERROR/milvus_locust: Search error: Search result is empty
[2024-11-01 14:48:08,617] wxzhu-macbook-pro/INFO/milvus_locust: Search result: []
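
For context, the empty-result errors above come from the benchmark client checking whether a full-text search returns any hits. Below is a minimal sketch of such a probe, assuming pymilvus's `MilvusClient` with a hypothetical URI, collection, and field names (the collection is assumed to have a BM25 function producing the sparse vector field used as `anns_field`); this is illustrative, not the actual locust test.

```python
from pymilvus import MilvusClient

# Hypothetical connection and names; the real benchmark runs under locust.
client = MilvusClient(uri="http://localhost:19530")

def probe_full_text_search(query_text: str) -> None:
    # Full-text search: raw text is passed as the query and the server
    # converts it to a sparse embedding via the collection's BM25 function.
    results = client.search(
        collection_name="full_text_search_benchmark",  # assumed name
        data=[query_text],
        anns_field="sparse",                            # assumed BM25 output field
        limit=10,
        output_fields=["id"],
    )
    hits = results[0]
    if not hits:
        # The condition logged above as "Search error: Search result is empty".
        print("Search error: Search result is empty")
    else:
        print(f"Search result: {hits}")
```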

Normal results are returned:

[2024-11-01 14:48:19,467] wxzhu-macbook-pro/INFO/milvus_locust: Search result: ["id: 1730366895102733, distance: 39.030517578125, entity: {'id': 1730366895102733}", "id: 1730366870852086, distance: 39.030517578125, entity: {'id': 1730366870852086}", "id: 1730366507504258, distance: 37.625244140625, entity: {'id': 1730366507504258}", "id: 1730366495122008, distance: 37.625244140625, entity: {'id': 1730366495122008}", "id: 1730366354188392, distance: 37.56460189819336, entity: {'id': 1730366354188392}", "id: 1730366347409246, distance: 37.56460189819336, entity: {'id': 1730366347409246}", "id: 1730365783165875, distance: 37.06922149658203, entity: {'id': 1730365783165875}", "id: 1730365781460345, distance: 37.06922149658203, entity: {'id': 1730365781460345}", "id: 1730366836454170, distance: 36.67963409423828, entity: {'id': 1730366836454170}", "id: 1730366819561076, distance: 36.67963409423828, entity: {'id': 1730366819561076}"]
[2024-11-01 14:48:19,469] wxzhu-macbook-pro/INFO/milvus_locust: Search result: ["id: 1730366090137094, distance: 37.22340774536133, entity: {'id': 1730366090137094}", "id: 1730366082491995, distance: 37.22340774536133, entity: {'id': 1730366082491995}", "id: 1730366504037575, distance: 36.07592010498047, entity: {'id': 1730366504037575}", "id: 1730366490796694, distance: 36.07592010498047, entity: {'id': 1730366490796694}", "id: 1730366443794068, distance: 35.62309265136719, entity: {'id': 1730366443794068}", "id: 1730366460226998, distance: 35.62309265136719, entity: {'id': 1730366460226998}", "id: 1730366525693177, distance: 34.25982666015625, entity: {'id': 1730366525693177}", "id: 1730366537074459, distance: 34.25982666015625, entity: {'id': 1730366537074459}", "id: 1730366084180271, distance: 34.14676284790039, entity: {'id': 1730366084180271}", "id: 1730366075426621, distance: 34.14676284790039, entity: {'id': 1730366075426621}"]

Expected Behavior

No response

Steps To Reproduce

No response

Milvus Log

cluster: 4am, namespace: chaos-testing, pod info:

full-text-search-benchmark-etcd-0                                1/1     Running       0                4d3h    10.104.18.37    4am-node25   <none>           <none>
full-text-search-benchmark-etcd-1                                1/1     Running       0                4d3h    10.104.33.20    4am-node36   <none>           <none>
full-text-search-benchmark-etcd-2                                1/1     Running       0                4d3h    10.104.15.165   4am-node20   <none>           <none>
full-text-search-benchmark-milvus-datanode-66f75b7679-2wwrp      1/1     Running       0                21h     10.104.1.219    4am-node10   <none>           <none>
full-text-search-benchmark-milvus-datanode-66f75b7679-cpgvc      1/1     Running       0                21h     10.104.9.197    4am-node14   <none>           <none>
full-text-search-benchmark-milvus-datanode-66f75b7679-x5scz      1/1     Running       0                21h     10.104.34.145   4am-node37   <none>           <none>
full-text-search-benchmark-milvus-indexnode-69894445dc-7tggc     1/1     Running       0                21h     10.104.1.220    4am-node10   <none>           <none>
full-text-search-benchmark-milvus-indexnode-69894445dc-kpszz     1/1     Running       0                21h     10.104.9.193    4am-node14   <none>           <none>
full-text-search-benchmark-milvus-indexnode-69894445dc-mqw72     1/1     Running       0                21h     10.104.20.247   4am-node22   <none>           <none>
full-text-search-benchmark-milvus-mixcoord-c97dfcd5f-ktmw2       1/1     Running       0                21h     10.104.34.146   4am-node37   <none>           <none>
full-text-search-benchmark-milvus-proxy-7fd49c64fc-582pp         1/1     Running       0                21h     10.104.34.147   4am-node37   <none>           <none>
full-text-search-benchmark-milvus-querynode-1-ff9967b57-wz96r    1/1     Running       6 (36m ago)      21h     10.104.13.193   4am-node16   <none>           <none>
full-text-search-benchmark-minio-0                               1/1     Running       0                4d3h    10.104.18.38    4am-node25   <none>           <none>
full-text-search-benchmark-minio-1                               1/1     Running       0                4d3h    10.104.33.21    4am-node36   <none>           <none>
full-text-search-benchmark-minio-2                               1/1     Running       0                4d3h    10.104.15.166   4am-node20   <none>           <none>
full-text-search-benchmark-minio-3                               1/1     Running       0                4d3h    10.104.23.96    4am-node27   <none>           <none>
full-text-search-benchmark-pulsar-bookie-0                       1/1     Running       0                4d3h    10.104.18.40    4am-node25   <none>           <none>
full-text-search-benchmark-pulsar-bookie-1                       1/1     Running       0                4d3h    10.104.33.24    4am-node36   <none>           <none>
full-text-search-benchmark-pulsar-bookie-init-cvwvs              0/1     Completed     0                4d3h    10.104.18.28    4am-node25   <none>           <none>
full-text-search-benchmark-pulsar-broker-0                       1/1     Running       0                4d3h    10.104.18.33    4am-node25   <none>           <none>
full-text-search-benchmark-pulsar-proxy-0                        1/1     Running       0                4d3h    10.104.18.31    4am-node25   <none>           <none>
full-text-search-benchmark-pulsar-pulsar-init-stw6s              0/1     Completed     0                4d3h    10.104.18.30    4am-node25   <none>           <none>
full-text-search-benchmark-pulsar-zookeeper-0                    1/1     Running       0                4d3h    10.104.18.39    4am-node25   <none>           <none>

Anything else?

No response

zhuwenxing commented 3 weeks ago

This issue also occurred on text match queries.

zhuwenxing commented 3 weeks ago

This looks like a general issue: I tried querying with a range of IDs, and that also returns empty results for a short period of time.
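
For illustration, here is a minimal sketch of the two scalar queries being discussed, assuming pymilvus's `MilvusClient` with hypothetical collection and field names (the text match variant requires the VARCHAR field to be created with an analyzer and match enabled). During the recovery window both calls are reported to succeed but return nothing.

```python
from pymilvus import MilvusClient

client = MilvusClient(uri="http://localhost:19530")  # hypothetical URI

# Query by a range of primary-key ids (the "range ID" query above).
rows = client.query(
    collection_name="full_text_search_benchmark",  # assumed name
    filter="id >= 1730366000000000 and id <= 1730367000000000",
    output_fields=["id"],
    limit=100,
)

# Text match query (the case mentioned in the previous comment).
rows_tm = client.query(
    collection_name="full_text_search_benchmark",
    filter='TEXT_MATCH(text, "vector")',  # "text" field name is an assumption
    output_fields=["id"],
    limit=100,
)

# Both succeed but may be empty while the restarted querynode recovers.
print(len(rows), len(rows_tm))
```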

xiaofan-luan commented 3 weeks ago

@aoiasd

Are we loading stats asynchronously? I think we need to wait until all stats are loaded before we can serve.

aoiasd commented 3 weeks ago

> @aoiasd
>
> Are we loading stats asynchronously? I think we need to wait until all stats are loaded before we can serve.

No, loading stats is part of loading a segment. If a segment is in the target, its stats should be fully loaded and it can serve.

zhuwenxing commented 2 weeks ago

Image tag: master-20241107-f813fb45-amd64. Using the image from 11/07, the test was conducted in the same way, and the behavior changed:

  1. The entire process did not fail.
  2. However, during the period when the pod was killed, search latency increased from 300 ms to 1 minute.

In both the 11-01 test and this one, no timeout parameter was set on the search call, and it is unclear why the search reported an error immediately in the 11-01 test while this time it was only delayed.
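
As a side note on the timeout observation above: pymilvus search calls accept a client-side timeout, so a cap could make the slow-recovery case fail fast instead of blocking for up to a minute. A minimal sketch, with hypothetical URI and names:

```python
from pymilvus import MilvusClient

client = MilvusClient(uri="http://localhost:19530")  # hypothetical URI

# With an explicit timeout, a search that is stuck waiting for segment
# recovery raises instead of silently taking up to a minute.
results = client.search(
    collection_name="full_text_search_benchmark",  # assumed name
    data=["example query"],
    anns_field="sparse",                            # assumed BM25 output field
    limit=10,
    timeout=10.0,  # seconds
)
```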

[screenshot: search latency]

grafana link

xiaofan-luan commented 2 weeks ago

> Image tag: master-20241107-f813fb45-amd64. Using the image from 11/07, the test was conducted in the same way, and the behavior changed:
>
>   1. The entire process did not fail.
>   2. However, during the period when the pod was killed, search latency increased from 300 ms to 1 minute.
>
> In both the 11-01 test and this one, no timeout parameter was set on the search call, and it is unclear why the search reported an error immediately in the 11-01 test while this time it was only delayed.

I think this is reasonable behaviour: Milvus retries and waits for the segments to be reloaded.

The only remaining question is why the service recovers more slowly than expected.

liliu-z commented 2 weeks ago

/assign @zhuwenxing
/unassign @czs007

yanliang567 commented 2 weeks ago

@liliu-z I think we still need a fix for the slow recovery issue. /assign @aoiasd could you please take a look?

/unassign @zhuwenxing

aoiasd commented 1 week ago

The querynode reloads all segments in just 15 s, but the search tasks and pipeline wait another 30 s before they start running.

aoiasd commented 1 week ago

This is not a problem unique to BM25 or text match; it affects all querynode recovery.

liliu-z commented 1 week ago

Knowhere latency increased, which suggests high CPU usage during the loading process, similar to https://github.com/milvus-io/milvus/issues/37796.

aoiasd commented 1 week ago

[screenshot] Search latency on the worker increased while the delegator works well; this may be related to loading L0 segments on the worker.

yanliang567 commented 3 days ago

remoteload was reverted; please help rerun the test and update the results. @zhuwenxing

/assign @zhuwenxing