milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0

[Bug]: [benchmark][cluster] queryNode, proxy OOM during concurrent query with large retrieve #37157

Open wangting0128 opened 1 month ago

wangting0128 commented 1 month ago

Is there an existing issue for this?

Environment

- Milvus version:2.4-20241021-1dcc393e-amd64
- Deployment mode(standalone or cluster):cluster
- MQ type(rocksmq, pulsar or kafka):pulsar    
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

server:

NAME                                                              READY   STATUS      RESTARTS        AGE     IP              NODE         NOMINATED NODE   READINESS GATES
query-perf-compare-etcd-0                                         1/1     Running     0               3d6h    10.104.23.116   4am-node27   <none>           <none>
query-perf-compare-etcd-1                                         1/1     Running     0               3d6h    10.104.32.162   4am-node39   <none>           <none>
query-perf-compare-etcd-2                                         1/1     Running     0               3d6h    10.104.17.52    4am-node23   <none>           <none>
query-perf-compare-milvus-datanode-86fcd77c8-l5jqr                1/1     Running     0               3d6h    10.104.34.85    4am-node37   <none>           <none>
query-perf-compare-milvus-indexnode-7ccf5b5cc-jr7vh               1/1     Running     0               3d6h    10.104.34.86    4am-node37   <none>           <none>
query-perf-compare-milvus-indexnode-7ccf5b5cc-m5t7w               1/1     Running     0               3d6h    10.104.18.243   4am-node25   <none>           <none>
query-perf-compare-milvus-indexnode-7ccf5b5cc-mf9mx               1/1     Running     1 (4h5m ago)    3d6h    10.104.15.174   4am-node20   <none>           <none>
query-perf-compare-milvus-indexnode-7ccf5b5cc-v6whl               1/1     Running     1 (4h9m ago)    3d6h    10.104.16.109   4am-node21   <none>           <none>
query-perf-compare-milvus-mixcoord-677b55b59-zrxzh                1/1     Running     0               3d6h    10.104.34.88    4am-node37   <none>           <none>
query-perf-compare-milvus-proxy-b8b4b5595-52cpl                   1/1     Running     1 (3h37m ago)   2d2h    10.104.20.219   4am-node22   <none>           <none>
query-perf-compare-milvus-querynode-f65899674-88b62               1/1     Running     1 (3h36m ago)   3d3h    10.104.18.49    4am-node25   <none>           <none>
query-perf-compare-milvus-querynode-f65899674-hpcj6               1/1     Running     0               3d3h    10.104.20.4     4am-node22   <none>           <none>
query-perf-compare-milvus-querynode-f65899674-k9dk9               1/1     Running     0               3d3h    10.104.32.50    4am-node39   <none>           <none>
query-perf-compare-minio-0                                        1/1     Running     0               4d      10.104.23.9     4am-node27   <none>           <none>
query-perf-compare-minio-1                                        1/1     Running     0               4d      10.104.32.100   4am-node39   <none>           <none>
query-perf-compare-minio-2                                        1/1     Running     0               4d      10.104.17.143   4am-node23   <none>           <none>
query-perf-compare-minio-3                                        1/1     Running     0               4d      10.104.27.247   4am-node31   <none>           <none>
query-perf-compare-pulsar-bookie-0                                1/1     Running     0               4d      10.104.23.8     4am-node27   <none>           <none>
query-perf-compare-pulsar-bookie-1                                1/1     Running     0               4d      10.104.27.245   4am-node31   <none>           <none>
query-perf-compare-pulsar-bookie-2                                1/1     Running     0               4d      10.104.32.104   4am-node39   <none>           <none>
query-perf-compare-pulsar-broker-0                                1/1     Running     0               4d      10.104.1.21     4am-node10   <none>           <none>
query-perf-compare-pulsar-proxy-0                                 1/1     Running     0               4d      10.104.14.177   4am-node18   <none>           <none>
query-perf-compare-pulsar-recovery-0                              1/1     Running     0               4d      10.104.5.110    4am-node12   <none>           <none>
query-perf-compare-pulsar-zookeeper-0                             1/1     Running     0               4d      10.104.23.2     4am-node27   <none>           <none>
query-perf-compare-pulsar-zookeeper-1                             1/1     Running     0               4d      10.104.32.106   4am-node39   <none>           <none>
query-perf-compare-pulsar-zookeeper-2                             1/1     Running     0               4d      10.104.17.147   4am-node23   <none>           <none>

describe pod

4am % kubectl get pod -o wide -n qa-milvus|grep -E "query-perf-compare|NAME"|grep ago|awk '{print $1}'|while read line; do echo $line && kubectl describe pod $line -n qa-milvus|grep Reason; done
query-perf-compare-milvus-indexnode-7ccf5b5cc-mf9mx
      Reason:       OOMKilled
query-perf-compare-milvus-indexnode-7ccf5b5cc-v6whl
      Reason:       OOMKilled
query-perf-compare-milvus-proxy-b8b4b5595-52cpl
      Reason:       OOMKilled
query-perf-compare-milvus-querynode-f65899674-88b62
      Reason:       OOMKilled

pod monitor: queryNode

[screenshots: Screenshot 2024-10-25 18 39 43, Screenshot 2024-10-25 18 42 37]

proxy

[screenshot: Screenshot 2024-10-25 18 40 12]

client log:

[2024-10-25 06:53:22.1025][    INFO] - Start reading yaml file: example/config.yaml (utils.go:95:ParserInputConfig)
[2024-10-25 06:53:22.1025][    INFO] - Test config after parsing: {"URI":"10.104.20.219:19530","OpenTls":false,"Username":"root","Password":"Milvus","OutputFormat":"json","OutputFile":"/tmp/bench.log","CaseParams":{"dataset_params":{"metric_type":"L2","dim":768,"vector_field":"float_vector"},"collection_params":{"collection_name":"query_perf_20m"},"index_params":{"index_type":"IVF_SQ8"},"concurrent_params":{"concurrent_number":50,"during_time":120,"interval":20},"concurrent_tasks":{"query":{"type":"query","weight":1,"params":{"expr":"","output_fields":["float_vector"],"timeout":60,"limit":16384,"random_range":[0,0],"custom_expr":" !(int64_inverted < {0}) ","custom_range":[1000000,2000000]}}}}}
 (utils.go:185:EncodeJsonString)
[2024-10-25 06:53:22.1025][    INFO] - Concurrent tasks: {query: 50} (utils.go:281:CountRequestNum)
[2024-10-25 06:53:42.1025][    INFO] - Name                                     # reqs          # fails |               Avg              Min              Max           Median             TP99 |             req/s       failures/s  (run.go:489:printStats)
[2024-10-25 06:53:42.1025][    INFO] - go query                                     22         0(0.00%) |          5641.406         2819.668         7876.073         5597.871         7876.073 |              1.10             0.00  (run.go:500:printStats)
[2024-10-25 06:54:02.1025][    INFO] - Name                                     # reqs          # fails |               Avg              Min              Max           Median             TP99 |             req/s       failures/s  (run.go:489:printStats)
[2024-10-25 06:54:02.1025][    INFO] - go query                                     32         0(0.00%) |         35083.023        30399.286        39240.032        35234.827        39240.032 |              0.50             0.00  (run.go:500:printStats)
[2024-10-25 06:54:17.1025][   ERROR] - stack trace: /workspace/source/pkg/tracer/stack_trace.go:51 github.com/milvus-io/milvus/pkg/tracer.StackTrace
/workspace/source/internal/util/grpcclient/client.go:555 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).Call
/workspace/source/internal/util/grpcclient/client.go:569 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).ReCall
/workspace/source/internal/distributed/querynode/client/client.go:90 github.com/milvus-io/milvus/internal/distributed/querynode/client.wrapGrpcCall[...]
/workspace/source/internal/distributed/querynode/client/client.go:215 github.com/milvus-io/milvus/internal/distributed/querynode/client.(*Client).Query
/workspace/source/internal/proxy/task_query.go:554 github.com/milvus-io/milvus/internal/proxy.(*queryTask).queryShard
/workspace/source/internal/proxy/lb_policy.go:178 github.com/milvus-io/milvus/internal/proxy.(*LBPolicyImpl).ExecuteWithRetry.func1
/workspace/source/pkg/util/retry/retry.go:44 github.com/milvus-io/milvus/pkg/util/retry.Do
/workspace/source/internal/proxy/lb_policy.go:147 github.com/milvus-io/milvus/internal/proxy.(*LBPolicyImpl).ExecuteWithRetry
/workspace/source/internal/proxy/lb_policy.go:215 github.com/milvus-io/milvus/internal/proxy.(*LBPolicyImpl).Execute.func2: rpc error: code = Canceled desc = grpc: the client connection is closing (run.go:333:func1)
[2024-10-25 06:54:17.1025][   ERROR] - stack trace: /workspace/source/pkg/tracer/stack_trace.go:51 github.com/milvus-io/milvus/pkg/tracer.StackTrace
/workspace/source/internal/util/grpcclient/client.go:555 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).Call
/workspace/source/internal/util/grpcclient/client.go:569 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).ReCall
/workspace/source/internal/distributed/querynode/client/client.go:90 github.com/milvus-io/milvus/internal/distributed/querynode/client.wrapGrpcCall[...]
/workspace/source/internal/distributed/querynode/client/client.go:215 github.com/milvus-io/milvus/internal/distributed/querynode/client.(*Client).Query
/workspace/source/internal/proxy/task_query.go:554 github.com/milvus-io/milvus/internal/proxy.(*queryTask).queryShard
/workspace/source/internal/proxy/lb_policy.go:178 github.com/milvus-io/milvus/internal/proxy.(*LBPolicyImpl).ExecuteWithRetry.func1
/workspace/source/pkg/util/retry/retry.go:44 github.com/milvus-io/milvus/pkg/util/retry.Do
/workspace/source/internal/proxy/lb_policy.go:147 github.com/milvus-io/milvus/internal/proxy.(*LBPolicyImpl).ExecuteWithRetry
/workspace/source/internal/proxy/lb_policy.go:215 github.com/milvus-io/milvus/internal/proxy.(*LBPolicyImpl).Execute.func2: rpc error: code = Canceled desc = grpc: the client connection is closing (run.go:333:func1)

Expected Behavior

No response

Steps To Reproduce

1. a collection with fields:["id","float_vector","varchar_uri","varchar_noindex","varchar_inverted","int64_noindex","int64_inverted"]
2. index
  - HNSW: float_vector
  - INVERTED: varchar_inverted, int64_inverted
3. concurrent query (a minimal client sketch follows this list)
  - concurrent number: 50
  - query params: "output_fields":["float_vector"], "timeout":60, "limit":16384, expr: " !(int64_inverted < {random range [1000000,2000000]}) "
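
A minimal pymilvus sketch of the query shape above. This is an assumption for illustration, not the Go benchmark client from the logs: the URI and collection name are taken from the client config, and the filter bound is drawn from the same random range used by the benchmark.

```python
from concurrent.futures import ThreadPoolExecutor
import random

from pymilvus import MilvusClient

# Proxy address and collection name from the benchmark config above.
client = MilvusClient(uri="http://10.104.20.219:19530")

def one_query():
    # Same shape as the benchmark task: large limit, full vectors in the output.
    bound = random.randint(1_000_000, 2_000_000)
    return client.query(
        collection_name="query_perf_20m",
        filter=f"!(int64_inverted < {bound})",
        output_fields=["float_vector"],
        limit=16384,
        timeout=60,
    )

# 50 concurrent workers, mirroring concurrent_number in the benchmark config.
with ThreadPoolExecutor(max_workers=50) as pool:
    for result in pool.map(lambda _: one_query(), range(50)):
        print(len(result))
```

Each returned row carries the full 768-dim vector (~3 KB), so a single query with limit=16384 already moves tens of MB of data.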

Milvus Log

No response

Anything else?

No response

xiaofan-luan commented 1 month ago

I think this is expected if you have many concurrent reads and the responses are large.

So far there is nothing we can do to handle this.

xiaofan-luan commented 1 month ago

I don't think this is an important case to handle; let's keep this as is.

wangting0128 commented 1 month ago

I think this is expected if you have many concurrent reads and the responses are large.

So far there is nothing we can do to handle this.

Got it! Closed it now.

xiaofan-luan commented 1 month ago

Let's keep it open and leave it to @zhagnlu @MrPresent-Han.

I can't think of an easy way to fix that. Maybe changing the reduce function can solve the problem. Any thoughts?

xiaofan-luan commented 1 month ago

Anyway, this is not a critical issue.

yanliang567 commented 1 month ago

/assign @zhagnlu /unassign

zhagnlu commented 1 month ago

Let's keep it open and leave it to @zhagnlu @MrPresent-Han.

I can't think of an easy way to fix that. Maybe changing the reduce function can solve the problem. Any thoughts?

Yes, I think if the reduce can be done across multiple segments of the same querynode, maybe we can spill the temp result to a file to avoid the OOM?

xiaofan-luan commented 4 weeks ago

Let's keep it open and leave it to @zhagnlu @MrPresent-Han. I can't think of an easy way to fix that. Maybe changing the reduce function can solve the problem. Any thoughts?

Yes, I think if the reduce can be done across multiple segments of the same querynode, maybe we can spill the temp result to a file to avoid the OOM?

I think the problem is about the streaming reduce on the proxy side.

If we have 100 querynodes and the proxy receives a 1000-topk result from each of them, that is still going to be huge if each row is 1 KB.
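
For a rough sense of scale with this benchmark's settings, a back-of-envelope estimate (assumed, not measured: float32 vectors, every querynode returns the full limit, and the proxy buffers all partial results before reducing; dim, limit, and concurrency come from the client config above, the querynode count from the pod list):

```python
# Back-of-envelope memory estimate for this benchmark's query shape.
# Assumptions (not measured): float32 vectors, every querynode returns the
# full limit, and the proxy buffers all partial results before reducing.
dim = 768
bytes_per_row = dim * 4                 # 768-dim float32 vector ≈ 3 KiB, ignoring id/overhead
limit = 16384                           # rows requested per query
querynodes = 3                          # querynodes in this cluster
concurrency = 50                        # concurrent queries in the benchmark

per_querynode_response = limit * bytes_per_row                      # ≈ 48 MiB
buffered_on_proxy = per_querynode_response * querynodes * concurrency
print(f"{buffered_on_proxy / 2**30:.1f} GiB buffered on the proxy")  # ≈ 7.0 GiB
```

Serialization and copy overhead comes on top of that, which is consistent with the proxy OOM observed here.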