milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0

[Bug]: [fake-128cu] proxy pod oomkilled and some query requests StatusCode.UNAVAILABLE #29011

Closed ThreadDao closed 10 months ago

ThreadDao commented 10 months ago

Is there an existing issue for this?

Environment

- Milvus version: v2.2.15
- Deployment mode(standalone or cluster): cluster
- MQ type(rocksmq, pulsar or kafka):  pulsar   
- SDK version(e.g. pymilvus v2.0.0rc2): pymilvus 2.2.17
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

  1. Release the existing collection fouram_XepQhvAI, then load it; the load took 2168.9064s. Segment info:
    'segment_counts': 141
    'segment_total_vectors': 57745408
  2. Run concurrent insert + delete + query + search. Some query requests failed with StatusCode.UNAVAILABLE, and both proxy pods were OOMKilled. The proxy resource config is:
    Limits:
      cpu:     4
      memory:  4Gi
    Requests:
      cpu:      4
      memory:   4Gi

Query failure log:

[2023-12-05 20:21:18,038 - ERROR - fouram]: RPC error: [query], <MilvusException: (code=<bound method _MultiThreadedRendezvous.code of <_MultiThreadedRendezvous of RPC that terminated with:
        status = StatusCode.UNAVAILABLE
        details = "Socket closed"
        debug_error_string = "UNKNOWN:Error received from peer  {grpc_message:"Socket closed", grpc_status:14, created_time:"2023-12-05T20:15:57.52638993+00:00"}"
>>, message=Retry timeout: 1200s, message=Socket closed)>, <Time:{'RPC start': '2023-12-05 19:56:53.384343', 'RPC error': '2023-12-05 20:21:18.038669'}> (decorators.py:126)
[2023-12-05 20:21:18,040 - ERROR - fouram]: (api_response) : <MilvusException: (code=<bound method _MultiThreadedRendezvous.code of <_MultiThreadedRendezvous of RPC that terminated with:
        status = StatusCode.UNAVAILABLE
        details = "Socket closed"
        debug_error_string = "UNKNOWN:Error received from peer  {grpc_message:"Socket closed", grpc_status:14, created_time:"2023-12-05T20:15:57.52638993+00:00"}"
>>, message=Retry timeout: 1200s, message=Socket closed)> (api_request.py:54)
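As a quick sanity check on the log above, the following sketch computes how long the failed query RPC ran, using only the 'RPC start' and 'RPC error' timestamps and the 1200s retry timeout quoted in the message. It is plain arithmetic on the logged values and makes no claim about why the client returned later than the timeout.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S.%f"

# Timestamps copied from the 'RPC start' / 'RPC error' fields in the log above.
rpc_start = datetime.strptime("2023-12-05 19:56:53.384343", FMT)
rpc_error = datetime.strptime("2023-12-05 20:21:18.038669", FMT)

# "Retry timeout: 1200s" as reported in the error message.
retry_timeout_s = 1200

elapsed_s = (rpc_error - rpc_start).total_seconds()
overrun_s = elapsed_s - retry_timeout_s

print(f"elapsed: {elapsed_s:.1f}s, past the retry timeout by {overrun_s:.1f}s")
```

The request was outstanding for roughly 24 minutes before the error surfaced on the client side.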

The concurrent task config is:

    concurrent_tasks:
      - type: search
        weight: 3
        params:
          nq: 100 
          top_k: 100 
          random_data: true
          output_fields: ['int64_pk_5b']
          search_param:
            search_list: 100 
          timeout: 1200
      - type: query
        weight: 2
        params:
          expr: 'id >= 50000000'
          timeout: 1200
      - type: delete
        weight: 2
        params:
          expr: null
          delete_length: 60
          timeout: 1200
      - type: insert
        weight: 3
        params:
          nb: 100 
          timeout: 1200
          start_id: 50000000
          random_id: true
          random_vector: true

Expected Behavior

No response

Steps To Reproduce

4am argo: yanliang-est-128cu-cron-9fx5b
grafana: https://grafana-4am.zilliz.cc/d/uLf5cJ3Ga/milvus2-0?orgId=1&var-datasource=prometheus&var-cluster=&var-namespace=qa-milvus&var-instance=yanliang-est-128cu&var-collection=All&var-app_name=milvus&from=1701792025000&to=1701831455000

Milvus Log

pods:

yanliang-est-128cu-etcd-0                                         1/1     Running                           1 (4d9h ago)      22d     10.104.23.62    4am-node27   <none>           <none>
yanliang-est-128cu-etcd-1                                         1/1     Running                           0                 2d16h   10.104.24.180   4am-node29   <none>           <none>
yanliang-est-128cu-etcd-2                                         1/1     Running                           1 (4d9h ago)      22d     10.104.17.183   4am-node23   <none>           <none>
yanliang-est-128cu-milvus-datanode-f68547bd5-2pb7t                1/1     Running                           318 (47h ago)     6d19h   10.104.12.234   4am-node17   <none>           <none>
yanliang-est-128cu-milvus-datanode-f68547bd5-qnpgf                1/1     Running                           308 (47h ago)     3d10h   10.104.1.13     4am-node10   <none>           <none>
yanliang-est-128cu-milvus-indexnode-56c5485449-77q87              1/1     Running                           4 (3d10h ago)     6d19h   10.104.1.253    4am-node10   <none>           <none>
yanliang-est-128cu-milvus-indexnode-56c5485449-77v5w              1/1     Running                           4 (3d10h ago)     6d19h   10.104.13.55    4am-node16   <none>           <none>
yanliang-est-128cu-milvus-indexnode-56c5485449-mmjb7              1/1     Running                           3 (3d10h ago)     6d19h   10.104.12.220   4am-node17   <none>           <none>
yanliang-est-128cu-milvus-mixcoord-5b7c556647-v6z9v               1/1     Running                           194 (2d17h ago)   6d19h   10.104.5.115    4am-node12   <none>           <none>
yanliang-est-128cu-milvus-proxy-645b5f85bf-tn9w4                  1/1     Running                           201 (134m ago)    6d19h   10.104.9.116    4am-node14   <none>           <none>
yanliang-est-128cu-milvus-proxy-645b5f85bf-vntlc                  1/1     Running                           6 (4h16m ago)     42h     10.104.13.252   4am-node16   <none>           <none>
yanliang-est-128cu-milvus-querynode-78c4cd848f-5vcrg              1/1     Running                           114 (42h ago)     6d19h   10.104.5.118    4am-node12   <none>           <none>
yanliang-est-128cu-milvus-querynode-78c4cd848f-6dh7b              1/1     Running                           0                 42h     10.104.18.185   4am-node25   <none>           <none>
yanliang-est-128cu-milvus-querynode-78c4cd848f-bgw5m              1/1     Running                           109 (42h ago)     6d19h   10.104.23.41    4am-node27   <none>           <none>
yanliang-est-128cu-milvus-querynode-78c4cd848f-c9dgg              1/1     Running                           119 (42h ago)     6d19h   10.104.4.48     4am-node11   <none>           <none>
yanliang-est-128cu-milvus-querynode-78c4cd848f-jg926              1/1     Running                           112 (42h ago)     6d19h   10.104.13.47    4am-node16   <none>           <none>
yanliang-est-128cu-milvus-querynode-78c4cd848f-l8tsj              1/1     Running                           113 (42h ago)     6d19h   10.104.9.118    4am-node14   <none>           <none>
yanliang-est-128cu-milvus-querynode-78c4cd848f-lzr8p              1/1     Running                           113 (42h ago)     6d19h   10.104.14.25    4am-node18   <none>           <none>
yanliang-est-128cu-minio-0                                        1/1     Running                           1 (4d9h ago)      22d     10.104.23.30    4am-node27   <none>           <none>
yanliang-est-128cu-minio-1                                        1/1     Running                           1 (4d9h ago)      22d     10.104.20.81    4am-node22   <none>           <none>
yanliang-est-128cu-minio-2                                        1/1     Running                           1 (4d9h ago)      22d     10.104.17.131   4am-node23   <none>           <none>
yanliang-est-128cu-minio-3                                        1/1     Running                           0                 2d11h   10.104.24.156   4am-node29   <none>           <none>
yanliang-est-128cu-pulsar-bookie-0                                1/1     Running                           0                 2d11h   10.104.24.177   4am-node29   <none>           <none>
yanliang-est-128cu-pulsar-bookie-1                                1/1     Running                           47 (47h ago)      2d3h    10.104.19.8     4am-node28   <none>           <none>
yanliang-est-128cu-pulsar-bookie-2                                1/1     Running                           1 (4d9h ago)      22d     10.104.20.158   4am-node22   <none>           <none>
yanliang-est-128cu-pulsar-broker-0                                1/1     Running                           0                 42h     10.104.6.122    4am-node13   <none>           <none>
yanliang-est-128cu-pulsar-proxy-0                                 1/1     Running                           0                 2d17h   10.104.13.106   4am-node16   <none>           <none>
yanliang-est-128cu-pulsar-recovery-0                              1/1     Running                           1 (4d9h ago)      22d     10.104.20.90    4am-node22   <none>           <none>
yanliang-est-128cu-pulsar-zookeeper-0                             1/1     Running                           1 (47h ago)       2d3h    10.104.19.6     4am-node28   <none>           <none>
yanliang-est-128cu-pulsar-zookeeper-1                             1/1     Running                           1 (4d9h ago)      22d     10.104.23.51    4am-node27   <none>           <none>
yanliang-est-128cu-pulsar-zookeeper-2                             1/1     Running                           1 (4d10h ago)     22d     10.104.15.211   4am-node20   <none>           <none>
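The restart counts are easy to miss in a listing this wide. The sketch below ranks pods by restart count, assuming the column layout shown above (name, ready, status, restarts as the fourth whitespace-separated field). The sample holds five rows copied from the listing with whitespace condensed; `restart_counts` is a hypothetical helper, not part of any Milvus or Kubernetes tooling.

```python
# Five rows copied from the pod listing above (whitespace condensed).
sample = """\
yanliang-est-128cu-milvus-datanode-f68547bd5-2pb7t 1/1 Running 318 (47h ago) 6d19h 10.104.12.234 4am-node17 <none> <none>
yanliang-est-128cu-milvus-mixcoord-5b7c556647-v6z9v 1/1 Running 194 (2d17h ago) 6d19h 10.104.5.115 4am-node12 <none> <none>
yanliang-est-128cu-milvus-proxy-645b5f85bf-tn9w4 1/1 Running 201 (134m ago) 6d19h 10.104.9.116 4am-node14 <none> <none>
yanliang-est-128cu-milvus-proxy-645b5f85bf-vntlc 1/1 Running 6 (4h16m ago) 42h 10.104.13.252 4am-node16 <none> <none>
yanliang-est-128cu-milvus-querynode-78c4cd848f-6dh7b 1/1 Running 0 42h 10.104.18.185 4am-node25 <none> <none>
"""


def restart_counts(listing: str) -> dict[str, int]:
    """Map pod name to restart count; the count is the 4th field of each row."""
    counts = {}
    for line in listing.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or truncated rows
        counts[fields[0]] = int(fields[3])
    return counts


# Highest restart count first.
ranked = sorted(restart_counts(sample).items(), key=lambda kv: kv[1], reverse=True)
for pod, restarts in ranked:
    print(f"{restarts:>4}  {pod}")
```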

Anything else?

No response

yanliang567 commented 10 months ago

Are search/query requests recoverable after the proxy comes back online? @ThreadDao /assign @weiliu1031 /unassign

ThreadDao commented 10 months ago

Are search/query requests recoverable after the proxy comes back online? @ThreadDao /assign @weiliu1031 /unassign

Yes, it is recoverable. Once the proxy pod is running again, queries succeed.

weiliu1031 commented 10 months ago


Concurrent large query requests that each return about 7 million entries put heavy pressure on the proxy's memory, especially during the reduceInternelRetrieveResults step.
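A rough back-of-the-envelope estimate shows why a handful of such queries can exhaust a 4Gi proxy. The sketch below uses the figures from the report (segment_total_vectors, the query expr, the proxy memory limit) and three labeled assumptions: ids are dense so the expr matches the tail of the id range, each returned row is a single int64, and the proxy holds a hypothetical three copies of the data per request while reducing and serializing. Real usage is likely higher, since per-object overhead and the concurrent search and insert traffic are ignored.

```python
# Figures taken from the report above.
total_vectors = 57_745_408          # 'segment_total_vectors'
query_lower_bound = 50_000_000      # expr: 'id >= 50000000'
proxy_memory_limit = 4 * 1024**3    # proxy limit: memory 4Gi

# ASSUMPTION: ids are dense in [0, total_vectors), so the expr matches the tail.
matching_rows = total_vectors - query_lower_bound

# ASSUMPTION: each returned row carries one int64 value (8 bytes), no overhead.
payload_bytes = matching_rows * 8

# ASSUMPTION (hypothetical factor): partial results, the merged result, and the
# serialized response coexist, so about 3 copies are held per request.
copies_per_request = 3
per_request_bytes = payload_bytes * copies_per_request

# Number of such queries that fit in the memory limit at the same time.
max_inflight = proxy_memory_limit // per_request_bytes

print(f"matching rows:       {matching_rows:,}")
print(f"payload per result:  {payload_bytes / 1024**2:.1f} MiB")
print(f"per request (x{copies_per_request}):   {per_request_bytes / 1024**2:.1f} MiB")
print(f"in-flight queries that fit in 4Gi: {max_inflight}")
```

Under these assumptions the expr matches about 7.7 million rows, in line with the "7m entries" figure, and only a couple of dozen in-flight queries would fill the proxy's memory limit.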