milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0

[Bug]: Five out of seven flush requests 120s timeout #33954

Open ThreadDao opened 4 months ago

ThreadDao commented 4 months ago

Is there an existing issue for this?

Environment

- Milvus version: 2.4-20240618-79546a6c-amd64
- Deployment mode(standalone or cluster):
- MQ type(rocksmq, pulsar or kafka):    
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

Create a Milvus instance with the following config:

  config:
    dataCoord:
      segment:
        sealProportion: 1.52e-05
    log:
      level: debug
    quotaAndLimits:
      flushRate:
        enabled: true
        max: 0.1 
    trace:
      exporter: jaeger
      jaeger:
        url: http://tempo-distributor.tempo:14268/api/traces
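For context, a back-of-the-envelope check of the quota config above (assumption: `quotaAndLimits.flushRate.max` is a per-second rate cap, so `0.1` means roughly one admitted flush per 10 s):

```python
# Hypothetical arithmetic only, not Milvus code.
# Assumption: quotaAndLimits.flushRate.max = 0.1 is a per-second cap,
# so admitted flushes are spaced at least 1 / 0.1 = 10 s apart.
FLUSH_RATE_MAX = 0.1    # from quotaAndLimits.flushRate.max
CLIENT_TIMEOUT_S = 120  # flush timeout observed in the test
FLUSH_REQUESTS = 7      # concurrent flush requests in the test

min_interval_s = 1 / FLUSH_RATE_MAX  # 10 s between admitted flushes
# If all 7 flushes arrived at once and were purely serialized by the
# quota, the i-th would wait (i - 1) * min_interval_s before starting.
worst_queue_wait_s = (FLUSH_REQUESTS - 1) * min_interval_s  # 60 s

print(min_interval_s, worst_queue_wait_s)  # 10.0 60.0
```

Even under this pessimistic pure-serialization assumption, quota queueing alone stays under the 120 s timeout, so the 250–503 s flush latencies reported below suggest the flushes themselves were slow, not just delayed by the rate limiter.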

Test steps:

  1. create collection with 1024 partitions (partition-key), 1 shard
  2. create index
  3. insert 10m-128d data -> flush
  4. index -> load
  5. concurrent requests: search + upsert + flush
  6. 5 of the 7 flush requests hit the 120 s timeout:
    [2024-06-18 10:44:06,589 -  INFO - fouram]: grpc     flush                                                                              7    5(71.43%) | 361353  249863  503063 367000 |    0.00        0.00 (stats.py:789)
    [2024-06-18 10:44:06,590 -  INFO - fouram]: grpc     search                                                                           301    10(3.32%) |  63813   15424  120006  60000 |    0.17        0.01 (stats.py:789)
    [2024-06-18 10:44:06,590 -  INFO - fouram]: grpc     upsert                                                                           160     0(0.00%) | 180105     588  301469 175000 |    0.09        0.00 (stats.py:789)
    [2024-06-18 10:44:06,590 -  INFO - fouram]:          Aggregated                                                                       468    15(3.21%) | 108021     588  503063  81000 |    0.27        0.01 (stats.py:789)
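The stats lines above appear to be Locust-format rows (request count, failure count with percentage, then Avg/Min/Max/Median latency in ms, then req/s and failures/s). A quick sanity check of the flush row:

```python
# Verify the failure percentage printed in the flush stats row:
# "7    5(71.43%)" means 5 of 7 flush requests failed.
flush_total = 7
flush_failed = 5

pct = round(flush_failed / flush_total * 100, 2)
print(pct)  # 71.43, matching the log
```

The flush row's latencies (avg 361 s, min 250 s, max 503 s) are all well past the 120 s client timeout.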

Expected Behavior

No response

Steps To Reproduce

https://argo-workflows.zilliz.cc/archived-workflows/qa/88b56c6a-eb3d-4862-95a6-b0c64434efde?nodeId=compact-opt-1024-with-flush-2

Milvus Log

pods:

compact-opt-flush2-milvus-datanode-5898b9d778-sshqx               1/1     Running     0                82m     10.104.5.70     4am-node12   <none>           <none>
compact-opt-flush2-milvus-indexnode-8c577d9d6-9tnms               1/1     Running     0                82m     10.104.17.163   4am-node23   <none>           <none>
compact-opt-flush2-milvus-indexnode-8c577d9d6-9wl8n               1/1     Running     0                82m     10.104.6.58     4am-node13   <none>           <none>
compact-opt-flush2-milvus-indexnode-8c577d9d6-qq9c4               1/1     Running     0                82m     10.104.20.226   4am-node22   <none>           <none>
compact-opt-flush2-milvus-mixcoord-5b9f79b984-zwfn2               1/1     Running     0                82m     10.104.4.88     4am-node11   <none>           <none>
compact-opt-flush2-milvus-proxy-b55c6db47-vnzc2                   1/1     Running     0                82m     10.104.13.204   4am-node16   <none>           <none>
compact-opt-flush2-milvus-querynode-0-786c99d5cc-k4bcz            1/1     Running     0                82m     10.104.18.196   4am-node25   <none>           <none>
compact-opt-flush2-milvus-querynode-0-786c99d5cc-q5znj            1/1     Running     0                82m     10.104.13.205   4am-node16   <none>           <none>

Anything else?

No response

yanliang567 commented 4 months ago

/unassign

bigsheeper commented 4 months ago

The logs were lost; please let me know if this issue reproduces again. Thanks @ThreadDao

XuanYang-cn commented 3 days ago

Is this still reproducing?

/assign @ThreadDao
/unassign