milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0

[Bug]: Failed to search: node offline[node=-1]: channel not available when `streamingDeltaForwardPolicy` is `Direct` #36887

Open ThreadDao opened 6 days ago

ThreadDao commented 6 days ago

Is there an existing issue for this?

Environment

- Milvus version: 2.4-20241010-eaa94875-amd64
- Deployment mode(standalone or cluster):
- MQ type(rocksmq, pulsar or kafka):    
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

Deploy a Milvus cluster with the following config:

  config:
    dataCoord:
      enableActiveStandby: true
      segment:
        expansionRate: 1.15
        maxSize: 2048
        sealProportion: 0.12
    dataNode:
      compaction:
        levelZeroBatchMemoryRatio: 0.5
    indexCoord:
      enableActiveStandby: true
    log:
      level: debug
    minio:
      accessKeyID: miniozong
      bucketName: bucket-zong
      rootPath: compact_2
      secretAccessKey: miniozong
    queryCoord:
      enableActiveStandby: true
    queryNode:    # the settings under test
      levelZeroForwardPolicy: RemoteLoad
      streamingDeltaForwardPolicy: Direct
    quotaAndLimits:
      dml:
        deleteRate:
          max: 0.5
        enabled: false
        insertRate:
          max: 8
        upsertRate:
          max: 8
      growingSegmentsSizeProtection:
        enabled: false
        highWaterLevel: 0.2
        lowWaterLevel: 0.1
      limitWriting:
        memProtection:
          dataNodeMemoryHighWaterLevel: 0.85
          dataNodeMemoryLowWaterLevel: 0.75
          queryNodeMemoryHighWaterLevel: 0.85
          queryNodeMemoryLowWaterLevel: 0.75
      limits:
        complexDeleteLimitEnable: true
    rootCoord:
      enableActiveStandby: true
    trace:
      exporter: jaeger
      jaeger:
        url: http://tempo-distributor.tempo:14268/api/traces
      sampleFraction: 1

test steps

  1. There is a collection with an int64 pk field and a vector field; the collection holds 100M entities.
  2. When the deletes start, searches fail:
    [2024-10-15 10:48:02,882 - ERROR - fouram]: RPC error: [search], <MilvusException: (code=503, message=fail to search on QueryNode 23: distribution is not servcieable: channel not available[channel=compact-opt-100m-2-rootcoord-dml_0_453128445902192997v0])>, <Time:{'RPC start': '2024-10-15 10:47:45.846202', 'RPC error': '2024-10-15 10:48:02.882154'}> (decorators.py:147)

    client delete log:

    [2024-10-15 18:46:51,711 - INFO - ci_test]: start to delete [0, ..., 15999] with length 16000 (tmp.py:40)
    [2024-10-15 18:46:51,825 - INFO - ci_test]: delete cost 0.11316943168640137 with res (insert count: 0, delete count: 16000, upsert count: 0, timestamp: 0, success count: 0, err count: 0 (tmp.py:44)
    [2024-10-15 18:46:52,716 - INFO - ci_test]: start to delete [16000, ..., 31999] with length 16000 (tmp.py:40)
    ...
    [2024-10-15 18:51:55,817 - INFO - ci_test]: delete cost 0.11813139915466309 with res (insert count: 0, delete count: 16000, upsert count: 0, timestamp: 0, success count: 0, err count: 0 (tmp.py:44)
    [2024-10-15 18:51:56,703 - INFO - ci_test]: start to delete [4864000, ..., 4879999] with length 16000 (tmp.py:40)
    [2024-10-15 18:51:56,825 - INFO - ci_test]: delete cost 0.12181949615478516 with res (insert count: 0, delete count: 16000, upsert count: 0, timestamp: 0, success count: 0, err count: 0 (tmp.py:44)
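
The 503 above appears to be transient while the delegator's channel is offline. A client-side sketch (hypothetical helper names, not part of pymilvus) that classifies the error text shown in the log and retries the search:

```python
import time


def is_channel_unavailable(msg: str) -> bool:
    """Hypothetical predicate: does this error text match the failure above?"""
    return "channel not available" in msg or "node offline" in msg


def search_with_retry(do_search, retries=3, backoff=1.0):
    """Retry a search callable on transient channel errors (sketch only).

    `do_search` is any zero-argument callable, e.g. a lambda wrapping
    Collection.search(); pymilvus raises MilvusException, caught here
    generically since this is an illustration.
    """
    for attempt in range(retries):
        try:
            return do_search()
        except Exception as e:
            if not is_channel_unavailable(str(e)) or attempt == retries - 1:
                raise
            time.sleep(backoff * (attempt + 1))  # linear backoff before retrying
```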

Expected Behavior

No response

Steps To Reproduce

- https://argo-workflows.zilliz.cc/archived-workflows/qa/64dab658-11fc-4a63-ac02-8770c303363f?nodeId=compact-opt-delete-100m-6b
- delete script:

import logging
import time

from pymilvus import Collection, connections

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("ci_test")


def get_ids(start, end, batch):
    """Yield consecutive batches of at most `batch` ids covering [start, end)."""
    while start < end:
        size = min(batch, end - start)
        yield list(range(start, start + size))
        start += size


def delete_with_rate(_host, _name, _start, _end, _batch, pk="id"):
    """Delete entities batch by batch at a rate of roughly one batch per second."""
    connections.connect(host=_host)
    c = Collection(name=_name)
    for ids in get_ids(_start, _end, _batch):
        log.info(f"start to delete [{ids[0]}, ..., {ids[-1]}] with length {len(ids)}")
        start_time = time.time()
        delete_res = c.delete(expr=f"{pk} in {ids}")
        cost = time.time() - start_time
        log.info(f"delete cost {cost} with res {delete_res}")
        if cost < 1:
            time.sleep(1 - cost)  # throttle to ~1 batch/second


if __name__ == '__main__':
    host = "xxx"
    name = "fouram_3QEsE82U"
    delete_with_rate(host, name, 0, 50000000, _batch=16000)
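
For scale, an illustrative calculation (not from the issue): assuming an int64 primary key costs 8 bytes and the quota rate is measured in MB/s, one 16,000-id batch per second is roughly 0.12 MB/s of delete traffic, below the configured `deleteRate.max: 0.5` (which is moot anyway, since `quotaAndLimits.dml.enabled` is `false`):

```python
BATCH = 16_000     # ids deleted per one-second iteration
INT64_BYTES = 8    # assumed size of one int64 primary key

mb_per_sec = BATCH * INT64_BYTES / (1024 ** 2)
print(f"{mb_per_sec:.3f} MB/s")  # ~0.122 MB/s vs the 0.5 MB/s deleteRate cap
```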

### Milvus Log

pods:

compact-opt-100m-2-milvus-datanode-74b5c7854b-xxcdl      1/1  Running  0  3h53m  10.104.14.7    4am-node18
compact-opt-100m-2-milvus-indexnode-6cd9b49f5-9xtfj      1/1  Running  0  3h52m  10.104.4.36    4am-node11
compact-opt-100m-2-milvus-indexnode-6cd9b49f5-qb26s      1/1  Running  0  3h53m  10.104.17.2    4am-node23
compact-opt-100m-2-milvus-indexnode-6cd9b49f5-zp5bj      1/1  Running  0  3h51m  10.104.1.234   4am-node10
compact-opt-100m-2-milvus-mixcoord-8f9875d6d-khsb4       1/1  Running  0  3h53m  10.104.4.33    4am-node11
compact-opt-100m-2-milvus-proxy-5bd9875bb4-tkrzw         1/1  Running  0  3h53m  10.104.9.107   4am-node14
compact-opt-100m-2-milvus-querynode-0-7488f76b9b-8dz69   1/1  Running  0  3h52m  10.104.20.48   4am-node22
compact-opt-100m-2-milvus-querynode-0-7488f76b9b-dqwzf   1/1  Running  0  3h49m  10.104.23.93   4am-node27
compact-opt-100m-2-milvus-querynode-0-7488f76b9b-hs6hn   1/1  Running  0  3h53m  10.104.24.147  4am-node29
compact-opt-100m-2-milvus-querynode-0-7488f76b9b-p5dcv   1/1  Running  0  3h50m  10.104.30.192  4am-node38



### Anything else?

_No response_
yanliang567 commented 6 days ago

/unassign

ThreadDao commented 3 days ago

The search failure is fixed, but the delegator and 2 querynodes OOMed:

compact-opt-100m-2-milvus-querynode-1-68b66fcf88-22v7w            1/1     Running     2 (3h54m ago)   26h     10.104.23.52    4am-node27   <none>           <none>
compact-opt-100m-2-milvus-querynode-1-68b66fcf88-g9rdb            1/1     Running     1 (3h53m ago)   26h     10.104.34.82    4am-node37   <none>           <none>
compact-opt-100m-2-milvus-querynode-1-68b66fcf88-vxdnw            1/1     Running     2 (3h34m ago)   26h     10.104.30.132   4am-node38   <none>           <none>
compact-opt-100m-2-milvus-querynode-1-68b66fcf88-zbqhq            1/1     Running     3 (3h52m ago)   26h     10.104.24.50    4am-node29   <none>           <none>
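
The `memProtection` watermarks in the config above are fractions of node memory. Assuming the usual watermark semantics (throttle writes between the low and high watermarks, deny DML above the high one) and a hypothetical 16 GiB querynode limit (the issue does not state the actual limit), the thresholds work out to:

```python
LIMIT_GIB = 16.0         # hypothetical querynode memory limit, for illustration only
LOW, HIGH = 0.75, 0.85   # queryNodeMemory{Low,High}WaterLevel from the config

low_gib = LIMIT_GIB * LOW    # above this, write throttling would begin
high_gib = LIMIT_GIB * HIGH  # above this, DML would be denied
print(round(low_gib, 1), round(high_gib, 1))  # 12.0 13.6
```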
xiaofan-luan commented 7 hours ago

It seems that it hit the quota limit, so it shouldn't have OOMed?

Did the OOM happen after the "memory quota exceeded" log?