milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0

[Bug]: Fail to Search: reason=distribution is no servcieable: channel not available #30799

Closed: ThreadDao closed this issue 7 months ago

ThreadDao commented 8 months ago

Is there an existing issue for this?

Environment

- Milvus version: cardinal-milvus-io-2.3-ef086dc-20240222
- Deployment mode(standalone or cluster): cluster
- MQ type(rocksmq, pulsar or kafka): pulsar   
- SDK version(e.g. pymilvus v2.0.0rc2): pymilvus 2.3.6rc3
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

  1. Collection laion_stable_4 holds about 60M rows of 768-dim vectors, and its schema is:
    {'auto_id': False, 'description': '', 'fields': [
        {'name': 'id', 'description': '', 'type': <DataType.INT64: 5>, 'is_primary': True, 'auto_id': False},
        {'name': 'float_vector', 'description': '', 'type': <DataType.FLOAT_VECTOR: 101>, 'params': {'dim': 768}},
        {'name': 'int64_pk_5b', 'description': '', 'type': <DataType.INT64: 5>, 'is_partition_key': True},
        {'name': 'varchar_caption', 'description': '', 'type': <DataType.VARCHAR: 21>, 'params': {'max_length': 8192}},
        {'name': 'varchar_NSFW', 'description': '', 'type': <DataType.VARCHAR: 21>, 'params': {'max_length': 8192}},
        {'name': 'float64_similarity', 'description': '', 'type': <DataType.FLOAT: 10>},
        {'name': 'int64_width', 'description': '', 'type': <DataType.INT64: 5>},
        {'name': 'int64_height', 'description': '', 'type': <DataType.INT64: 5>},
        {'name': 'int64_original_width', 'description': '', 'type': <DataType.INT64: 5>},
        {'name': 'int64_original_height', 'description': '', 'type': <DataType.INT64: 5>},
        {'name': 'varchar_md5', 'description': '', 'type': <DataType.VARCHAR: 21>, 'params': {'max_length': 8192}}
    ], 'enable_dynamic_field': True}
  2. Reload the collection and run concurrent requests (insert + delete + flush + search):
    concurrent_params:
      concurrent_number: 50
      during_time: '10h'
      interval: 120 
    concurrent_tasks:
      -   
        type: query
        weight: 9
        params:
          #output_fields: ['float_vector']
          expr: "50000000 < id < 5010000"
          timeout: 1200
      -   
        type: search
        weight: 50
        params:
          nq: 10
          top_k: 100 
          #output_fields: ["float32_vector"]
          random_data: true
          search_param:
            ef: 100 
          timeout: 600 
      -   
        type: insert
        weight: 20
        params:
          nb: 200 
          start_id: 63000000
          random_id: true
          random_vector: true
          timeout: 600 
      -   
        type: delete
        weight: 20
        params:
          delete_length: 100 
          timeout: 600 
      -   
        type: flush
        weight: 1
        params:
          timeout: 600 
  3. Many search/query requests failed:
    [2024-02-23 08:19:15,986 - ERROR - fouram]: RPC error: [search], <MilvusException: (code=65535, message=failed to search/query delegator 1652 for channel laion1b-test-2-rootcoord-dml_5_447902879639453431v1: fail to Search, QueryNode ID=1652
    , reason=distribution is not servcieable: channel not available[channel=laion1b-test-2-rootcoord-dml_5_447902879639453431v1])>, <Time:{'RPC start': '2024-02-23 08:17:52.318713', 'RPC error': '2024-02-23 08:19:15.986667'}> (decorators.py:134
    )
    [2024-02-23 08:19:15,989 - ERROR - fouram]: (api_response) : [Collection.search] <MilvusException: (code=65535, message=failed to search/query delegator 1652 for channel laion1b-test-2-rootcoord-dml_5_447902879639453431v1: fail to Search, Q
    ueryNode ID=1652, reason=distribution is not servcieable: channel not available[channel=laion1b-test-2-rootcoord-dml_5_447902879639453431v1])>, [requestId: 09fad0ee-d224-11ee-82c2-825c0a4c1e06] (api_request.py:57)
    [2024-02-23 08:19:15,989 - ERROR - fouram]: [CheckFunc] search request check failed, response:<MilvusException: (code=65535, message=failed to search/query delegator 1652 for channel laion1b-test-2-rootcoord-dml_5_447902879639453431v1: fail
    to Search, QueryNode ID=1652, reason=distribution is not servcieable: channel not available[channel=laion1b-test-2-rootcoord-dml_5_447902879639453431v1])> (func_check.py:48)
    [2024-02-23 08:19:15,995 - ERROR - fouram]: RPC error: [search], <MilvusException: (code=65535, message=failed to search/query delegator 1652 for channel laion1b-test-2-rootcoord-dml_5_447902879639453431v1: fail to Search, QueryNode ID=1652
    , reason=distribution is not servcieable: channel not available[channel=laion1b-test-2-rootcoord-dml_5_447902879639453431v1])>, <Time:{'RPC start': '2024-02-23 08:17:52.268824', 'RPC error': '2024-02-23 08:19:15.995875'}> (decorators.py:134
    )

Expected Behavior

No response

Steps To Reproduce

No response

Milvus Log

Anything else?

No response

chyezh commented 7 months ago

The delegator on laion1b-test-2-milvus-querynode-1-5956458d77-jpf7x failed to apply a delete on segment 447902879710617858 at node laion1b-test-2-milvus-querynode-1-5956458d77-4bntq, which made the delegator's distribution view unavailable.

Error: Node Not Found.


2024-02-23 16:19:15.124 | querynodequerynode4am-node23laion1b-test-2-milvus-querynode-1-5956458d77-jpf7x | [2024/02/23 08:19:15.124 +00:00] [DEBUG] [delegator/delegator_data.go:257] ["delegator plan to applyDelete via worker"] [collectionID=447902879639453431] [channel=laion1b-test-2-rootcoord-dml_5_447902879639453431v1] [replicaID=447902888139161603] [segmentID=447902879710617858] [workerID=1651]
2024-02-23 16:19:15.327 | querynodequerynode4am-node23laion1b-test-2-milvus-querynode-1-5956458d77-jpf7x | [2024/02/23 08:19:15.327 +00:00] [WARN] [delegator/delegator_data.go:274] ["try to delete data on non-exist node"] [collectionID=447902879639453431] [channel=laion1b-test-2-rootcoord-dml_5_447902879639453431v1] [replicaID=447902888139161603] [segmentID=447902879710617858] [workerID=1651]
2024-02-23 16:19:15.327 | querynodequerynode4am-node23laion1b-test-2-milvus-querynode-1-5956458d77-jpf7x | [2024/02/23 08:19:15.327 +00:00] [WARN] [delegator/delegator_data.go:288] ["apply delete for segment failed, marking it offline"] [collectionID=447902879639453431] [channel=laion1b-test-2-rootcoord-dml_5_447902879639453431v1] [replicaID=447902888139161603] [segmentID=447902879710617858] [workerID=1651]
2024-02-23 16:19:15.368 | querynodequerynode4am-node23laion1b-test-2-milvus-querynode-1-5956458d77-jpf7x | [2024/02/23 08:19:15.368 +00:00] [WARN] [delegator/delegator_data.go:238] ["failed to apply delete, mark segment offline"] [collectionID=447902879639453431] [channel=laion1b-test-2-rootcoord-dml_5_447902879639453431v1] [replicaID=447902888139161603] [offlineSegments="[447902879710617858]"]

Session not found.


2024-02-23 16:19:15.327 | (no unique labels) | [2024/02/23 08:19:15.326 +00:00] [WARN] [cluster/worker.go:114] ["failed to call Delete, worker return error"] [workerID=1651] [error="stack trace: /go/src/github.com/milvus-io/milvus/pkg/tracer/stack_trace.go:51 github.com/milvus-io/milvus/pkg/tracer.StackTrace\n/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:550 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).Call\n/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:564 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).ReCall\n/go/src/github.com/milvus-io/milvus/internal/distributed/querynode/client/client.go:87 github.com/milvus-io/milvus/internal/distributed/querynode/client.wrapGrpcCall[...]\n/go/src/github.com/milvus-io/milvus/internal/distributed/querynode/client/client.go:328 github.com/milvus-io/milvus/internal/distributed/querynode/client.(*Client).Delete\n/go/src/github.com/milvus-io/milvus/internal/querynodev2/cluster/worker.go:108 github.com/milvus-io/milvus/internal/querynodev2/cluster.(*remoteWorker).Delete\n/go/src/github.com/milvus-io/milvus/internal/querynodev2/delegator/delegator_data.go:263 github.com/milvus-io/milvus/internal/querynodev2/delegator.(*shardDelegator).applyDelete.func1\n/go/src/github.com/milvus-io/milvus/pkg/util/retry/retry.go:44 github.com/milvus-io/milvus/pkg/util/retry.Do\n/go/src/github.com/milvus-io/milvus/internal/querynodev2/delegator/delegator_data.go:258 github.com/milvus-io/milvus/internal/querynodev2/delegator.(*shardDelegator).applyDelete\n/go/src/github.com/milvus-io/milvus/internal/querynodev2/delegator/delegator_data.go:212 github.com/milvus-io/milvus/internal/querynodev2/delegator.(*shardDelegator).ProcessDelete.func1: attempt #0: session not found->querynode-1651: node not found[node=1651]: unrecoverable error"] [errorVerbose="stack trace: /go/src/github.com/milvus-io/milvus/pkg/tracer/stack_trace.go:51 github.com/milvus-io/milvus/pkg/tracer.StackTrace: attempt #0: session not found->querynode-1651: node not found[node=1651]: unrecoverable error\n(1) attached stack trace\n  -- stack trace:\n  \| github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).Call\n  \| \t/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:550\n  \| github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).ReCall\n  \| \t/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:564\n  \| github.com/milvus-io/milvus/internal/distributed/querynode/client.wrapGrpcCall[...]\n  \| \t/go/src/github.com/milvus-io/milvus/internal/distributed/querynode/client/client.go:87\n  \| github.com/milvus-io/milvus/internal/distributed/querynode/client.(*Client).Delete\n  \| \t/go/src/github.com/milvus-io/milvus/internal/distributed/querynode/client/client.go:328\n  \| github.com/milvus-io/milvus/internal/querynodev2/cluster.(*remoteWorker).Delete\n  \| \t/go/src/github.com/milvus-io/milvus/internal/querynodev2/cluster/worker.go:108\n  \| github.com/milvus-io/milvus/internal/querynodev2/delegator.(*shardDelegator).applyDelete.func1\n  \| \t/go/src/github.com/milvus-io/milvus/internal/querynodev2/delegator/delegator_data.go:263\n  \| github.com/milvus-io/milvus/pkg/util/retry.Do\n  \| \t/go/src/github.com/milvus-io/milvus/pkg/util/retry/retry.go:44\n  \| github.com/milvus-io/milvus/internal/querynodev2/delegator.(*shardDelegator).applyDelete\n  \| \t/go/src/github.com/milvus-io/milvus/internal/querynodev2/delegator/delegator_data.go:258\n 
 \| github.com/milvus-io/milvus/internal/querynodev2/delegator.(*shardDelegator).ProcessDelete.func1\n  \| \t/go/src/github.com/milvus-io/milvus/internal/querynodev2/delegator/delegator_data.go:212\n  \| golang.org/x/sync/errgroup.(*Group).Go.func1\n  \| \t/go/pkg/mod/golang.org/x/sync@v0.3.0/errgroup/errgroup.go:75\n  \| runtime.goexit\n  \| \t/usr/local/go/src/runtime/asm_amd64.s:1598\nWraps: (2) stack trace: /go/src/github.com/milvus-io/milvus/pkg/tracer/stack_trace.go:51 github.com/milvus-io/milvus/pkg/tracer.StackTrace\n  \| /go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:550 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).Call\n  \| /go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:564 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).ReCall\n  \| /go/src/github.com/milvus-io/milvus/internal/distributed/querynode/client/client.go:87 github.com/milvus-io/milvus/internal/distributed/querynode/client.wrapGrpcCall[...]\n  \| /go/src/github.com/milvus-io/milvus/internal/distributed/querynode/client/client.go:328 github.com/milvus-io/milvus/internal/distributed/querynode/client.(*Client).Delete\n  \| /go/src/github.com/milvus-io/milvus/internal/querynodev2/cluster/worker.go:108 github.com/milvus-io/milvus/internal/querynodev2/cluster.(*remoteWorker).Delete\n  \| /go/src/github.com/milvus-io/milvus/internal/querynodev2/delegator/delegator_data.go:263 github.com/milvus-io/milvus/internal/querynodev2/delegator.(*shardDelegator).applyDelete.func1\n  \| /go/src/github.com/milvus-io/milvus/pkg/util/retry/retry.go:44 github.com/milvus-io/milvus/pkg/util/retry.Do\n  \| /go/src/github.com/milvus-io/milvus/internal/querynodev2/delegator/delegator_data.go:258 github.com/milvus-io/milvus/internal/querynodev2/delegator.(*shardDelegator).applyDelete\n  \| /go/src/github.com/milvus-io/milvus/internal/querynodev2/delegator/delegator_data.go:212 github.com/milvus-io/milvus/internal/querynodev2/delegator.(*shardDelegator).ProcessDelete.func1\nWraps: (3) attempt #0: session not found->querynode-1651: node not found[node=1651]: unrecoverable error\nError types: (1) *withstack.withStack (2) *errutil.withPrefix (3) merr.multiErrors"]
chyezh commented 7 months ago

One failure causes the channel to stay unserviceable continuously.

[image]

2024-02-23 16:19:15.124 | (no unique labels) | [2024/02/23 08:19:15.124 +00:00] [DEBUG] [delegator/delegator_data.go:257] ["delegator plan to applyDelete via worker"] [collectionID=447902879639453431] [channel=laion1b-test-2-rootcoord-dml_5_447902879639453431v1] [replicaID=447902888139161603] [segmentID=447902879710617858] [workerID=1651]
2024-02-23 16:19:15.327 | (no unique labels) | [2024/02/23 08:19:15.327 +00:00] [WARN] [delegator/delegator_data.go:288] ["apply delete for segment failed, marking it offline"] [collectionID=447902879639453431] [channel=laion1b-test-2-rootcoord-dml_5_447902879639453431v1] [replicaID=447902888139161603] [segmentID=447902879710617858] [workerID=1651]
2024-02-23 16:19:15.327 | (no unique labels) | [2024/02/23 08:19:15.327 +00:00] [DEBUG] [delegator/delegator_data.go:257] ["delegator plan to applyDelete via worker"] [collectionID=447902879639453431] [channel=laion1b-test-2-rootcoord-dml_5_447902879639453431v1] [replicaID=447902888139161603] [segmentID=447902879711838680] [workerID=1651]
2024-02-23 16:19:15.330 | (no unique labels) | [2024/02/23 08:19:15.330 +00:00] [DEBUG] [delegator/delegator_data.go:257] ["delegator plan to applyDelete via worker"] [collectionID=447902879639453431] [channel=laion1b-test-2-rootcoord-dml_5_447902879639453431v1] [replicaID=447902888139161603] [segmentID=447902879711326791] [workerID=1651]
2024-02-23 16:19:15.332 | (no unique labels) | [2024/02/23 08:19:15.332 +00:00] [DEBUG] [delegator/delegator_data.go:257] ["delegator plan to applyDelete via worker"] [collectionID=447902879639453431] [channel=laion1b-test-2-rootcoord-dml_5_447902879639453431v1] [replicaID=447902888139161603] [segmentID=447902879711851858] [workerID=1651]
2024-02-23 16:19:15.333 | (no unique labels) | [2024/02/23 08:19:15.333 +00:00] [DEBUG] [delegator/delegator_data.go:257] ["delegator plan to applyDelete via worker"] [collectionID=447902879639453431] [channel=laion1b-test-2-rootcoord-dml_5_447902879639453431v1] [replicaID=447902888139161603] [segmentID=447902879711326800] [workerID=1651]
chyezh commented 7 months ago

Bug here.

QueryNode uses the rpc-level context when calling GetWorker.

// GetWorker returns worker with specified nodeID.
func (m *grpcWorkerManager) GetWorker(ctx context.Context, nodeID int64) (Worker, error) {
...
            worker, err = m.builder(ctx, nodeID)
...
}

The grpc client may therefore be built with a context inherited from a previous grpc request.

func NewClient(ctx context.Context, addr string, nodeID int64) (*Client, error) {
....
    sess := sessionutil.NewSession(ctx)
...
}

sessionutil uses this context for session operations such as verifySession. Once the grpc client is disconnected from the server and that originating request context is gone, every subsequent reconnect attempt breaks down with a context.Canceled error.

func (c *ClientBase[T]) verifySession(ctx context.Context) error {
....
        sessions, _, getSessionErr := c.sess.GetSessions(c.GetRole())
....
}
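
To make the failure mode concrete, here is a small, self-contained Go sketch (not Milvus code; the client type and reconnect method are made up for illustration). A cached, long-lived client that captured a per-rpc context fails every later operation with context.Canceled once that rpc is finished, while the same client built from a server-scoped context keeps working.

package main

import (
    "context"
    "fmt"
    "time"
)

// client stands in for a cached grpc client/session: it keeps the context it
// was built with and consults it on every later operation.
type client struct{ ctx context.Context }

func (c *client) reconnect() error {
    // Any later operation first checks the captured context.
    if err := c.ctx.Err(); err != nil {
        return fmt.Errorf("reconnect failed: %w", err)
    }
    return nil
}

func main() {
    serverCtx := context.Background() // lives as long as the process

    // Buggy pattern: build the long-lived client from the per-rpc context.
    rpcCtx, cancel := context.WithTimeout(serverCtx, 50*time.Millisecond)
    bad := &client{ctx: rpcCtx}
    cancel() // the rpc finishes and its context is canceled

    // Safer pattern: build the client from the server-scoped context.
    good := &client{ctx: serverCtx}

    fmt.Println("built from rpc ctx:   ", bad.reconnect())  // reconnect failed: context canceled
    fmt.Println("built from server ctx:", good.reconnect()) // <nil>
}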

Logs like the following can be found:

2024-02-23 16:19:15.325 | (no unique labels) | [2024/02/23 08:19:15.324 +00:00] [WARN] [grpcclient/client.go:449] ["fail to get grpc client"] [client_role=querynode-1651] [error="failed to connect 10.104.32.44:21123, reason: context deadline exceeded"]
  |   | 2024-02-23 16:19:15.325 | (no unique labels) | [2024/02/23 08:19:15.325 +00:00] [WARN] [grpcclient/client.go:373] ["fail to get session"] [clientRole=querynode-1651] [error="context canceled"] |  
  |   | 2024-02-23 16:19:15.325 | (no unique labels) | [2024/02/23 08:19:15.325 +00:00] [WARN] [grpcclient/client.go:431] ["failed to verify node session"] [error="session not found->querynode-1651: node not found[node=1651]"] [errorVerbose="session not found->querynode-1651: node not found[node=1651]\n(1) attached stack trace\n  -- stack trace:\n  \| github.com/milvus-io/milvus/pkg/util/merr.WrapErrNodeNotFound\n  \| \t/go/src/github.com/milvus-io/milvus/pkg/util/merr/utils.go:684\n  \| github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).verifySession\n  \| \t/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:383\n  \| github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).checkNodeSessionExist\n  \| \t/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:429\n  \| github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).call.func2\n  \| \t/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:464\n  \| github.com/milvus-io/milvus/pkg/util/retry.Do\n  \| \t/go/src/github.com/milvus-io/milvus/pkg/util/retry/retry.go:44\n  \| github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).call\n  \| \t/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:462\n  \| github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).Call\n  \| \t/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:548\n  \| github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).ReCall\n  \| \t/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:564\n  \| github.com/milvus-io/milvus/internal/distributed/querynode/client.wrapGrpcCall[...]\n  \| \t/go/src/github.com/milvus-io/milvus/internal/distributed/querynode/client/client.go:87\n  \| github.com/milvus-io/milvus/internal/distributed/querynode/client.(*Client).Delete\n  \| \t/go/src/github.com/milvus-io/milvus/internal/distributed/querynode/client/client.go:328\n  \| github.com/milvus-io/milvus/internal/querynodev2/cluster.(*remoteWorker).Delete\n  \| \t/go/src/github.com/milvus-io/milvus/internal/querynodev2/cluster/worker.go:108\n  \| github.com/milvus-io/milvus/internal/querynodev2/delegator.(*shardDelegator).applyDelete.func1\n  \| \t/go/src/github.com/milvus-io/milvus/internal/querynodev2/delegator/delegator_data.go:263\n  \| github.com/milvus-io/milvus/pkg/util/retry.Do\n  \| \t/go/src/github.com/milvus-io/milvus/pkg/util/retry/retry.go:44\n  \| github.com/milvus-io/milvus/internal/querynodev2/delegator.(*shardDelegator).applyDelete\n  \| \t/go/src/github.com/milvus-io/milvus/internal/querynodev2/delegator/delegator_data.go:258\n  \| github.com/milvus-io/milvus/internal/querynodev2/delegator.(*shardDelegator).ProcessDelete.func1\n  \| \t/go/src/github.com/milvus-io/milvus/internal/querynodev2/delegator/delegator_data.go:212\n  \| golang.org/x/sync/errgroup.(*Group).Go.func1\n  \| \t/go/pkg/mod/golang.org/x/sync@v0.3.0/errgroup/errgroup.go:75\n  \| runtime.goexit\n  \| \t/usr/local/go/src/runtime/asm_amd64.s:1598\nWraps: (2) session not found->querynode-1651\nWraps: (3) node not found[node=1651]\nError types: (1) *withstack.withStack (2) *errutil.withPrefix (3) merr.milvusError"]

The "session not found" error is then converted into a "node not found" error, and an unrecoverable error is raised.

xiaofan-luan commented 7 months ago

  1. We didn't handle etcd correctly.
  2. We misused the request ctx rather than the server ctx when creating the client.
  3. When a retry fails, we should not throw an unrecoverable error. Simply throwing a "do not retry" error and breaking the retry loop is fine; this is extremely useful under some chaos tests. (See the sketch below.)
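
A minimal, self-contained Go sketch of that distinction (illustrative only, not the Milvus retry package; errDoNotRetry and doWithRetry are made-up names): the loop stops retrying as soon as it sees the sentinel, but the caller still receives an ordinary error rather than one escalated to "unrecoverable".

package main

import (
    "errors"
    "fmt"
)

// errDoNotRetry is a sentinel that only tells the retry loop to stop early;
// callers just see a normal error they can handle however they like.
var errDoNotRetry = errors.New("do not retry")

// doWithRetry runs fn up to attempts times, stopping early when fn wraps errDoNotRetry.
func doWithRetry(attempts int, fn func() error) error {
    var last error
    for i := 0; i < attempts; i++ {
        last = fn()
        if last == nil {
            return nil
        }
        if errors.Is(last, errDoNotRetry) {
            break // break the retry loop without marking the error unrecoverable
        }
    }
    return last
}

func main() {
    calls := 0
    err := doWithRetry(5, func() error {
        calls++
        // e.g. the target node's session is gone: retrying cannot help
        return fmt.Errorf("node session missing: %w", errDoNotRetry)
    })
    fmt.Println(calls, err) // prints: 1 node session missing: do not retry
}
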
chyezh commented 7 months ago

StepExecutor uses this unrecoverable error to decide whether to keep rescheduling the task, so it is currently difficult to decouple retry from StepExecutor by modifying only the grpc client layer.


func (s *stepStack) Execute(ctx context.Context) *stepStack {
...
        if !retry.IsRecoverable(err) {
            if !skipLog {
                log.Warn("failed to execute step, not able to reschedule", zap.Error(err), zap.String("step", todo.Desc()))
            }
            return nil
        }
...
}

Meanwhile, there is some unreasonable error handling in grpcclient, and we may need to reorganize the service layer's retry and reconnect behavior.

For example:

  1. Should context.Canceled or context.DeadlineExceeded really be retried? (One way to split this case is sketched after the function below.)
  2. Should ServerIDMismatched really be retried?
func (c *ClientBase[T]) checkGrpcErr(ctx context.Context, err error) (needRetry, needReset bool, retErr error) {
    log := log.Ctx(ctx).With(zap.String("clientRole", c.GetRole()))
    // Unknown err
    if !funcutil.IsGrpcErr(err) {
        log.Warn("fail to grpc call because of unknown error", zap.Error(err))
        return false, false, err
    }

    // grpc err
    log.Warn("call received grpc error", zap.Error(err))
    switch {
    case funcutil.IsGrpcErr(err, codes.Canceled, codes.DeadlineExceeded):
        // canceled or deadline exceeded
        return true, c.needResetCancel(), err
    case funcutil.IsGrpcErr(err, codes.Unimplemented):
        return false, false, merr.WrapErrServiceUnimplemented(err)
    case IsServerIDMismatchErr(err):
        if ok := c.checkNodeSessionExist(ctx); !ok {
            // if session doesn't exist, no need to retry for datanode/indexnode/querynode/proxy
            return false, false, err
        }
        return true, true, err
    case IsCrossClusterRoutingErr(err):
        return true, true, err
    default:
        return true, true, err
    }
}
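
On question 1, one possible way to split the two cases (a stdlib/grpc-go sketch under my own assumptions, not a proposed patch): retrying is pointless when the caller's own context is already done, but may still help when only this attempt was canceled or timed out downstream.

package retryhint

import (
    "context"

    "google.golang.org/grpc/codes"
    "google.golang.org/grpc/status"
)

// shouldRetryCtxErr is an illustrative helper (not Milvus code): a gRPC
// Canceled/DeadlineExceeded status is treated as retryable only when the
// caller's own context is still alive, i.e. the failure was per-attempt.
func shouldRetryCtxErr(ctx context.Context, err error) bool {
    if ctx.Err() != nil {
        // The caller has already canceled or run out of time; another
        // attempt would fail the same way, so do not retry.
        return false
    }
    switch status.Code(err) {
    case codes.Canceled, codes.DeadlineExceeded:
        // Only this attempt was canceled or timed out downstream; a retry
        // against a fresh connection may still succeed.
        return true
    default:
        return false
    }
}
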
ThreadDao commented 7 months ago

Fixed in 2.3.11.