The delegator on laion1b-test-2-milvus-querynode-1-5956458d77-jpf7x failed to apply a delete on segment 447902879710617858 at node laion1b-test-2-milvus-querynode-1-5956458d77-4bntq, which made the delegator's distribution view unavailable.

Error: Node Not Found.
```
[2024/02/23 08:19:15.124 +00:00] [DEBUG] [delegator/delegator_data.go:257] ["delegator plan to applyDelete via worker"] [collectionID=447902879639453431] [channel=laion1b-test-2-rootcoord-dml_5_447902879639453431v1] [replicaID=447902888139161603] [segmentID=447902879710617858] [workerID=1651]
[2024/02/23 08:19:15.327 +00:00] [WARN] [delegator/delegator_data.go:274] ["try to delete data on non-exist node"] [collectionID=447902879639453431] [channel=laion1b-test-2-rootcoord-dml_5_447902879639453431v1] [replicaID=447902888139161603] [segmentID=447902879710617858] [workerID=1651]
[2024/02/23 08:19:15.327 +00:00] [WARN] [delegator/delegator_data.go:288] ["apply delete for segment failed, marking it offline"] [collectionID=447902879639453431] [channel=laion1b-test-2-rootcoord-dml_5_447902879639453431v1] [replicaID=447902888139161603] [segmentID=447902879710617858] [workerID=1651]
[2024/02/23 08:19:15.368 +00:00] [WARN] [delegator/delegator_data.go:238] ["failed to apply delete, mark segment offline"] [collectionID=447902879639453431] [channel=laion1b-test-2-rootcoord-dml_5_447902879639453431v1] [replicaID=447902888139161603] [offlineSegments="[447902879710617858]"]
```
Session not found.
```
[2024/02/23 08:19:15.326 +00:00] [WARN] [cluster/worker.go:114] ["failed to call Delete, worker return error"] [workerID=1651] [error="stack trace: /go/src/github.com/milvus-io/milvus/pkg/tracer/stack_trace.go:51 github.com/milvus-io/milvus/pkg/tracer.StackTrace\n/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:550 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).Call\n/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:564 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).ReCall\n/go/src/github.com/milvus-io/milvus/internal/distributed/querynode/client/client.go:87 github.com/milvus-io/milvus/internal/distributed/querynode/client.wrapGrpcCall[...]\n/go/src/github.com/milvus-io/milvus/internal/distributed/querynode/client/client.go:328 github.com/milvus-io/milvus/internal/distributed/querynode/client.(*Client).Delete\n/go/src/github.com/milvus-io/milvus/internal/querynodev2/cluster/worker.go:108 github.com/milvus-io/milvus/internal/querynodev2/cluster.(*remoteWorker).Delete\n/go/src/github.com/milvus-io/milvus/internal/querynodev2/delegator/delegator_data.go:263 github.com/milvus-io/milvus/internal/querynodev2/delegator.(*shardDelegator).applyDelete.func1\n/go/src/github.com/milvus-io/milvus/pkg/util/retry/retry.go:44 github.com/milvus-io/milvus/pkg/util/retry.Do\n/go/src/github.com/milvus-io/milvus/internal/querynodev2/delegator/delegator_data.go:258 github.com/milvus-io/milvus/internal/querynodev2/delegator.(*shardDelegator).applyDelete\n/go/src/github.com/milvus-io/milvus/internal/querynodev2/delegator/delegator_data.go:212 github.com/milvus-io/milvus/internal/querynodev2/delegator.(*shardDelegator).ProcessDelete.func1: attempt #0: session not found->querynode-1651: node not found[node=1651]: unrecoverable error"] [errorVerbose="stack trace: /go/src/github.com/milvus-io/milvus/pkg/tracer/stack_trace.go:51 github.com/milvus-io/milvus/pkg/tracer.StackTrace: attempt #0: session not found->querynode-1651: node not found[node=1651]: unrecoverable error\n(1) attached stack trace\n -- stack trace:\n | github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).Call\n | \t/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:550\n | github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).ReCall\n | \t/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:564\n | github.com/milvus-io/milvus/internal/distributed/querynode/client.wrapGrpcCall[...]\n | \t/go/src/github.com/milvus-io/milvus/internal/distributed/querynode/client/client.go:87\n | github.com/milvus-io/milvus/internal/distributed/querynode/client.(*Client).Delete\n | \t/go/src/github.com/milvus-io/milvus/internal/distributed/querynode/client/client.go:328\n | github.com/milvus-io/milvus/internal/querynodev2/cluster.(*remoteWorker).Delete\n | \t/go/src/github.com/milvus-io/milvus/internal/querynodev2/cluster/worker.go:108\n | github.com/milvus-io/milvus/internal/querynodev2/delegator.(*shardDelegator).applyDelete.func1\n | \t/go/src/github.com/milvus-io/milvus/internal/querynodev2/delegator/delegator_data.go:263\n | github.com/milvus-io/milvus/pkg/util/retry.Do\n | \t/go/src/github.com/milvus-io/milvus/pkg/util/retry/retry.go:44\n | github.com/milvus-io/milvus/internal/querynodev2/delegator.(*shardDelegator).applyDelete\n | \t/go/src/github.com/milvus-io/milvus/internal/querynodev2/delegator/delegator_data.go:258\n | github.com/milvus-io/milvus/internal/querynodev2/delegator.(*shardDelegator).ProcessDelete.func1\n | \t/go/src/github.com/milvus-io/milvus/internal/querynodev2/delegator/delegator_data.go:212\n | golang.org/x/sync/errgroup.(*Group).Go.func1\n | \t/go/pkg/mod/golang.org/x/sync@v0.3.0/errgroup/errgroup.go:75\n | runtime.goexit\n | \t/usr/local/go/src/runtime/asm_amd64.s:1598\nWraps: (2) stack trace: /go/src/github.com/milvus-io/milvus/pkg/tracer/stack_trace.go:51 github.com/milvus-io/milvus/pkg/tracer.StackTrace\n | /go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:550 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).Call\n | /go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:564 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).ReCall\n | /go/src/github.com/milvus-io/milvus/internal/distributed/querynode/client/client.go:87 github.com/milvus-io/milvus/internal/distributed/querynode/client.wrapGrpcCall[...]\n | /go/src/github.com/milvus-io/milvus/internal/distributed/querynode/client/client.go:328 github.com/milvus-io/milvus/internal/distributed/querynode/client.(*Client).Delete\n | /go/src/github.com/milvus-io/milvus/internal/querynodev2/cluster/worker.go:108 github.com/milvus-io/milvus/internal/querynodev2/cluster.(*remoteWorker).Delete\n | /go/src/github.com/milvus-io/milvus/internal/querynodev2/delegator/delegator_data.go:263 github.com/milvus-io/milvus/internal/querynodev2/delegator.(*shardDelegator).applyDelete.func1\n | /go/src/github.com/milvus-io/milvus/pkg/util/retry/retry.go:44 github.com/milvus-io/milvus/pkg/util/retry.Do\n | /go/src/github.com/milvus-io/milvus/internal/querynodev2/delegator/delegator_data.go:258 github.com/milvus-io/milvus/internal/querynodev2/delegator.(*shardDelegator).applyDelete\n | /go/src/github.com/milvus-io/milvus/internal/querynodev2/delegator/delegator_data.go:212 github.com/milvus-io/milvus/internal/querynodev2/delegator.(*shardDelegator).ProcessDelete.func1\nWraps: (3) attempt #0: session not found->querynode-1651: node not found[node=1651]: unrecoverable error\nError types: (1) *withstack.withStack (2) *errutil.withPrefix (3) merr.multiErrors"]
```
A single failure then causes the shard to remain continuously unserviceable:
```
[2024/02/23 08:19:15.124 +00:00] [DEBUG] [delegator/delegator_data.go:257] ["delegator plan to applyDelete via worker"] [collectionID=447902879639453431] [channel=laion1b-test-2-rootcoord-dml_5_447902879639453431v1] [replicaID=447902888139161603] [segmentID=447902879710617858] [workerID=1651]
[2024/02/23 08:19:15.327 +00:00] [WARN] [delegator/delegator_data.go:288] ["apply delete for segment failed, marking it offline"] [collectionID=447902879639453431] [channel=laion1b-test-2-rootcoord-dml_5_447902879639453431v1] [replicaID=447902888139161603] [segmentID=447902879710617858] [workerID=1651]
[2024/02/23 08:19:15.327 +00:00] [DEBUG] [delegator/delegator_data.go:257] ["delegator plan to applyDelete via worker"] [collectionID=447902879639453431] [channel=laion1b-test-2-rootcoord-dml_5_447902879639453431v1] [replicaID=447902888139161603] [segmentID=447902879711838680] [workerID=1651]
[2024/02/23 08:19:15.330 +00:00] [DEBUG] [delegator/delegator_data.go:257] ["delegator plan to applyDelete via worker"] [collectionID=447902879639453431] [channel=laion1b-test-2-rootcoord-dml_5_447902879639453431v1] [replicaID=447902888139161603] [segmentID=447902879711326791] [workerID=1651]
[2024/02/23 08:19:15.332 +00:00] [DEBUG] [delegator/delegator_data.go:257] ["delegator plan to applyDelete via worker"] [collectionID=447902879639453431] [channel=laion1b-test-2-rootcoord-dml_5_447902879639453431v1] [replicaID=447902888139161603] [segmentID=447902879711851858] [workerID=1651]
[2024/02/23 08:19:15.333 +00:00] [DEBUG] [delegator/delegator_data.go:257] ["delegator plan to applyDelete via worker"] [collectionID=447902879639453431] [channel=laion1b-test-2-rootcoord-dml_5_447902879639453431v1] [replicaID=447902888139161603] [segmentID=447902879711326800] [workerID=1651]
```
The bug: QueryNode passes the RPC-level context into GetWorker.
```go
// GetWorker returns worker with specified nodeID.
func (m *grpcWorkerManager) GetWorker(ctx context.Context, nodeID int64) (Worker, error) {
	...
	worker, err = m.builder(ctx, nodeID)
	...
}
```
Building a gRPC client there may therefore capture the context of a previous gRPC call.
```go
func NewClient(ctx context.Context, addr string, nodeID int64) (*Client, error) {
	...
	sess := sessionutil.NewSession(ctx)
	...
}
```
sessionutil then uses this captured context for session operations such as verifySession. Once the gRPC client is disconnected from the server, every subsequent reconnect operation breaks down with a context.Canceled error.
```go
func (c *ClientBase[T]) verifySession(ctx context.Context) error {
	...
	sessions, _, getSessionErr := c.sess.GetSessions(c.GetRole())
	...
}
```
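The failure mode can be reproduced in miniature. The sketch below is illustrative only — longLivedClient, getWorker, and verifySession here are stand-ins, not Milvus's actual types — but it shows why a cached client that captures the first caller's RPC-scoped context keeps failing with context.Canceled forever after:

```go
package main

import (
	"context"
	"fmt"
)

// longLivedClient stands in for the cached gRPC client/session pair.
// It wrongly stores the context it was built with.
type longLivedClient struct {
	ctx context.Context // captured RPC-level context: the bug
}

// verifySession stands in for session lookups done on reconnect.
func (c *longLivedClient) verifySession() error {
	// Every operation bound to the captured context fails once the
	// originating RPC has returned and its context was canceled.
	if err := c.ctx.Err(); err != nil {
		return fmt.Errorf("fail to get session: %w", err)
	}
	return nil
}

var cached *longLivedClient

// getWorker mimics GetWorker: it builds the client once, with the
// caller's (RPC-scoped) context, and caches it for reuse.
func getWorker(ctx context.Context) *longLivedClient {
	if cached == nil {
		cached = &longLivedClient{ctx: ctx}
	}
	return cached
}

func handleRPC() {
	// Each RPC gets a context that is canceled when the RPC returns.
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	getWorker(ctx) // first call captures this RPC's context
}

func main() {
	handleRPC() // the RPC finishes; its context is now canceled

	// A later reconnect/verify path reuses the cached client.
	err := getWorker(context.Background()).verifySession()
	fmt.Println(err) // fail to get session: context canceled
}
```

Once handleRPC returns, cancel fires, and every later call through the cached client observes the dead context — exactly the "fail to get session ... context canceled" seen in the logs below.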
Logs like the following can be found:
```
[2024/02/23 08:19:15.324 +00:00] [WARN] [grpcclient/client.go:449] ["fail to get grpc client"] [client_role=querynode-1651] [error="failed to connect 10.104.32.44:21123, reason: context deadline exceeded"]
[2024/02/23 08:19:15.325 +00:00] [WARN] [grpcclient/client.go:373] ["fail to get session"] [clientRole=querynode-1651] [error="context canceled"]
[2024/02/23 08:19:15.325 +00:00] [WARN] [grpcclient/client.go:431] ["failed to verify node session"] [error="session not found->querynode-1651: node not found[node=1651]"] [errorVerbose="session not found->querynode-1651: node not found[node=1651]\n(1) attached stack trace\n -- stack trace:\n | github.com/milvus-io/milvus/pkg/util/merr.WrapErrNodeNotFound\n | \t/go/src/github.com/milvus-io/milvus/pkg/util/merr/utils.go:684\n | github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).verifySession\n | \t/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:383\n | github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).checkNodeSessionExist\n | \t/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:429\n | github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).call.func2\n | \t/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:464\n | github.com/milvus-io/milvus/pkg/util/retry.Do\n | \t/go/src/github.com/milvus-io/milvus/pkg/util/retry/retry.go:44\n | github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).call\n | \t/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:462\n | github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).Call\n | \t/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:548\n | github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).ReCall\n | \t/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:564\n | github.com/milvus-io/milvus/internal/distributed/querynode/client.wrapGrpcCall[...]\n | \t/go/src/github.com/milvus-io/milvus/internal/distributed/querynode/client/client.go:87\n | github.com/milvus-io/milvus/internal/distributed/querynode/client.(*Client).Delete\n | \t/go/src/github.com/milvus-io/milvus/internal/distributed/querynode/client/client.go:328\n | github.com/milvus-io/milvus/internal/querynodev2/cluster.(*remoteWorker).Delete\n | \t/go/src/github.com/milvus-io/milvus/internal/querynodev2/cluster/worker.go:108\n | github.com/milvus-io/milvus/internal/querynodev2/delegator.(*shardDelegator).applyDelete.func1\n | \t/go/src/github.com/milvus-io/milvus/internal/querynodev2/delegator/delegator_data.go:263\n | github.com/milvus-io/milvus/pkg/util/retry.Do\n | \t/go/src/github.com/milvus-io/milvus/pkg/util/retry/retry.go:44\n | github.com/milvus-io/milvus/internal/querynodev2/delegator.(*shardDelegator).applyDelete\n | \t/go/src/github.com/milvus-io/milvus/internal/querynodev2/delegator/delegator_data.go:258\n | github.com/milvus-io/milvus/internal/querynodev2/delegator.(*shardDelegator).ProcessDelete.func1\n | \t/go/src/github.com/milvus-io/milvus/internal/querynodev2/delegator/delegator_data.go:212\n | golang.org/x/sync/errgroup.(*Group).Go.func1\n | \t/go/pkg/mod/golang.org/x/sync@v0.3.0/errgroup/errgroup.go:75\n | runtime.goexit\n | \t/usr/local/go/src/runtime/asm_amd64.s:1598\nWraps: (2) session not found->querynode-1651\nWraps: (3) node not found[node=1651]\nError types: (1) *withstack.withStack (2) *errutil.withPrefix (3) merr.milvusError"]
```
The "session not found" error is then converted into a "node not found" error, and an unrecoverable error is produced. StepExecutor uses this unrecoverable error to decide whether to keep repeating the task, so it is currently difficult to decouple the retry behavior from StepExecutor by modifying only the gRPC client layer.
```go
func (s *stepStack) Execute(ctx context.Context) *stepStack {
	...
	if !retry.IsRecoverable(err) {
		if !skipLog {
			log.Warn("failed to execute step, not able to reschedule", zap.Error(err), zap.String("step", todo.Desc()))
		}
		return nil
	}
	...
}
```
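For reference, the "unrecoverable" marker that retry.IsRecoverable checks typically follows the wrap-and-inspect pattern below. This is a hedged sketch of the general technique, not a copy of Milvus's retry package:

```go
package retryutil

import "errors"

// unrecoverable marks an error that retry loops must not repeat.
type unrecoverable struct{ error }

// Unwrap lets errors.Is/errors.As reach the wrapped cause.
func (u unrecoverable) Unwrap() error { return u.error }

// Unrecoverable wraps err so that IsRecoverable reports false for it.
func Unrecoverable(err error) error {
	if err == nil {
		return nil
	}
	return unrecoverable{err}
}

// IsRecoverable reports whether err may still be retried.
func IsRecoverable(err error) bool {
	var u unrecoverable
	return !errors.As(err, &u)
}
```

Under this pattern, once "node not found" is tagged unrecoverable, both retry.Do and stepStack.Execute give up immediately instead of rescheduling.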
Meanwhile, there is some unreasonable error handling in grpcclient, and we may need to reorganize the service layer's retry and reconnect behavior. For example: should context.Canceled or context.DeadlineExceeded be retried? Should ServerIDMismatch errors be retried?

```go
func (c *ClientBase[T]) checkGrpcErr(ctx context.Context, err error) (needRetry, needReset bool, retErr error) {
	log := log.Ctx(ctx).With(zap.String("clientRole", c.GetRole()))
	// Unknown err
	if !funcutil.IsGrpcErr(err) {
		log.Warn("fail to grpc call because of unknown error", zap.Error(err))
		return false, false, err
	}

	// grpc err
	log.Warn("call received grpc error", zap.Error(err))
	switch {
	case funcutil.IsGrpcErr(err, codes.Canceled, codes.DeadlineExceeded):
		// canceled or deadline exceeded
		return true, c.needResetCancel(), err
	case funcutil.IsGrpcErr(err, codes.Unimplemented):
		return false, false, merr.WrapErrServiceUnimplemented(err)
	case IsServerIDMismatchErr(err):
		if ok := c.checkNodeSessionExist(ctx); !ok {
			// if session doesn't exist, no need to retry for datanode/indexnode/querynode/proxy
			return false, false, err
		}
		return true, true, err
	case IsCrossClusterRoutingErr(err):
		return true, true, err
	default:
		return true, true, err
	}
}
```
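The classification above hinges on reading the gRPC status code out of the error. Below is a minimal sketch of such a check; isGrpcErr is a hypothetical stand-in approximating what funcutil.IsGrpcErr appears to do, built on the standard google.golang.org/grpc/status package:

```go
package grpcerr

import (
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// isGrpcErr reports whether err carries a gRPC status and, when
// target codes are given, whether its code is one of them.
func isGrpcErr(err error, targets ...codes.Code) bool {
	if err == nil {
		return false
	}
	s, ok := status.FromError(err)
	if !ok {
		// Not a gRPC status error (e.g. a plain context.Canceled
		// returned before the call ever reached the transport).
		return false
	}
	if len(targets) == 0 {
		return true
	}
	for _, code := range targets {
		if s.Code() == code {
			return true
		}
	}
	return false
}
```

With that building block, the open questions above reduce to a policy table: which status codes map to which (needRetry, needReset) pair, and which should short-circuit to a terminal error.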
Fixed in 2.3.11.
Is there an existing issue for this?
Environment
Current Behavior

laion_stable_4 has about 60M rows of 768-dimensional data, and its schema is:

Expected Behavior
No response
Steps To Reproduce
No response
Milvus Log
Anything else?
No response