Closed zhuwenxing closed 1 year ago
/assign @jaime0815 /unassign
/assign @weiliu1031
The balancer triggers a channel task:
[2023/11/13 15:07:39.793 +00:00] [INFO] [balance/utils.go:98] ["Create Channel task"] [collection=445614781037316785] [replica=445615151646769279] [channel=by-dev-rootcoord-dml_0_445614781037316785v0] [From=-1] [To=15]
The channel was watched successfully:
[2023/11/13 15:07:40.300 +00:00] [INFO] [querynode/impl.go:391] ["successfully watchDmChannelsTask"] [collectionID=445614781037316785] [nodeID=15] [channels="[by-dev-rootcoord-dml_0_445614781037316785v0]"]
However, the query to node 22 failed because the query shard does not exist:
[2023/11/13 15:07:44.738 +00:00] [ERROR] [querynode/shard_cluster.go:896] ["Query 22 failed, reason query shard(channel) by-dev-rootcoord-dml_0_445614781037316785v0 does not exist\n err %!w(<nil>)"] [stack="github.com/milvus-io/milvus/internal/querynode.(*ShardCluster).Query\n\t/go/src/github.com/milvus-io/milvus/internal/querynode/shard_cluster.go:896\ngithub.com/milvus-io/milvus/internal/querynode.(*QueryNode).queryWithDmlChannel\n\t/go/src/github.com/milvus-io/milvus/internal/querynode/impl.go:1158\ngithub.com/milvus-io/milvus/internal/querynode.(*QueryNode).Query.func1\n\t/go/src/github.com/milvus-io/milvus/internal/querynode/impl.go:1243\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\t/go/pkg/mod/golang.org/x/sync@v0.1.0/errgroup/errgroup.go:75"]
balancer triggers a channel-release task:
[2023/11/13 15:07:39.793 +00:00] [INFO] [balance/utils.go:98] ["Create Channel task"] [collection=445614781037316785] [replica=445615151646769279] [channel=by-dev-rootcoord-dml_0_445614781037316785v0] [From=-1] [To=15]
The channel was released successfully:
[2023/11/13 15:07:44.288 +00:00] [INFO] [querynode/impl.go:459] ["unsubDmChannelTask WaitToFinish done"] [traceID=fc78855fd009402] [collectionID=445614781037316785] [channel=by-dev-rootcoord-dml_0_445614781037316785v0]
This led to the query failure, because the query shard no longer exists:
[2023/11/13 15:07:44.738 +00:00] [ERROR] [querynode/shard_cluster.go:896] ["Query 22 failed, reason query shard(channel) by-dev-rootcoord-dml_0_445614781037316785v0 does not exist\n err %!w(<nil>)"] [stack="github.com/milvus-io/milvus/internal/querynode.(*ShardCluster).Query\n\t/go/src/github.com/milvus-io/milvus/internal/querynode/shard_cluster.go:896\ngithub.com/milvus-io/milvus/internal/querynode.(*QueryNode).queryWithDmlChannel\n\t/go/src/github.com/milvus-io/milvus/internal/querynode/impl.go:1158\ngithub.com/milvus-io/milvus/internal/querynode.(*QueryNode).Query.func1\n\t/go/src/github.com/milvus-io/milvus/internal/querynode/impl.go:1243\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\t/go/pkg/mod/golang.org/x/sync@v0.1.0/errgroup/errgroup.go:75"]
So why would the balancer trigger a channel-release task? @weiliu1031
The search/query failed because the channel had been unwatched, but this was not caused by a balance task; it is much more complicated than balancing.
The channel leaked after the collection was released, which is unexpected and is still under investigation.
The target node ID does not match: expected 15, actual 22 (which has not subscribed to the channel).
[2023/11/13 15:07:47.893 +00:00] [WARN] [proxy/task_query.go:404] ["invalid shard leaders cache, updating shardleader caches and retry query"] [traceID=31629654ae7227d] [error="code: UnexpectedError, error: fail to Query, QueryNode ID = 15, reason=Query 22 failed, reason query shard(channel) by-dev-rootcoord-dml_0_445614781037316785v0 does not exist\n err %!w(<nil>)"]
Node 22 is a worker that has a leaked channel.
Concurrent segment releases cause the query shard service to be released before the channel is released.
Fixed in v2.2.15.
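To illustrate the ordering problem described above, here is a minimal Go sketch in which the segment-release path leaves the query shard alone while the channel is still watched, so only the channel release removes it. This is an illustration under assumptions, not the actual v2.2.15 patch: `shardRegistry` and all of its methods are hypothetical names, not Milvus APIs.

```go
package main

import (
	"fmt"
	"sync"
)

// shardRegistry is a hypothetical stand-in for a node's query shard bookkeeping.
type shardRegistry struct {
	mu      sync.Mutex
	watched map[string]bool // channels the node still subscribes to
	shards  map[string]bool // channels that have a live query shard
}

func newShardRegistry() *shardRegistry {
	return &shardRegistry{watched: map[string]bool{}, shards: map[string]bool{}}
}

// watchChannel subscribes to a channel and creates its query shard.
func (r *shardRegistry) watchChannel(ch string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.watched[ch] = true
	r.shards[ch] = true
}

// releaseSegments models the segment-release path. It removes the query
// shard only if the channel is no longer watched.
func (r *shardRegistry) releaseSegments(ch string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.watched[ch] {
		return // channel still subscribed: keep the shard
	}
	delete(r.shards, ch)
}

// releaseChannel unsubscribes and removes the query shard together.
func (r *shardRegistry) releaseChannel(ch string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	delete(r.watched, ch)
	delete(r.shards, ch)
}

// query fails in the same way as the log above when the shard is missing.
func (r *shardRegistry) query(ch string) error {
	r.mu.Lock()
	defer r.mu.Unlock()
	if !r.shards[ch] {
		return fmt.Errorf("query shard(channel) %s does not exist", ch)
	}
	return nil
}

func main() {
	const ch = "by-dev-rootcoord-dml_0_445614781037316785v0"
	r := newShardRegistry()
	r.watchChannel(ch)

	// Several concurrent segment releases must not remove the shard early.
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			r.releaseSegments(ch)
		}()
	}
	wg.Wait()
	fmt.Println("after segment release:", r.query(ch))

	r.releaseChannel(ch)
	fmt.Println("after channel release:", r.query(ch))
}
```

Without the `watched` check in `releaseSegments`, the first concurrent segment release would delete the shard while the channel is still subscribed, which matches the "does not exist" failure seen on node 22.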
Is there an existing issue for this?
Environment
Current Behavior
Expected Behavior
No response
Steps To Reproduce
No response
Milvus Log
failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/deploy_test_kafka_for_release_cron/detail/deploy_test_kafka_for_release_cron/1575/pipeline
log: artifacts-kafka-cluster-upgrade-1575-server-second-deployment-logs.tar.gz artifacts-kafka-cluster-upgrade-1575-server-first-deployment-logs.tar.gz
Anything else?
collection name: deploy_test_index_type_BIN_IVF_FLAT_is_compacted_is_compacted_segment_status_only_growing_is_string_indexed_is_string_indexed_replica_number_2_is_deleted_is_deleted_data_size_3000