milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0
30.27k stars 2.91k forks source link

[Bug]: querynode restarted with error `Query Node disconnected from etcd, process will exit` when etcd works well #32593

Closed zhuwenxing closed 5 months ago

zhuwenxing commented 6 months ago

Is there an existing issue for this?

Environment

- Milvus version:master-20240424-4c7d3dad-amd64
- Deployment mode(standalone or cluster):cluster
- MQ type(rocksmq, pulsar or kafka):pulsar    
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

image

[2024/04/24 17:57:34.674 +00:00] [DEBUG] [querynodev2/services.go:70] ["QueryNode current state"] [NodeID=13] [StateCode=Healthy]
[2024/04/24 17:57:34.675 +00:00] [DEBUG] [querynodev2/services.go:70] ["QueryNode current state"] [NodeID=13] [StateCode=Healthy]
[2024/04/24 17:57:34.674 +00:00] [DEBUG] [querynodev2/services.go:70] ["QueryNode current state"] [NodeID=13] [StateCode=Healthy]
[2024/04/24 17:57:34.674 +00:00] [WARN] [sessionutil/session_util.go:530] ["session keepalive channel closed"]
[2024/04/24 17:57:34.677 +00:00] [INFO] [querynodev2/services.go:1277] ["sync action"] [traceID=e2320380d9c1b555cdf294c330d6a56d] [collectionID=449309362374713171] [channel=by-dev-rootcoord-dml_2_449309362374713171v0] [currentNodeID=13] [Action=UpdateVersion] [TargetVersion=1713981420488608003]
[2024/04/24 17:57:34.677 +00:00] [INFO] [delegator/distribution.go:299] ["Update readable segment version"] [oldVersion=1713981410547297082] [newVersion=1713981420488608003] [growingSegmentNum=0] [sealedSegmentNum=4]
[2024/04/24 17:57:34.677 +00:00] [WARN] [grpclog/grpclog.go:46] ["[core][Server #6] grpc: Server.processUnaryRPC failed to write status: connection error: desc = \"transport is closing\""]
[2024/04/24 17:57:34.678 +00:00] [DEBUG] [config/etcd_source.go:146] ["etcd refreshConfigurations"] [prefix=by-dev/config] [endpoints="[pulsar-pod-kill-13671-etcd:2379]"]
[2024/04/24 17:57:34.678 +00:00] [INFO] [querynodev2/services.go:1277] ["sync action"] [traceID=ad1b02e3c3165c5e11f9cc54a785cff7] [collectionID=449309362374713026] [channel=by-dev-rootcoord-dml_1_449309362374713026v0] [currentNodeID=13] [Action=UpdateVersion] [TargetVersion=1713981420488485473]
[2024/04/24 17:57:34.688 +00:00] [INFO] [delegator/distribution.go:299] ["Update readable segment version"] [oldVersion=1713981410541177780] [newVersion=1713981420488485473] [growingSegmentNum=0] [sealedSegmentNum=4]
[2024/04/24 17:57:34.688 +00:00] [WARN] [grpclog/grpclog.go:46] ["[core][Server #6] grpc: Server.processUnaryRPC failed to write status: connection error: desc = \"transport is closing\""]
SIGPIPE Swallowed
[2024/04/24 17:57:34.688 +00:00] [WARN] [sessionutil/session_util.go:882] ["connection lost detected, shuting down"]
[2024/04/24 17:57:34.689 +00:00] [INFO] [querynodev2/services.go:1277] ["sync action"] [traceID=ad0b693a093074367275c8bbe64c72ef] [collectionID=449309362375116281] [channel=by-dev-rootcoord-dml_11_449309362375116281v0] [currentNodeID=13] [Action=UpdateVersion] [TargetVersion=1713981420490535443]
[2024/04/24 17:57:34.689 +00:00] [INFO] [delegator/distribution.go:299] ["Update readable segment version"] [oldVersion=1713981410560201875] [newVersion=1713981420490535443] [growingSegmentNum=0] [sealedSegmentNum=3]
[2024/04/24 17:57:34.689 +00:00] [WARN] [grpclog/grpclog.go:46] ["[core][Server #6] grpc: Server.processUnaryRPC failed to write status: connection error: desc = \"transport is closing\""]
[2024/04/24 17:57:34.690 +00:00] [WARN] [grpclog/grpclog.go:46] ["[core][Server #6] grpc: Server.processUnaryRPC failed to write status: connection error: desc = \"transport is closing\""]
[2024/04/24 17:57:34.690 +00:00] [WARN] [grpclog/grpclog.go:46] ["[core][Server #6] grpc: Server.processUnaryRPC failed to write status: connection error: desc = \"transport is closing\""]
[2024/04/24 17:57:34.690 +00:00] [DEBUG] [querynodev2/services.go:70] ["QueryNode current state"] [NodeID=13] [StateCode=Healthy]
[2024/04/24 17:57:34.690 +00:00] [INFO] [querynodev2/services.go:1277] ["sync action"] [traceID=92b4afb6c394eee0fee7d4dcc1fbc01e] [collectionID=449309362374914020] [channel=by-dev-rootcoord-dml_1_449309362374914020v0] [currentNodeID=13] [Action=UpdateVersion] [TargetVersion=1713981420490613013]
[2024/04/24 17:57:34.690 +00:00] [WARN] [storage/remote_chunk_manager.go:179] ["failed to read object"] [path=file/delta_log/449309362374913458/449309362374913459/449309362374721454/449309362374723055] [error="IO failed[key=file/delta_log/449309362374913458/449309362374913459/449309362374721454/449309362374723055]: Get \"http://pulsar-pod-kill-13671-minio:9000/milvus-bucket/file/delta_log/449309362374913458/449309362374913459/449309362374721454/449309362374723055\": context canceled"]
[2024/04/24 17:57:34.690 +00:00] [ERROR] [querynodev2/server.go:170] ["Query Node disconnected from etcd, process will exit"] ["Server Id"=13] [stack="github.com/milvus-io/milvus/internal/querynodev2.(*QueryNode).Register.func1\n\t/go/src/github.com/milvus-io/milvus/internal/querynodev2/server.go:170"]
[2024/04/24 17:57:34.690 +00:00] [WARN] [retry/retry.go:46] ["retry func failed"] [traceID=84856f96b65fbb3f43dd582405a118fb] [retried=0] [error="IO failed[key=file/delta_log/449309362374913458/449309362374913459/449309362374721454/449309362374723055]: Get \"http://pulsar-pod-kill-13671-minio:9000/milvus-bucket/file/delta_log/449309362374913458/449309362374913459/449309362374721454/449309362374723055\": context canceled"]
[2024/04/24 17:57:34.690 +00:00] [INFO] [delegator/distribution.go:299] ["Update readable segment version"] [oldVersion=1713981410554417353] [newVersion=1713981420490613013] [growingSegmentNum=1] [sealedSegmentNum=3]
[2024/04/24 17:57:34.690 +00:00] [DEBUG] [querynodev2/services.go:70] ["QueryNode current state"] [NodeID=13] [StateCode=Healthy]
[2024/04/24 17:57:34.690 +00:00] [INFO] [querynodev2/services.go:1277] ["sync action"] [traceID=197a39ac5111b6da964467a0276f17e3] [collectionID=449309362374915589] [channel=by-dev-rootcoord-dml_8_449309362374915589v1] [currentNodeID=13] [Action=UpdateVersion] [TargetVersion=1713981420492724778]
[2024/04/24 17:57:34.691 +00:00] [WARN] [grpclog/grpclog.go:46] ["[core][Server #6] grpc: Server.processUnaryRPC failed to write status: connection error: desc = \"transport is closing\""]
[2024/04/24 17:57:34.691 +00:00] [DEBUG] [querynodev2/services.go:70] ["QueryNode current state"] [NodeID=13] [StateCode=Healthy]

Perhaps it was due to the restart of the querynode that caused a search failure.

[2024-04-24T18:00:53.337Z] [2024-04-24 17:56:53 - DEBUG - ci_test]: (api_response) : <Collection>:

[2024-04-24T18:00:53.337Z] -------------

[2024-04-24T18:00:53.337Z] <name>: Checker__oI94luhb

[2024-04-24T18:00:53.337Z] <description>: 

[2024-04-24T18:00:53.337Z] <schema>: {'auto_id': False, 'description': '', 'fields': [{'name': 'int64', 'description': '', 'type': <DataType.INT64: 5>, 'is_primary': True, 'auto_id': False}, {'name': 'float', 'description': '', 'type': <DataType.FLOAT: 10>}......  (api_request.py:37)

[2024-04-24T18:00:53.337Z] [2024-04-24 17:56:53 - DEBUG - ci_test]: (api_request)  : [Collection.flush] args: [], kwargs: {'timeout': 180} (api_request.py:62)

[2024-04-24T18:00:53.337Z] [2024-04-24 17:56:55 - DEBUG - ci_test]: (api_response) : None  (api_request.py:37)

[2024-04-24T18:00:53.337Z] [2024-04-24 17:56:55 - DEBUG - ci_test]: (api_request)  : [Collection.compact] args: [180], kwargs: {} (api_request.py:62)

[2024-04-24T18:00:53.337Z] [2024-04-24 17:56:55 - DEBUG - ci_test]: (api_response) : None  (api_request.py:37)

[2024-04-24T18:00:53.337Z] [2024-04-24 17:56:56 - DEBUG - ci_test]: (api_request)  : [Collection.flush] args: [], kwargs: {'timeout': 180} (api_request.py:62)

[2024-04-24T18:00:53.337Z] [2024-04-24 17:56:56 - DEBUG - ci_test]: (api_response) : None  (api_request.py:37)

[2024-04-24T18:00:53.337Z] [2024-04-24 17:56:56 - INFO - ci_test]: assert create collection: 0.014584064483642578, init_entities: 300000 (test_all_collections_after_chaos.py:49)

[2024-04-24T18:00:53.337Z] [2024-04-24 17:56:59 - DEBUG - ci_test]: (api_request)  : [Collection.insert] args: [[[-3000, -2999, -2998, -2997, -2996, -2995, -2994, -2993, -2992, -2991, -2990, -2989, -2988, -2987, -2986, -2985, -2984, -2983, -2982, -2981, -2980, -2979, -2978, -2977, -2976, -2975, -2974, -2973, -2972, -2971, -2970, -2969, -2968, -2967, -2966, -2965, -2964, -2963, -2962, -2961, -2960, -2959, -29......, kwargs: {'timeout': 180} (api_request.py:62)

[2024-04-24T18:00:53.337Z] [2024-04-24 17:57:00 - DEBUG - ci_test]: (api_response) : (insert count: 2000, delete count: 0, upsert count: 0, timestamp: 449309945384402945, success count: 2000, err count: 0)  (api_request.py:37)

[2024-04-24T18:00:53.337Z] [2024-04-24 17:57:00 - INFO - ci_test]: assert insert: 1.3706943988800049 (test_all_collections_after_chaos.py:57)

[2024-04-24T18:00:53.337Z] [2024-04-24 17:57:00 - DEBUG - ci_test]: (api_request)  : [Collection.flush] args: [], kwargs: {'timeout': 180} (api_request.py:62)

[2024-04-24T18:00:53.337Z] [2024-04-24 17:57:03 - DEBUG - ci_test]: (api_response) : None  (api_request.py:37)

[2024-04-24T18:00:53.337Z] [2024-04-24 17:57:03 - DEBUG - ci_test]: (api_request)  : [Collection.flush] args: [], kwargs: {'timeout': 180} (api_request.py:62)

[2024-04-24T18:00:53.337Z] [2024-04-24 17:57:03 - DEBUG - ci_test]: (api_response) : None  (api_request.py:37)

[2024-04-24T18:00:53.337Z] [2024-04-24 17:57:03 - INFO - ci_test]: assert flush: 3.032050132751465, entities: 302000 (test_all_collections_after_chaos.py:67)

[2024-04-24T18:00:53.337Z] [2024-04-24 17:57:03 - INFO - ci_test]: index info: [{'collection': 'Checker__oI94luhb', 'field': 'text_emb', 'index_name': 'text_emb', 'index_param': {'index_type': 'HNSW', 'metric_type': 'L2', 'params': {'M': 48, 'efConstruction': 500}}}, {'collection': 'Checker__oI94luhb', 'field': 'voice_emb', 'index_name': 'voice_emb', 'index_param': {'index_type': 'HNSW', 'metric_type': 'L2', 'params': {'M': 48, 'efConstruction': 500}}}, {'collection': 'Checker__oI94luhb', 'field': 'int64', 'index_name': 'int64', 'index_param': {'index_type': 'INVERTED'}}, {'collection': 'Checker__oI94luhb', 'field': 'float', 'index_name': 'float', 'index_param': {'index_type': 'INVERTED'}}, {'collection': 'Checker__oI94luhb', 'field': 'varchar', 'index_name': 'varchar', 'index_param': {'index_type': 'INVERTED'}}, {'collection': 'Checker__oI94luhb', 'field': 'float_vector', 'index_name': 'float_vector', 'index_param': {'index_type': 'HNSW', 'metric_type': 'L2', 'params': {'M': 48, 'efConstruction': 500}}}, {'collection': 'Checker__oI94luhb', 'field': 'image_emb', 'index_name': 'image_emb', 'index_param': {'index_type': 'HNSW', 'metric_type': 'L2', 'params': {'M': 48, 'efConstruction': 500}}}] (test_all_collections_after_chaos.py:71)

[2024-04-24T18:00:53.338Z] [2024-04-24 17:57:03 - INFO - ci_test]: index info: [{'collection': 'Checker__oI94luhb', 'field': 'image_emb', 'index_name': 'image_emb', 'index_param': {'index_type': 'HNSW', 'metric_type': 'L2', 'params': {'M': 48, 'efConstruction': 500}}}, {'collection': 'Checker__oI94luhb', 'field': 'text_emb', 'index_name': 'text_emb', 'index_param': {'index_type': 'HNSW', 'metric_type': 'L2', 'params': {'M': 48, 'efConstruction': 500}}}, {'collection': 'Checker__oI94luhb', 'field': 'voice_emb', 'index_name': 'voice_emb', 'index_param': {'index_type': 'HNSW', 'metric_type': 'L2', 'params': {'M': 48, 'efConstruction': 500}}}, {'collection': 'Checker__oI94luhb', 'field': 'int64', 'index_name': 'int64', 'index_param': {'index_type': 'INVERTED'}}, {'collection': 'Checker__oI94luhb', 'field': 'float', 'index_name': 'float', 'index_param': {'index_type': 'INVERTED'}}, {'collection': 'Checker__oI94luhb', 'field': 'varchar', 'index_name': 'varchar', 'index_param': {'index_type': 'INVERTED'}}, {'collection': 'Checker__oI94luhb', 'field': 'float_vector', 'index_name': 'float_vector', 'index_param': {'index_type': 'HNSW', 'metric_type': 'L2', 'params': {'M': 48, 'efConstruction': 500}}}] (test_all_collections_after_chaos.py:86)

[2024-04-24T18:00:53.338Z] [2024-04-24 17:57:03 - DEBUG - ci_test]: (api_request)  : [Collection.load] args: [None, 1, 180], kwargs: {} (api_request.py:62)

[2024-04-24T18:00:53.338Z] [2024-04-24 17:57:03 - DEBUG - ci_test]: (api_response) : None  (api_request.py:37)

[2024-04-24T18:00:53.338Z] [2024-04-24 17:57:03 - DEBUG - ci_test]: (api_request)  : [Collection.search] args: [[[0.027614593030906867, 0.07137447116146449, 0.039864347297293205, 0.11206226861329231, 0.04861947114976575, 0.0800943730252056, 0.03057080320098264, 0.03201841173916029, 0.056693498505691255, 0.0737252762970744, 0.03802721014755368, 0.044079861488757005, 0.03367181227185206, 0.12353653482548768, 0......, kwargs: {} (api_request.py:62)

[2024-04-24T18:00:53.338Z] [2024-04-24 17:57:31 - ERROR - pymilvus.decorators]: RPC error: [search], <MilvusException: (code=901, message=stack trace: /go/src/github.com/milvus-io/milvus/pkg/tracer/stack_trace.go:51 github.com/milvus-io/milvus/pkg/tracer.StackTrace

[2024-04-24T18:00:53.338Z] /go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:556 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).Call

[2024-04-24T18:00:53.338Z] /go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:570 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).ReCall

[2024-04-24T18:00:53.338Z] /go/src/github.com/milvus-io/milvus/internal/distributed/querynode/client/client.go:90 github.com/milvus-io/milvus/internal/distributed/querynode/client.wrapGrpcCall[...]

[2024-04-24T18:00:53.338Z] /go/src/github.com/milvus-io/milvus/internal/distributed/querynode/client/client.go:202 github.com/milvus-io/milvus/internal/distributed/querynode/client.(*Client).Search

[2024-04-24T18:00:53.338Z] /go/src/github.com/milvus-io/milvus/internal/proxy/task_search.go:682 github.com/milvus-io/milvus/internal/proxy.(*searchTask).searchShard

[2024-04-24T18:00:53.338Z] /go/src/github.com/milvus-io/milvus/internal/proxy/lb_policy.go:180 github.com/milvus-io/milvus/internal/proxy.(*LBPolicyImpl).ExecuteWithRetry.func1

[2024-04-24T18:00:53.338Z] /go/src/github.com/milvus-io/milvus/pkg/util/retry/retry.go:44 github.com/milvus-io/milvus/pkg/util/retry.Do

[2024-04-24T18:00:53.338Z] /go/src/github.com/milvus-io/milvus/internal/proxy/lb_policy.go:154 github.com/milvus-io/milvus/internal/proxy.(*LBPolicyImpl).ExecuteWithRetry

[2024-04-24T18:00:53.338Z] /go/src/github.com/milvus-io/milvus/internal/proxy/lb_policy.go:213 github.com/milvus-io/milvus/internal/proxy.(*LBPolicyImpl).Execute.func2: node not found)>, <Time:{'RPC start': '2024-04-24 17:57:03.975172', 'RPC error': '2024-04-24 17:57:31.694577'}> (decorators.py:139)

[2024-04-24T18:00:53.338Z] [2024-04-24 17:57:31 - ERROR - ci_test]: Traceback (most recent call last):

[2024-04-24T18:00:53.338Z]   File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 32, in inner_wrapper

[2024-04-24T18:00:53.338Z]     res = func(*args, **_kwargs)

[2024-04-24T18:00:53.338Z]   File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 63, in api_request

[2024-04-24T18:00:53.338Z]     return func(*arg, **kwargs)

[2024-04-24T18:00:53.338Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/orm/collection.py", line 799, in search

[2024-04-24T18:00:53.338Z]     resp = conn.search(

[2024-04-24T18:00:53.338Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 140, in handler

[2024-04-24T18:00:53.338Z]     raise e from e

[2024-04-24T18:00:53.338Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 136, in handler

[2024-04-24T18:00:53.338Z]     return func(*args, **kwargs)

[2024-04-24T18:00:53.338Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 175, in handler

[2024-04-24T18:00:53.338Z]     return func(self, *args, **kwargs)

[2024-04-24T18:00:53.338Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 115, in handler

[2024-04-24T18:00:53.338Z]     raise e from e

[2024-04-24T18:00:53.338Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 86, in handler

[2024-04-24T18:00:53.338Z]     return func(*args, **kwargs)

[2024-04-24T18:00:53.338Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 798, in search

[2024-04-24T18:00:53.338Z]     return self._execute_search(request, timeout, round_decimal=round_decimal, **kwargs)

[2024-04-24T18:00:53.338Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 739, in _execute_search

[2024-04-24T18:00:53.338Z]     raise e from e

[2024-04-24T18:00:53.338Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 732, in _execute_search

[2024-04-24T18:00:53.338Z]     check_status(response.status)

[2024-04-24T18:00:53.338Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/utils.py", line 62, in check_status

[2024-04-24T18:00:53.338Z]     raise MilvusException(status.code, status.reason, status.error_code)

[2024-04-24T18:00:53.338Z] pymilvus.exceptions.MilvusException: <MilvusException: (code=901, message=stack trace: /go/src/github.com/milvus-io/milvus/pkg/tracer/stack_trace.go:51 github.com/milvus-io/milvus/pkg/tracer.StackTrace

[2024-04-24T18:00:53.338Z] /go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:556 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).Call

[2024-04-24T18:00:53.338Z] /go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:570 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).ReCall

[2024-04-24T18:00:53.338Z] /go/src/github.com/milvus-io/milvus/internal/distributed/querynode/client/client.go:90 github.com/milvus-io/milvus/internal/distributed/querynode/client.wrapGrpcCall[...]

[2024-04-24T18:00:53.338Z] /go/src/github.com/milvus-io/milvus/internal/distributed/querynode/client/client.go:202 github.com/milvus-io/milvus/internal/distributed/querynode/client.(*Client).Search

[2024-04-24T18:00:53.338Z] /go/src/github.com/milvus-io/milvus/internal/proxy/task_search.go:682 github.com/milvus-io/milvus/internal/proxy.(*searchTask).searchShard

[2024-04-24T18:00:53.338Z] /go/src/github.com/milvus-io/milvus/internal/proxy/lb_policy.go:180 github.com/milvus-io/milvus/internal/proxy.(*LBPolicyImpl).ExecuteWithRetry.func1

[2024-04-24T18:00:53.338Z] /go/src/github.com/milvus-io/milvus/pkg/util/retry/retry.go:44 github.com/milvus-io/milvus/pkg/util/retry.Do

[2024-04-24T18:00:53.338Z] /go/src/github.com/milvus-io/milvus/internal/proxy/lb_policy.go:154 github.com/milvus-io/milvus/internal/proxy.(*LBPolicyImpl).ExecuteWithRetry

[2024-04-24T18:00:53.338Z] /go/src/github.com/milvus-io/milvus/internal/proxy/lb_policy.go:213 github.com/milvus-io/milvus/internal/proxy.(*LBPolicyImpl).Execute.func2: node not found)>

[2024-04-24T18:00:53.338Z]  (api_request.py:45)

Expected Behavior

No response

Steps To Reproduce

No response

Milvus Log

failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/chaos-test-cron/detail/chaos-test-cron/13671/pipeline log: artifacts-pulsar-pod-kill-13671-server-logs.tar.gz

Anything else?

No response

zhuwenxing commented 6 months ago

another failed job:https://qa-jenkins.milvus.io/blue/organizations/jenkins/deploy_test_cron/detail/deploy_test_cron/2206/pipeline image

I20240424 09:49:01.114799  2813 time_recorder.cc:49] [KNOWHERE][PrintTimeRecord][milvus] Build index: done (108.956747 ms)
[2024/04/24 09:49:01.129 +00:00] [DEBUG] [delegator/delegator_data.go:158] ["insert into growing segment"] [collectionID=449302046281116611] [channel=by-dev-rootcoord-dml_12_449302046281116611v1] [replicaID=449302069506736136] [collectionID=449302046281116611] [segmentID=449302046285937186] [rowCount=1549] [maxTimestamp=449302263546511362]
I20240424 09:49:01.144289  2875 thread_pool.h:53] [KNOWHERE][operator()][knowhere_build1] Successfully set priority of knowhere thread.
[2024/04/24 09:49:01.152 +00:00] [DEBUG] [pipeline/filter_node.go:90] ["filter invalid message"] ["message type"=DropCollection] [channel=by-dev-rootcoord-dml_11_449302046284927885v0] [collectionID=449302046284927885] [error="invalid parameter[expected=msgType is Insert or Delete][actual=not]"]
[2024/04/24 09:49:01.152 +00:00] [DEBUG] [pipeline/filter_node.go:90] ["filter invalid message"] ["message type"=DropCollection] [channel=by-dev-rootcoord-dml_11_449302046284324584v0] [collectionID=449302046284324584] [error="invalid parameter[expected=msgType is Insert or Delete][actual=not]"]
[2024/04/24 09:49:01.152 +00:00] [DEBUG] [pipeline/filter_node.go:90] ["filter invalid message"] ["message type"=DropCollection] [channel=by-dev-rootcoord-dml_11_449302046284123108v0] [collectionID=449302046284123108] [error="invalid parameter[expected=msgType is Insert or Delete][actual=not]"]
[2024/04/24 09:49:01.152 +00:00] [DEBUG] [pipeline/filter_node.go:90] ["filter invalid message"] ["message type"=CreateCollection] [channel=by-dev-rootcoord-dml_11_449302046284324584v0] [collectionID=449302046284324584] [error="invalid parameter[expected=msgType is Insert or Delete][actual=not]"]
[2024/04/24 09:49:01.152 +00:00] [DEBUG] [pipeline/filter_node.go:90] ["filter invalid message"] ["message type"=CreateCollection] [channel=by-dev-rootcoord-dml_11_449302046284123108v0] [collectionID=449302046284123108] [error="invalid parameter[expected=msgType is Insert or Delete][actual=not]"]
[2024/04/24 09:49:01.152 +00:00] [DEBUG] [pipeline/filter_node.go:90] ["filter invalid message"] ["message type"=DropCollection] [channel=by-dev-rootcoord-dml_11_449302046283721085v0] [collectionID=449302046283721085] [error="invalid parameter[expected=msgType is Insert or Delete][actual=not]"]
[2024/04/24 09:49:01.152 +00:00] [DEBUG] [pipeline/filter_node.go:90] ["filter invalid message"] ["message type"=DropCollection] [channel=by-dev-rootcoord-dml_11_449302046283720200v0] [collectionID=449302046283720200] [error="invalid parameter[expected=msgType is Insert or Delete][actual=not]"]
[2024/04/24 09:49:01.152 +00:00] [DEBUG] [pipeline/filter_node.go:90] ["filter invalid message"] ["message type"=CreateCollection] [channel=by-dev-rootcoord-dml_11_449302046283721085v0] [collectionID=449302046283721085] [error="invalid parameter[expected=msgType is Insert or Delete][actual=not]"]
[2024/04/24 09:49:01.152 +00:00] [DEBUG] [pipeline/filter_node.go:90] ["filter invalid message"] ["message type"=CreateCollection] [channel=by-dev-rootcoord-dml_11_449302046283720200v0] [collectionID=449302046283720200] [error="invalid parameter[expected=msgType is Insert or Delete][actual=not]"]
[2024/04/24 09:49:01.152 +00:00] [DEBUG] [pipeline/filter_node.go:90] ["filter invalid message"] ["message type"=CreateCollection] [channel=by-dev-rootcoord-dml_11_449302046284927885v0] [collectionID=449302046284927885] [error="invalid parameter[expected=msgType is Insert or Delete][actual=not]"]
I20240424 09:49:01.159169  2814 time_recorder.cc:49] [KNOWHERE][PrintTimeRecord][milvus] Build index: done (146.126908 ms)
I20240424 09:49:01.178300  2849 time_recorder.cc:49] [KNOWHERE][PrintTimeRecord][milvus] Build index: done (110.236858 ms)
I20240424 09:49:01.218554  2813 time_recorder.cc:49] [KNOWHERE][PrintTimeRecord][milvus] Build index: done (98.580036 ms)
[2024/04/24 09:49:01.242 +00:00] [WARN] [sessionutil/session_util.go:553] ["fail to retry keepAliveOnce"] [serverName=querynode] [LeaseID=9064777446526199336] [error="etcdserver: requested lease not found"]
[2024/04/24 09:49:01.242 +00:00] [WARN] [sessionutil/session_util.go:882] ["connection lost detected, shuting down"]
[2024/04/24 09:49:01.242 +00:00] [ERROR] [querynodev2/server.go:170] ["Query Node disconnected from etcd, process will exit"] ["Server Id"=7] [stack="github.com/milvus-io/milvus/internal/querynodev2.(*QueryNode).Register.func1\n\t/go/src/github.com/milvus-io/milvus/internal/querynodev2/server.go:170"]
xiaofan-luan commented 6 months ago

I do wan't do a big modification on cgo. to make sure each go pthread is never blocked

yanliang567 commented 6 months ago

/assign @congqixia /unassign

xiaofan-luan commented 6 months ago

@zhuwenxing where can we see this log? "requested lease not found" checked the artifacts but didn't see a clue

congqixia commented 6 months ago

image Right before the etcd lease expire, the abnormal thread burst was observed: Go thread image Container thread image

congqixia commented 6 months ago

With the help from @cqy123456, we confirmed that the KNOWHERE search/build pool size was very high: image Build pool size = 32, and search pool size = 256 However the pod shared the node with many other service and cannot use actually 64 cores image image 288 threads of cpp make lease keep alive almost impossible to do

@zhuwenxing let's add limit & request for these chaos pods and retest /assign @zhuwenxing

zhuwenxing commented 5 months ago

Not reproduced for a long time.