milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0

[Bug]: Search failed with error `ShardNotAvailable` after querynode pod kill chaos test #22207

Closed zhuwenxing closed 1 year ago

zhuwenxing commented 1 year ago

Is there an existing issue for this?

Environment

- Milvus version:2.2.0-20230215-cad39ebf
- Deployment mode(standalone or cluster):cluster
- MQ type(rocksmq, pulsar or kafka):both    
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

[2023-02-16T02:46:19.867Z] [2023-02-16 02:46:12 - DEBUG - ci_test]: (api_request)  : [Collection.load] args: [None, 1, 120], kwargs: {} (api_request.py:56)
[2023-02-16T02:46:19.867Z] [2023-02-16 02:46:12 - DEBUG - ci_test]: (api_response) : None  (api_request.py:31)
[2023-02-16T02:46:19.867Z] [2023-02-16 02:46:12 - INFO - ci_test]: [test][2023-02-16T02:46:12Z] [0.02754334s] SearchChecker__3xjaw1he load -> None (wrapper.py:30)
[2023-02-16T02:46:19.867Z] [2023-02-16 02:46:12 - DEBUG - ci_test]: (api_request)  : [Collection.search] args: [[[0.019995630853342855, 0.08971894917509969, 0.10728045490814678, 0.12477988628506687, 0.03588064461552008, 0.01609225450988436, 0.03858843477158963, 0.09064269019539943, 0.11312084418397149, 0.06481990053875517, 0.15600821660226832, 0.06821425428847422, 0.012591748586106767, 0.029454599706087388, ......, kwargs: {} (api_request.py:56)
[2023-02-16T02:46:19.867Z] [2023-02-16 02:46:13 - ERROR - pymilvus.decorators]: RPC error: [search], <MilvusException: (code=1, message=fail to search on all shard leaders, err=fail to Search, QueryNode ID=16, reason=ShardNotAvailable(replica=439488301896826890, shard=by-dev-rootcoord-dml_22_439488300704272441v0))>, <Time:{'RPC start': '2023-02-16 02:46:12.968491', 'RPC error': '2023-02-16 02:46:13.180845'}> (decorators.py:108)
[2023-02-16T02:46:19.867Z] [2023-02-16 02:46:13 - ERROR - ci_test]: Traceback (most recent call last):
[2023-02-16T02:46:19.867Z]   File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 26, in inner_wrapper
[2023-02-16T02:46:19.867Z]     res = func(*args, **_kwargs)
[2023-02-16T02:46:19.867Z]   File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 57, in api_request
[2023-02-16T02:46:19.867Z]     return func(*arg, **kwargs)
[2023-02-16T02:46:19.867Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/orm/collection.py", line 614, in search
[2023-02-16T02:46:19.867Z]     res = conn.search(self._name, data, anns_field, param, limit, expr,
[2023-02-16T02:46:19.867Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 109, in handler
[2023-02-16T02:46:19.867Z]     raise e
[2023-02-16T02:46:19.867Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 105, in handler
[2023-02-16T02:46:19.867Z]     return func(*args, **kwargs)
[2023-02-16T02:46:19.867Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 136, in handler
[2023-02-16T02:46:19.867Z]     ret = func(self, *args, **kwargs)
[2023-02-16T02:46:19.867Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 85, in handler
[2023-02-16T02:46:19.867Z]     raise e
[2023-02-16T02:46:19.867Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 50, in handler
[2023-02-16T02:46:19.867Z]     return func(self, *args, **kwargs)
[2023-02-16T02:46:19.867Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 472, in search
[2023-02-16T02:46:19.867Z]     return self._execute_search_requests(requests, timeout, round_decimal=round_decimal, auto_id=auto_id, **kwargs)
[2023-02-16T02:46:19.867Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 441, in _execute_search_requests
[2023-02-16T02:46:19.867Z]     raise pre_err
[2023-02-16T02:46:19.867Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 432, in _execute_search_requests
[2023-02-16T02:46:19.867Z]     raise MilvusException(response.status.error_code, response.status.reason)
[2023-02-16T02:46:19.867Z] pymilvus.exceptions.MilvusException: <MilvusException: (code=1, message=fail to search on all shard leaders, err=fail to Search, QueryNode ID=16, reason=ShardNotAvailable(replica=439488301896826890, shard=by-dev-rootcoord-dml_22_439488300704272441v0))>
[2023-02-16T02:46:19.867Z]  (api_request.py:39)
[2023-02-16T02:46:19.867Z] [2023-02-16 02:46:13 - ERROR - ci_test]: (api_response) : <MilvusException: (code=1, message=fail to search on all shard leaders, err=fail to Search, QueryNode ID=16, reason=ShardNotAvailable(replica=439488301896826890, shard=by-dev-rootcoord-dml_22_439488300704272441v0))> (api_request.py:40)
[2023-02-16T02:46:19.867Z] ------------- generated html file: file:///tmp/ci_logs/report.html -------------
[2023-02-16T02:46:19.867Z] =========================== short test summary info ============================
[2023-02-16T02:46:19.867Z] FAILED testcases/test_all_collections_after_chaos.py::TestAllCollection::test_milvus_default[QueryChecker__YVyjnRL5] - AssertionError
[2023-02-16T02:46:19.867Z] FAILED testcases/test_all_collections_after_chaos.py::TestAllCollection::test_milvus_default[Checker__nd8z4dCu] - AssertionError
[2023-02-16T02:46:19.867Z] FAILED testcases/test_all_collections_after_chaos.py::TestAllCollection::test_milvus_default[Checker__6DFWQNgj] - AssertionError
[2023-02-16T02:46:19.867Z] FAILED testcases/test_all_collections_after_chaos.py::TestAllCollection::test_milvus_default[Checker__LNCbFub1] - AssertionError
[2023-02-16T02:46:19.868Z] FAILED testcases/test_all_collections_after_chaos.py::TestAllCollection::test_milvus_default[Checker__8RjmHrVJ] - AssertionError
[2023-02-16T02:46:19.868Z] FAILED testcases/test_all_collections_after_chaos.py::TestAllCollection::test_milvus_default[Checker__Lb1fwEkQ] - AssertionError
[2023-02-16T02:46:19.868Z] FAILED testcases/test_all_collections_after_chaos.py::TestAllCollection::test_milvus_default[SearchChecker__3xjaw1he] - AssertionError
[2023-02-16T02:46:19.868Z] ==================== 7 failed, 5 passed in 97.74s (0:01:37) ====================

Expected Behavior

No response

Steps To Reproduce

No response

Milvus Log

chaos type: pod-kill
image tag: 2.2.0-20230215-cad39ebf
target pod: querynode
failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/chaos-test-kafka-for-release-cron/detail/chaos-test-kafka-for-release-cron/2015/pipeline

log: artifacts-querynode-pod-kill-2015-server-logs.tar.gz artifacts-querynode-pod-kill-2015-pytest-logs.tar.gz

Anything else?

This issue reproduces stably with this image tag, but works fine with 2.2.0-20230214-c333ee8d, so the problem was introduced yesterday.

PR https://github.com/milvus-io/milvus/pull/22154 looks suspicious.

@congqixia Can you please take a look?

zhuwenxing commented 1 year ago

another failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/chaos-test-kafka-for-release-cron/detail/chaos-test-kafka-for-release-cron/1999/pipeline

log: artifacts-querynode-pod-kill-1999-server-logs.tar.gz artifacts-querynode-pod-kill-1999-pytest-logs.tar.gz

congqixia commented 1 year ago

/unassign @yanliang567
/assign

OK, working on it.

congqixia commented 1 year ago

@zhuwenxing The patch PR has been merged. Could you please verify? /assign @zhuwenxing

zhuwenxing commented 1 year ago

The error log changed to `fail to Search, QueryNode ID=19, reason=Search 21 failed, reason QueryNode 22 can't serve, recovering`. Taking a deeper look at the logs, I found that one querynode panicked with this error:

[2023/02/20 22:08:06.100 +00:00] [DEBUG] [pulsar/pulsar_client.go:135] ["tr/create consumer"] [msg="create consumer done"] [duration=455.682623ms]
[2023/02/20 22:08:06.100 +00:00] [INFO] [msgstream/mq_msgstream.go:880] ["MsgStream begin to seek start msg: "] [channel=by-dev-rootcoord-dml_16] [MessageID=CBIQ7hcYACAA]
time="2023-02-20T22:08:06Z" level=info msg="Broker notification of Closed consumer: 121" local_addr="10.102.7.128:55020" remote_addr="pulsar://querynode-pod-kill-2066-pulsar-proxy:6650"
time="2023-02-20T22:08:06Z" level=info msg="[Reconnecting to broker in  117.469866ms]" consumerID=121 name=dgsbd subscription=by-dev-queryNode-439597179875298859-21 topic="persistent://public/default/by-dev-rootcoord-dml_16"
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x40 pc=0x2bb7110]

goroutine 101742 [running]:
github.com/milvus-io/milvus/internal/querynode.(*distribution).GetCurrent(0x0, {0x0?, 0x7f54fd140190?, 0xc001b25408?})
    /go/src/github.com/milvus-io/milvus/internal/querynode/distribution.go:98 +0x50
github.com/milvus-io/milvus/internal/querynode.(*ShardCluster).GetSegmentInfos(0xc000263d40)
    /go/src/github.com/milvus-io/milvus/internal/querynode/shard_cluster.go:866 +0x36
github.com/milvus-io/milvus/internal/querynode.(*QueryNode).GetDataDistribution(0xc00084e1e0, {0x3c4a380?, 0x3d1a1e0?}, 0xc0036c2750)
    /go/src/github.com/milvus-io/milvus/internal/querynode/impl.go:1348 +0xd92
github.com/milvus-io/milvus/internal/distributed/querynode.(*Server).GetDataDistribution(0xf?, {0x41ebc58?, 0xc0036c2870?}, 0x10?)
    /go/src/github.com/milvus-io/milvus/internal/distributed/querynode/service.go:337 +0x2b
github.com/milvus-io/milvus/internal/proto/querypb._QueryNode_GetDataDistribution_Handler.func1({0x41ebc58, 0xc0036c2870}, {0x3c39ca0?, 0xc0036c2750})
    /go/src/github.com/milvus-io/milvus/internal/proto/querypb/query_coord.pb.go:5811 +0x78
github.com/milvus-io/milvus/internal/util/logutil.UnaryTraceLoggerInterceptor({0x41ebc58?, 0xc0036c27e0?}, {0x3c39ca0, 0xc0036c2750}, 0x41d60c0?, 0xc002f5d920)
    /go/src/github.com/milvus-io/milvus/internal/util/logutil/grpc_interceptor.go:22 +0x49
github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1({0x41ebc58?, 0xc0036c27e0?}, {0x3c39ca0?, 0xc0036c2750?})
    /go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:25 +0x3a
github.com/grpc-ecosystem/go-grpc-middleware/tracing/opentracing.UnaryServerInterceptor.func1({0x41ebc58, 0xc0036c2720}, {0x3c39ca0, 0xc0036c2750}, 0xc002f6fcc0?, 0xc002f6fce0)
    /go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/tracing/opentracing/server_interceptors.go:38 +0x16a
github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1({0x41ebc58?, 0xc0036c2720?}, {0x3c39ca0?, 0xc0036c2750?})
    /go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:25 +0x3a
github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1({0x41ebc58, 0xc0036c2720}, {0x3c39ca0, 0xc0036c2750}, 0xc002d95af0?, 0x3a599e0?)
    /go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:34 +0xbf
github.com/milvus-io/milvus/internal/proto/querypb._QueryNode_GetDataDistribution_Handler({0x3d1a1e0?, 0xc00055e0c0}, {0x41ebc58, 0xc0036c2720}, 0xc0031ac5a0, 0xc0014e40f0)
    /go/src/github.com/milvus-io/milvus/internal/proto/querypb/query_coord.pb.go:5813 +0x138
google.golang.org/grpc.(*Server).processUnaryRPC(0xc00014f500, {0x41fc430, 0xc001684820}, 0xc002d76000, 0xc0014e4270, 0x56c4180, 0x0)
    /go/pkg/mod/google.golang.org/grpc@v1.46.0/server.go:1283 +0xcfd
google.golang.org/grpc.(*Server).handleStream(0xc00014f500, {0x41fc430, 0xc001684820}, 0xc002d76000, 0x0)
    /go/pkg/mod/google.golang.org/grpc@v1.46.0/server.go:1620 +0xa1b
google.golang.org/grpc.(*Server).serveStreams.func1.2()
    /go/pkg/mod/google.golang.org/grpc@v1.46.0/server.go:922 +0x98
created by google.golang.org/grpc.(*Server).serveStreams.func1
    /go/pkg/mod/google.golang.org/grpc@v1.46.0/server.go:920 +0x28a
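The goroutine at the top of the trace shows `(*distribution).GetCurrent` being called with a nil receiver (`0x0`), i.e. `ShardCluster.GetSegmentInfos` reached the distribution before it was initialized after the pod kill. A minimal sketch (hypothetical, heavily simplified types; not the actual milvus structs) of the panic pattern and the kind of nil guard that avoids it:

```go
package main

import "fmt"

// distribution is a simplified stand-in for the querynode's shard
// distribution state.
type distribution struct {
	current []int64
}

// GetCurrent dereferences the receiver, so calling it on a nil
// *distribution triggers exactly the SIGSEGV seen in the stack trace.
func (d *distribution) GetCurrent() []int64 {
	return d.current
}

// shardCluster mirrors a ShardCluster whose distribution may not yet be
// initialized while the node is still recovering.
type shardCluster struct {
	dist *distribution
}

// GetSegmentInfos guards against the uninitialized distribution instead
// of panicking, returning an explicit "not serviceable" error.
func (sc *shardCluster) GetSegmentInfos() ([]int64, error) {
	if sc.dist == nil {
		return nil, fmt.Errorf("shard cluster not serviceable: distribution not initialized")
	}
	return sc.dist.GetCurrent(), nil
}

func main() {
	sc := &shardCluster{} // dist is nil, as right after a pod kill
	if _, err := sc.GetSegmentInfos(); err != nil {
		fmt.Println("guarded:", err)
	}
}
```

Without the guard, `sc.dist.GetCurrent()` on the nil field would crash the whole process, which is what takes the querynode down here.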


zhuwenxing commented 1 year ago

If this new error has a different root cause, I would like to open a new issue to track it.

zhuwenxing commented 1 year ago

/assign @congqixia

zhuwenxing commented 1 year ago

chaos type: pod-kill
image tag: 2.2.0-20230220-cd2e7fa5
target pod: querynode
failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/chaos-test-for-release-cron/detail/chaos-test-for-release-cron/2066/pipeline

log: artifacts-querynode-pod-kill-2066-server-logs (1).tar.gz artifacts-querynode-pod-kill-2066-pytest-logs.tar.gz

congqixia commented 1 year ago

This should have the same root cause as #22317.

zhuwenxing commented 1 year ago

Verified and passed with 2.2.0-20230222-ec20fec1.