milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0
29.3k stars 2.81k forks source link

[Bug]: [Nightly]Search failed reporting 'fail to search on all shard leaders' and 'target node id not match' #20970

Closed NicoYuan1986 closed 1 year ago

NicoYuan1986 commented 1 year ago

Is there an existing issue for this?

Environment

- Milvus version:25a3b9a
- Deployment mode(standalone or cluster):cluster
- SDK version(e.g. pymilvus v2.0.0rc2):2.3.0.dev15
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

Search failed reporting 'fail to search on all shard leaders' and 'target node id not match'.

[2022-12-04T18:55:15.240Z] self = <pymilvus.client.asynch.ChunkedSearchFuture object at 0x7f951bf37a30>
[2022-12-04T18:55:15.240Z] response = [status {
[2022-12-04T18:55:15.240Z]   error_code: UnexpectedError
[2022-12-04T18:55:15.240Z]   reason: "fail to search on all shard leaders, err=fail to Search, QueryNode ID=13, reason=Search 15 failed, reason target node id not match target id = 13, node id = 15 err %!w(<nil>)"
[2022-12-04T18:55:15.240Z] }
[2022-12-04T18:55:15.240Z] ]
[2022-12-04T18:55:15.240Z] 
[2022-12-04T18:55:15.240Z]     def on_response(self, response):
[2022-12-04T18:55:15.240Z]         for raw in response:
[2022-12-04T18:55:15.241Z]             if raw.status.error_code != 0:
[2022-12-04T18:55:15.241Z] >               raise MilvusException(raw.status.error_code, raw.status.reason)
[2022-12-04T18:55:15.241Z] E               pymilvus.exceptions.MilvusException: <MilvusException: (code=1, message=fail to search on all shard leaders, err=fail to Search, QueryNode ID=13, reason=Search 15 failed, reason target node id not match target id = 13, node id = 15 err %!w(<nil>))>
[2022-12-04T18:55:15.241Z] 
[2022-12-04T18:55:15.241Z] /usr/local/lib/python3.8/site-packages/pymilvus/client/asynch.py:245: MilvusException

Expected Behavior

pass

Steps To Reproduce

https://jenkins.milvus.io:18080/blue/organizations/jenkins/Milvus%20Nightly%20CI/detail/master/210/pipeline/155/

Milvus Log

  1. log: artifacts-milvus-distributed-kafka-nightly-210-pymilvus-e2e-logs.tar.gz
  2. collection name: search_collection_IiDOejMk
  3. failed time: [2022-12-04T18:53:03.870Z] [gw3] [ 99%] FAILED testcases/test_search.py::TestsearchDiskann::test_search_with_delete_data[128-False-True]

Anything else?

No response

NicoYuan1986 commented 1 year ago

And milvus crashed with message 'unexpected signal during runtime execution':

2022-12-05T02:50:52.212605241+08:00 stdout F 2022-12-04 18:50:52,212 | WARNING | default | io_getevents() failed; returned 2, expected=6, ernno=-2=Unknown error -2, try #2
2022-12-05T02:50:52.234590327+08:00 stdout F 2022-12-04 18:50:52,234 | WARNING | default | io_getevents() failed; returned -4, expected=6, ernno=4=Interrupted system call, try #2
2022-12-05T02:50:52.390856611+08:00 stderr F fatal error: unexpected signal during runtime execution
2022-12-05T02:50:52.393004535+08:00 stderr F [signal SIGBUS: bus error code=0x80 addr=0x0 pc=0x7fa64f6daf89]
2022-12-05T02:50:52.393028011+08:00 stderr F
2022-12-05T02:50:52.393036049+08:00 stderr F runtime stack:
2022-12-05T02:50:52.393043074+08:00 stderr F runtime.throw({0x3491a6f?, 0x7fa448000080?})
2022-12-05T02:50:52.393049177+08:00 stderr F    /usr/local/go/src/runtime/panic.go:992 +0x71
2022-12-05T02:50:52.393055431+08:00 stderr F runtime.sigpanic()
2022-12-05T02:50:52.393067709+08:00 stderr F    /usr/local/go/src/runtime/signal_unix.go:802 +0x3a9
2022-12-05T02:50:52.39307386+08:00 stderr F
2022-12-05T02:50:52.393080707+08:00 stderr F goroutine 161342 [syscall]:
2022-12-05T02:50:52.39309964+08:00 stderr F runtime.cgocall(0x2c668a0, 0xc001f6ec38)
2022-12-05T02:50:52.3931078+08:00 stderr F      /usr/local/go/src/runtime/cgocall.go:157 +0x5c fp=0xc001f6ec10 sp=0xc001f6ebd8 pc=0x116a7bc
2022-12-05T02:50:52.393124347+08:00 stderr F github.com/milvus-io/milvus/internal/querynode._Cfunc_AppendIndex(0x7fa448002e00, 0x7fa4480ecdb0)
2022-12-05T02:50:52.393148717+08:00 stderr F    _cgo_gotypes.go:178 +0x55 fp=0xc001f6ec38 sp=0xc001f6ec10 pc=0x2acaef5
2022-12-05T02:50:52.393156548+08:00 stderr F github.com/milvus-io/milvus/internal/querynode.(*LoadIndexInfo).appendIndexData.func5(0x7fa448026c00?, 0x343fb9a?)
2022-12-05T02:50:52.393161036+08:00 stderr F    /go/src/github.com/milvus-io/milvus/internal/querynode/load_index_info.go:199 +0x65 fp=0xc001f6ec78 sp=0xc001f6ec38 pc=0x2ad4be5
2022-12-05T02:50:52.393179486+08:00 stderr F github.com/milvus-io/milvus/internal/querynode.(*LoadIndexInfo).appendIndexData(0xc001f6ee00?, {0x0, 0x0, 0xc0010de7a0?}, {0xc0029e1500, 0x7, 0x7fa4480e9f00?})
2022-12-05T02:50:52.393191747+08:00 stderr F    /go/src/github.com/milvus-io/milvus/internal/querynode/load_index_info.go:199 +0x145 fp=0xc001f6ed20 sp=0xc001f6ec78 pc=0x2ad4905
zhuwenxing commented 1 year ago

the same issue reason=Search 3 failed, reason target node id not match target id = 5, node id = 3

[2022-12-05T02:54:28.958Z] [2022-12-05 02:54:28 - DEBUG - ci_test]: (api_request)  : [Collection.search] args: [[[0.07447882752883173, 0.02570878890426706, 0.03888294360793692, 0.035221074271393774, 0.07252698312335111, 0.04305678086379304, 0.14457217322654412, 0.13641078413367644, 0.024980791929894622, 0.09463596474025252, 0.08664793500249682, 0.025301509371140053, 0.10023629038664394, 0.029726568554700104,......, kwargs: {} (api_request.py:56)

[2022-12-05T02:54:28.958Z] [2022-12-05 02:54:28 - ERROR - pymilvus.decorators]: RPC error: [search], <MilvusException: (code=1, message=fail to search on all shard leaders, err=fail to Search, QueryNode ID=5, reason=Search 3 failed, reason target node id not match target id = 5, node id = 3 err %!w(<nil>))>, <Time:{'RPC start': '2022-12-05 02:54:28.022335', 'RPC error': '2022-12-05 02:54:28.609911'}> (decorators.py:108)

[2022-12-05T02:54:28.958Z] [2022-12-05 02:54:28 - ERROR - ci_test]: Traceback (most recent call last):

[2022-12-05T02:54:28.958Z]   File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 26, in inner_wrapper

[2022-12-05T02:54:28.958Z]     res = func(*args, **_kwargs)

[2022-12-05T02:54:28.958Z]   File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 57, in api_request

[2022-12-05T02:54:28.958Z]     return func(*arg, **kwargs)

[2022-12-05T02:54:28.958Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/orm/collection.py", line 610, in search

[2022-12-05T02:54:28.958Z]     res = conn.search(self._name, data, anns_field, param, limit, expr,

[2022-12-05T02:54:28.958Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 109, in handler

[2022-12-05T02:54:28.958Z]     raise e

[2022-12-05T02:54:28.958Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 105, in handler

[2022-12-05T02:54:28.958Z]     return func(*args, **kwargs)

[2022-12-05T02:54:28.958Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 136, in handler

[2022-12-05T02:54:28.958Z]     ret = func(self, *args, **kwargs)

[2022-12-05T02:54:28.958Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 85, in handler

[2022-12-05T02:54:28.958Z]     raise e

[2022-12-05T02:54:28.958Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 50, in handler

[2022-12-05T02:54:28.958Z]     return func(self, *args, **kwargs)

[2022-12-05T02:54:28.958Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 469, in search

[2022-12-05T02:54:28.958Z]     return self._execute_search_requests(requests, timeout, round_decimal=round_decimal, auto_id=auto_id, **kwargs)

[2022-12-05T02:54:28.958Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 438, in _execute_search_requests

[2022-12-05T02:54:28.958Z]     raise pre_err

[2022-12-05T02:54:28.958Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 429, in _execute_search_requests

[2022-12-05T02:54:28.958Z]     raise MilvusException(response.status.error_code, response.status.reason)

[2022-12-05T02:54:28.958Z] pymilvus.exceptions.MilvusException: <MilvusException: (code=1, message=fail to search on all shard leaders, err=fail to Search, QueryNode ID=5, reason=Search 3 failed, reason target node id not match target id = 5, node id = 3 err %!w(<nil>))>

[2022-12-05T02:54:28.958Z]  (api_request.py:39)

[2022-12-05T02:54:28.958Z] [2022-12-05 02:54:28 - ERROR - ci_test]: (api_response) : <MilvusException: (code=1, message=fail to search on all shard leaders, err=fail to Search, QueryNode ID=5, reason=Search 3 failed, reason target node id not match target id = 5, node id = 3 err %!w(<nil>))> (api_request.py:40)

failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/chaos-test-cron/detail/chaos-test-cron/303/pipeline log: artifacts-datanode-pod-kill-303-pytest-logs.tar.gz artifacts-datanode-pod-kill-303-server-logs.tar.gz But it happened when running test_e2e.py. So why can it pass CI?

zhuwenxing commented 1 year ago

/assign @yah01 @yah01 Please take a look, because it may be related to the PR https://github.com/milvus-io/milvus/pull/20967

yah01 commented 1 year ago

Fixed with #20973, and verified with querynode-pod-kill-chaos-test, plz help check /assign @NicoYuan1986 /assign @zhuwenxing

NicoYuan1986 commented 1 year ago

I'll close the issue if it disappears in today's nightly report.

zhuwenxing commented 1 year ago

Not reproduced in my case for version master-20221205-243d8cff

NicoYuan1986 commented 1 year ago

Still occurring. https://jenkins.milvus.io:18080/blue/organizations/jenkins/Milvus%20Nightly%20CI/detail/master/211/pipeline/187 milvus: 039e9ce pymilvus: pymilvus2.3.0.dev15

  1. log: artifacts-milvus-distributed-pulsar-nightly-211-pymilvus-e2e-logs.tar.gz
  2. collection name: delete_h6LwSYsS
  3. failed time: [2022-12-05T18:05:36.647Z] [gw4] [ 20%] FAILED testcases/test_delete.py::TestDeleteString::test_delete_entity_loop_with_string

Also in 2.2 branch: https://jenkins.milvus.io:18080/blue/organizations/jenkins/Milvus%20Nightly%20CI/detail/2.2.0/23/pipeline/187/ test_search_binary_jaccard_flat_index[500-128-False-True-True-BIN_IVF_FLAT]

yah01 commented 1 year ago

/assign @hhy3 same as #21000

xiaofan-luan commented 1 year ago

@yah01 is there a way to improve the error log a little bit

yah01 commented 1 year ago

@yah01 is there a way to improve the error log a little bit

21043

NicoYuan1986 commented 1 year ago

Latest: https://jenkins.milvus.io:18080/blue/organizations/jenkins/Milvus%20Nightly%20CI/detail/2.2.0/25/pipeline/143 case: test_search_binary_superstructure_flat_index[False-True-True-BIN_FLAT] test_search_sync_async_multiple_times[5000-2-32-False] test_search_after_different_index_with_params[32-False-True-HNSW-params4] test_search_after_index_different_metric_type[32-True-False-DISKANN-params6] test_search_index_partitions_fuzzy[5000-500-32-False-False-partition_names0] log: artifacts-milvus-distributed-kafka-nightly-25-pymilvus-e2e-logs.tar.gz

NicoYuan1986 commented 1 year ago

Latest: https://jenkins.milvus.io:18080/blue/organizations/jenkins/Milvus%20Nightly%20CI/detail/master/214/pipeline/155 case: test_search_sync_async_multiple_times[5000-2-128-True] test_search_with_expression[128-True-float < -2**8] test_search_after_index_different_metric_type[32-True-False-DISKANN-params6] log: artifacts-milvus-distributed-kafka-nightly-214-pymilvus-e2e-logs.tar.gz

yah01 commented 1 year ago

image image Two QueryNodes with the same address

yah01 commented 1 year ago

image @hhy3 It often causes panic while use DiskANN index

NicoYuan1986 commented 1 year ago

Update: https://jenkins.milvus.io:18080/blue/organizations/jenkins/Milvus%20Nightly%20CI/detail/2.2.0/29/pipeline/187

milvus: 2.2.0(50c7a80) pymilvus: 2.2.1.dev3

  1. log: artifacts-milvus-distributed-pulsar-nightly-29-pymilvus-e2e-logs.tar.gz
  2. collection name: search_collection_t9GA73Wb
  3. failed time: [2022-12-11T20:25:41.770Z] [gw5] [ 49%] FAILED testcases/test_search.py::TestCollectionSearch::test_search_collection_multiple_times[3000-500-32-True-False]
  4. error message:
    <MilvusException: (code=1, message=fail to Search, QueryNode ID=16, reason=QueryNode 20 can't serve, recovering: target node id not match target id = 16, node id = 20)> (api_request.py:40)
yah01 commented 1 year ago

This issue has been fixed in master, WIP cherry-pick all fix patches into 2.2.0

yah01 commented 1 year ago

/assign @NicoYuan1986 Both fixed for master and 2.2.0

NicoYuan1986 commented 1 year ago

Verify fixed.