milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0
30.07k stars 2.88k forks source link

[Bug]: [benchmark] [cluster] Milvus insert 1183514 vectors, datanode crash #21691

Closed elstic closed 1 year ago

elstic commented 1 year ago

Is there an existing issue for this?

Environment

- Milvus version:2.2.0-20230112-03ce5c26
- Deployment mode(standalone or cluster):cluster
- MQ type(rocksmq, pulsar or kafka): pulsar
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

argo task : fouramf-cron-1673452800

test case:test_recall_glove_ivf_flat_cluster

server:


fouramf-cron-1673452800-1-58-8012-pulsar-zookeeper-2              1/1     Running     0              26m     10.104.5.75    4am-node12   <none>           <none> (cli_client.py:115)
--
  |   | fouramf-cron-1673452800-1-58-8012-pulsar-zookeeper-1              1/1     Running     0              27m     10.104.6.109   4am-node13   <none>           <none>
  |   | fouramf-cron-1673452800-1-58-8012-pulsar-zookeeper-0              1/1     Running     0              28m     10.104.4.53    4am-node11   <none>           <none>
  |   | fouramf-cron-1673452800-1-58-8012-pulsar-recovery-0               1/1     Running     0              28m     10.104.4.45    4am-node11   <none>           <none>
  |   | fouramf-cron-1673452800-1-58-8012-pulsar-pulsar-init-tvmlq        0/1     Completed   0              28m     10.104.6.100   4am-node13   <none>           <none>
  |   | fouramf-cron-1673452800-1-58-8012-pulsar-proxy-0                  1/1     Running     0              28m     10.104.4.47    4am-node11   <none>           <none>
  |   | fouramf-cron-1673452800-1-58-8012-pulsar-broker-0                 1/1     Running     0              28m     10.104.5.66    4am-node12   <none>           <none>
  |   | fouramf-cron-1673452800-1-58-8012-pulsar-bookie-init-dqnck        0/1     Completed   0              28m     10.104.6.97    4am-node13   <none>           <none>
  |   | fouramf-cron-1673452800-1-58-8012-pulsar-bookie-2                 1/1     Running     0              27m     10.104.5.72    4am-node12   <none>           <none>
  |   | fouramf-cron-1673452800-1-58-8012-pulsar-bookie-1                 1/1     Running     0              27m     10.104.6.107   4am-node13   <none>           <none>
  |   | fouramf-cron-1673452800-1-58-8012-pulsar-bookie-0                 1/1     Running     0              28m     10.104.4.52    4am-node11   <none>           <none>
  |   | fouramf-cron-1673452800-1-58-8012-minio-3                         1/1     Running     0              27m     10.104.9.218   4am-node14   <none>           <none>
  |   | fouramf-cron-1673452800-1-58-8012-minio-2                         1/1     Running     0              27m     10.104.5.73    4am-node12   <none>           <none>
  |   | fouramf-cron-1673452800-1-58-8012-minio-1                         1/1     Running     0              27m     10.104.6.106   4am-node13   <none>           <none>
  |   | fouramf-cron-1673452800-1-58-8012-minio-0                         1/1     Running     0              28m     10.104.4.54    4am-node11   <none>           <none>
  |   | fouramf-cron-1673452800-1-58-8012-milvus-rootcoord-7d8fd7bfrzfp   1/1     Running     0              28m     10.104.6.99    4am-node13   <none>           <none>
  |   | fouramf-cron-1673452800-1-58-8012-milvus-querynode-f8bcb6dtj9vg   1/1     Running     0              28m     10.104.6.98    4am-node13   <none>           <none>
  |   | fouramf-cron-1673452800-1-58-8012-milvus-querycoord-6b66799qmtp   1/1     Running     1 (23m ago)    28m     10.104.6.95    4am-node13   <none>           <none>
  |   | fouramf-cron-1673452800-1-58-8012-milvus-proxy-77bd6b94c-4nxrq    1/1     Running     1 (23m ago)    28m     10.104.6.96    4am-node13   <none>           <none>
  |   | fouramf-cron-1673452800-1-58-8012-milvus-indexnode-74c769fjrjjp   1/1     Running     0              28m     10.104.9.214   4am-node14   <none>           <none>Show context
  |   | fouramf-cron-1673452800-1-58-8012-milvus-indexcoord-bff477n4vlt   1/1     Running     1 (23m ago)    28m     10.104.6.94    4am-node13   <none>           <none>
  |   | fouramf-cron-1673452800-1-58-8012-milvus-datanode-756585c48kvdl   1/1     Running     1 (23m ago)    28m     10.104.4.44    4am-node11   <none>           <none>
  |   | fouramf-cron-1673452800-1-58-8012-milvus-datacoord-667c5d4stbxr   1/1     Running     0              28m     10.104.4.46    4am-node11   <none>           <none>
  |   | fouramf-cron-1673452800-1-58-8012-etcd-2                          1/1     Running     0              28m     10.104.6.105   4am-node13   <none>           <none>
  |   | fouramf-cron-1673452800-1-58-8012-etcd-1                          1/1     Running     0              28m     10.104.5.68    4am-node12   <none>           <none>
  |   | fouramf-cron-1673452800-1-58-8012-etcd-0                          1/1     Running     0              28m     10.104.9.216   4am-node14   <none>           <none>

search param:

 Params of search: nq:10000, anns_field:float_vector, param:{'metric_type': 'IP', 'params': {'nprobe': 256}}, limit:10, expr:"None", kwargs:{} (base.py:306)

image

image

raise error:

2023-01-12 13:49:07 | [2023-01-12 05:49:07,762 -  INFO - fouram]: [Base] Start clear collections (base.py:94)
-- | --
  |   | 2023-01-12 13:49:07 | [2023-01-12 05:49:07,762 - ERROR - fouram]: [AccCases] Search raise error:  (accuracy_cases.py:188)
  |   | 2023-01-12 13:49:07 | >)> (func_check.py:48)
  |   | 2023-01-12 13:49:07 | debug_error_string = "{"created":"@1673502547.758420604","description":"Error received from peer ipv4:10.255.67.213:19530","file":"src/core/lib/surface/call.cc","file_line":966,"grpc_message":"Deadline Exceeded","grpc_status":4}"
  |   | 2023-01-12 13:49:07 | details = "Deadline Exceeded"
  |   | 2023-01-12 13:49:07 | status = StatusCode.DEADLINE_EXCEEDED
  |   | 2023-01-12 13:49:07 | [2023-01-12 05:49:07,761 - ERROR - fouram]: [CheckFunc] search request check failed, response:<MilvusException: (code=1, message=<_InactiveRpcError of RPC that terminated with:
  |   | 2023-01-12 13:49:07 | >)> (api_request.py:49)
  |   | 2023-01-12 13:49:07 | debug_error_string = "{"created":"@1673502547.758420604","description":"Error received from peer ipv4:10.255.67.213:19530","file":"src/core/lib/surface/call.cc","file_line":966,"grpc_message":"Deadline Exceeded","grpc_status":4}"
  |   | 2023-01-12 13:49:07 | details = "Deadline Exceeded"
  |   | 2023-01-12 13:49:07 | status = StatusCode.DEADLINE_EXCEEDED
  |   | 2023-01-12 13:49:07 | [2023-01-12 05:49:07,761 - ERROR - fouram]: (api_response) : <MilvusException: (code=1, message=<_InactiveRpcError of RPC that terminated with:
  |   | 2023-01-12 13:49:07 | (api_request.py:48)
  |   | 2023-01-12 13:49:07 | >)>
  |   | 2023-01-12 13:49:07 | debug_error_string = "{"created":"@1673502547.758420604","description":"Error received from peer ipv4:10.255.67.213:19530","file":"src/core/lib/surface/call.cc","file_line":966,"grpc_message":"Deadline Exceeded","grpc_status":4}"
  |   | 2023-01-12 13:49:07 | details = "Deadline Exceeded"
  |   | 2023-01-12 13:49:07 | status = StatusCode.DEADLINE_EXCEEDED
  |   | 2023-01-12 13:49:07 | pymilvus.exceptions.MilvusException: <MilvusException: (code=1, message=<_InactiveRpcError of RPC that terminated with:
  |   | 2023-01-12 13:49:07 | raise MilvusException(message=str(e)) from e
  |   | 2023-01-12 13:49:07 | File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 58, in handler
  |   | 2023-01-12 13:49:07 | ret = func(self, *args, **kwargs)
  |   | 2023-01-12 13:49:07 | File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 136, in handler
  |   | 2023-01-12 13:49:07 | return func(*args, **kwargs)
  |   | 2023-01-12 13:49:07 | File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 105, in handler
  |   | 2023-01-12 13:49:07 | raise e
  |   | 2023-01-12 13:49:07 | File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 109, in handler
  |   | 2023-01-12 13:49:07 | res = conn.search(self._name, data, anns_field, param, limit, expr,
  |   | 2023-01-12 13:49:07 | File "/usr/local/lib/python3.8/dist-packages/pymilvus/orm/collection.py", line 610, in search
  |   | 2023-01-12 13:49:07 | return func(*arg, **kwargs)
  |   | 2023-01-12 13:49:07 | File "/src/fouram/client/util/api_request.py", line 70, in api_request
  |   | 2023-01-12 13:49:07 | res = func(*args, **kwargs)
  |   | 2023-01-12 13:49:07 | File "/src/fouram/client/util/api_request.py", line 33, in inner_wrapper
  |   | 2023-01-12 13:49:07 | Traceback (most recent call last):
  |   | 2023-01-12 13:49:07 |  
  |   | 2023-01-12 13:49:07 | The above exception was the direct cause of the following exception:
  |   | 2023-01-12 13:49:07 |  
  |   | 2023-01-12 13:49:07 | >
  |   | 2023-01-12 13:49:07 | debug_error_string = "{"created":"@1673502547.758420604","description":"Error received from peer ipv4:10.255.67.213:19530","file":"src/core/lib/surface/call.cc","file_line":966,"grpc_message":"Deadline Exceeded","grpc_status":4}"
  |   | 2023-01-12 13:49:07 | details = "Deadline Exceeded"
  |   | 2023-01-12 13:49:07 | status = StatusCode.DEADLINE_EXCEEDED
  |   | 2023-01-12 13:49:07 | grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
  |   | 2023-01-12 13:49:07 | raise _InactiveRpcError(state)
  |   | 2023-01-12 13:49:07 | File "/usr/local/lib/python3.8/dist-packages/grpc/_channel.py", line 849, in _end_unary_response_blocking
  |   | 2023-01-12 13:49:07 | return _end_unary_response_blocking(state, call, False, None)
  |   | 2023-01-12 13:49:07 | File "/usr/local/lib/python3.8/dist-packages/grpc/_channel.py", line 946, in __call__
  |   | 2023-01-12 13:49:07 | response = self._stub.Search(request, timeout=timeout)
  |   | 2023-01-12 13:49:07 | File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 426, in _execute_search_requests
  |   | 2023-01-12 13:49:07 | raise pre_err
  |   | 2023-01-12 13:49:07 | File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 438, in _execute_search_requests
  |   | 2023-01-12 13:49:07 | return self._execute_search_requests(requests, timeout, round_decimal=round_decimal, auto_id=auto_id, **kwargs)
  |   | 2023-01-12 13:49:07 | File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 469, in search
  |   | 2023-01-12 13:49:07 | return func(self, *args, **kwargs)
  |   | 2023-01-12 13:49:07 | File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 50, in handler
  |   | 2023-01-12 13:49:07 | [2023-01-12 05:49:07,761 - ERROR - fouram]: Traceback (most recent call last):
  |   | 2023-01-12 13:49:07 | >)>, <Time:{'RPC start': '2023-01-12 05:44:07.549513', 'RPC error': '2023-01-12 05:49:07.759153'}> (decorators.py:108)
  |   | 2023-01-12 13:49:07 | debug_error_string = "{"created":"@1673502547.758420604","description":"Error received from peer ipv4:10.255.67.213:19530","file":"src/core/lib/surface/call.cc","file_line":966,"grpc_message":"Deadline Exceeded","grpc_status":4}"
  |   | 2023-01-12 13:49:07 | details = "Deadline Exceeded"
  |   | 2023-01-12 13:49:07 | status = StatusCode.DEADLINE_EXCEEDED

Expected Behavior

No response

Steps To Reproduce

1. create a collection or use an existing collection
2. insert training dataset
3. flush collection
4. clean index and build new index
5. load collection
6. search with different parameters
7. clean all collections or not

Milvus Log

No response

Anything else?

No response

xiaofan-luan commented 1 year ago

/assign @yhmo could you pls take a look into it?

yanliang567 commented 1 year ago

/unassign

yhmo commented 1 year ago

The phenomena is very like the https://github.com/milvus-io/milvus/issues/21527 But I didn't found crash log in this test environment.

jingkl commented 1 year ago

server:

fouramf-cron-1674835200-1-89-5217-etcd-0                          1/1     Running            0               2d13h   10.104.1.33    4am-node10   <none>           <none>
fouramf-cron-1674835200-1-89-5217-etcd-1                          1/1     Running            0               2d13h   10.104.9.60    4am-node14   <none>           <none>
fouramf-cron-1674835200-1-89-5217-etcd-2                          1/1     Running            0               2d13h   10.104.4.211   4am-node11   <none>           <none>
fouramf-cron-1674835200-1-89-5217-milvus-datacoord-c5d98bfrnznt   1/1     Running            0               2d13h   10.104.4.207   4am-node11   <none>           <none>
fouramf-cron-1674835200-1-89-5217-milvus-datanode-779c894889hd2   0/1     CrashLoopBackOff   670 (61s ago)   2d13h   10.104.9.52    4am-node14   <none>           <none>
fouramf-cron-1674835200-1-89-5217-milvus-indexcoord-7665f8jx46w   1/1     Running            0               2d13h   10.104.9.54    4am-node14   <none>           <none>
fouramf-cron-1674835200-1-89-5217-milvus-indexnode-5d6f898bnpc4   1/1     Running            0               2d13h   10.104.1.31    4am-node10   <none>           <none>
fouramf-cron-1674835200-1-89-5217-milvus-proxy-567c784c5d-bhzbx   1/1     Running            0               2d13h   10.104.1.29    4am-node10   <none>           <none>
fouramf-cron-1674835200-1-89-5217-milvus-querycoord-665fc68dtj4   1/1     Running            0               2d13h   10.104.9.56    4am-node14   <none>           <none>
fouramf-cron-1674835200-1-89-5217-milvus-querynode-bdf658bm5p7s   1/1     Running            0               2d13h   10.104.4.208   4am-node11   <none>           <none>
fouramf-cron-1674835200-1-89-5217-milvus-rootcoord-dd4ddfcbc28d   1/1     Running            0               2d13h   10.104.4.209   4am-node11   <none>           <none>
fouramf-cron-1674835200-1-89-5217-minio-0                         1/1     Running            0               2d13h   10.104.1.39    4am-node10   <none>           <none>
fouramf-cron-1674835200-1-89-5217-minio-1                         1/1     Running            0               2d13h   10.104.9.64    4am-node14   <none>           <none>
fouramf-cron-1674835200-1-89-5217-minio-2                         1/1     Running            0               2d13h   10.104.4.215   4am-node11   <none>           <none>
fouramf-cron-1674835200-1-89-5217-minio-3                         1/1     Running            0               2d13h   10.104.6.111   4am-node13   <none>           <none>
fouramf-cron-1674835200-1-89-5217-pulsar-bookie-0                 1/1     Running            0               2d13h   10.104.1.38    4am-node10   <none>           <none>
fouramf-cron-1674835200-1-89-5217-pulsar-bookie-1                 1/1     Running            0               2d13h   10.104.9.63    4am-node14   <none>           <none>
fouramf-cron-1674835200-1-89-5217-pulsar-bookie-2                 1/1     Running            0               2d13h   10.104.4.216   4am-node11   <none>           <none>
fouramf-cron-1674835200-1-89-5217-pulsar-bookie-init-d9zxk        0/1     Completed          0               2d13h   10.104.9.50    4am-node14   <none>           <none>
fouramf-cron-1674835200-1-89-5217-pulsar-broker-0                 1/1     Running            0               2d13h   10.104.9.53    4am-node14   <none>           <none>
fouramf-cron-1674835200-1-89-5217-pulsar-proxy-0                  1/1     Running            0               2d13h   10.104.9.55    4am-node14   <none>           <none>
fouramf-cron-1674835200-1-89-5217-pulsar-pulsar-init-mkd4g        0/1     Completed          0               2d13h   10.104.9.46    4am-node14   <none>           <none>
fouramf-cron-1674835200-1-89-5217-pulsar-recovery-0               1/1     Running            0               2d13h   10.104.9.51    4am-node14   <none>           <none>
fouramf-cron-1674835200-1-89-5217-pulsar-zookeeper-0              1/1     Running            0               2d13h   10.104.1.34    4am-node10   <none>           <none>
fouramf-cron-1674835200-1-89-5217-pulsar-zookeeper-1              1/1     Running            0               2d13h   10.104.9.66    4am-node14   <none>           <none>
fouramf-cron-1674835200-1-89-5217-pulsar-zookeeper-2              1/1     Running            0               2d13h   10.104.4.218   4am-node11   <none>           <none>

client log:

[2023-01-27 19:13:35,465 -  INFO - fouram]: [Base] Start inserting, ids: 920000 - 929999, data size: 1,000,000 (base.py:133)
[2023-01-27 19:13:35,719 -  INFO - fouram]: [Time] Collection.insert run in 0.2534s (api_request.py:41)
[2023-01-27 19:13:35,720 -  INFO - fouram]: [Base] Number of vectors in the collection(fouram_Wuokn1oq): 0 (base.py:281)
[2023-01-27 19:13:35,946 -  INFO - fouram]: [Base] Start inserting, ids: 930000 - 939999, data size: 1,000,000 (base.py:133)
[2023-01-27 19:13:36,173 -  INFO - fouram]: [Time] Collection.insert run in 0.2268s (api_request.py:41)
[2023-01-27 19:13:36,175 -  INFO - fouram]: [Base] Number of vectors in the collection(fouram_Wuokn1oq): 0 (base.py:281)
[2023-01-27 19:13:36,268 -  INFO - fouram]: [Base] Start inserting, ids: 940000 - 949999, data size: 1,000,000 (base.py:133)
[2023-01-27 19:13:36,491 -  INFO - fouram]: [Time] Collection.insert run in 0.2221s (api_request.py:41)
[2023-01-27 19:13:36,493 -  INFO - fouram]: [Base] Number of vectors in the collection(fouram_Wuokn1oq): 0 (base.py:281)
[2023-01-27 19:13:36,563 -  INFO - fouram]: [Base] Start inserting, ids: 950000 - 959999, data size: 1,000,000 (base.py:133)
[2023-01-27 19:13:36,787 -  INFO - fouram]: [Time] Collection.insert run in 0.2235s (api_request.py:41)
[2023-01-27 19:13:36,788 -  INFO - fouram]: [Base] Number of vectors in the collection(fouram_Wuokn1oq): 0 (base.py:281)
[2023-01-27 19:13:36,877 -  INFO - fouram]: [Base] Start inserting, ids: 960000 - 969999, data size: 1,000,000 (base.py:133)
[2023-01-27 19:13:37,202 -  INFO - fouram]: [Time] Collection.insert run in 0.3247s (api_request.py:41)
[2023-01-27 19:13:37,204 -  INFO - fouram]: [Base] Number of vectors in the collection(fouram_Wuokn1oq): 0 (base.py:281)
[2023-01-27 19:13:37,287 -  INFO - fouram]: [Base] Start inserting, ids: 970000 - 979999, data size: 1,000,000 (base.py:133)
[2023-01-27 19:13:37,491 -  INFO - fouram]: [Time] Collection.insert run in 0.2037s (api_request.py:41)
[2023-01-27 19:13:37,493 -  INFO - fouram]: [Base] Number of vectors in the collection(fouram_Wuokn1oq): 0 (base.py:281)
[2023-01-27 19:13:37,569 -  INFO - fouram]: [Base] Start inserting, ids: 980000 - 989999, data size: 1,000,000 (base.py:133)
[2023-01-27 19:13:37,774 -  INFO - fouram]: [Time] Collection.insert run in 0.2045s (api_request.py:41)
[2023-01-27 19:13:37,776 -  INFO - fouram]: [Base] Number of vectors in the collection(fouram_Wuokn1oq): 0 (base.py:281)
[2023-01-27 19:13:37,861 -  INFO - fouram]: [Base] Start inserting, ids: 990000 - 999999, data size: 1,000,000 (base.py:133)
[2023-01-27 19:13:38,064 -  INFO - fouram]: [Time] Collection.insert run in 0.2032s (api_request.py:41)
[2023-01-27 19:13:38,066 -  INFO - fouram]: [Base] Number of vectors in the collection(fouram_Wuokn1oq): 0 (base.py:281)
[2023-01-27 19:13:38,081 -  INFO - fouram]: [Base] Total time of ann insert: 29.8852s (base.py:231)
[2023-01-27 19:13:38,081 -  INFO - fouram]: [Base] Start flush collection fouram_Wuokn1oq (base.py:104)

params:

{'dataset_params': {'dim': 128, 'dataset_name': 'sift-128-euclidean', 'ni_per': 10000}, 'collection_params': {'other_fields': []}, 'load_params': {'replica_number': 1}, 'search_params': {'top_k': [10], 'nq': [10000], 'search_param': {'search_k': [50, 100, 500, 1000]}}, 'index_params': {'index_type': 'ANNOY', 'index_param': {'n_trees': 8}}} (params_check.py:31)

testcase: test_recall_sift_annoy_cluster

xiaofan-luan commented 1 year ago

Member @jingkl don't think that's issue,pls check the bucket config

image
jingkl commented 1 year ago

fouram_log (2).debug.zip,Log of the relevant configuration, without changing the configuration of the relevant bucket

jingkl commented 1 year ago

@xiaofan-luan Please take a look again

xiaofan-luan commented 1 year ago

/assign @xiaofan-luan

soothing-rain commented 1 year ago

/unassign @yhmo /assign @wayblink

elstic commented 1 year ago

This problem is not present in recent versions. Verify image: 2.2.0-20230803-6a20862c