milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0
29.39k stars 2.82k forks source link

[Bug]: Load collection timeout after Milvus recovered from querycoord pod kill chaos test #35087

Open zhuwenxing opened 1 month ago

zhuwenxing commented 1 month ago

Is there an existing issue for this?

Environment

- Milvus version:2.4-20240729-e2fc8853-amd64
- Deployment mode(standalone or cluster):cluster
- MQ type(rocksmq, pulsar or kafka):    
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior


[2024-07-29T11:30:09.033Z] <name>: Checker__39o1EG8v

[2024-07-29T11:30:09.033Z] <description>: 

[2024-07-29T11:30:09.033Z] <schema>: {'auto_id': False, 'description': '', 'fields': [{'name': 'int64', 'description': '', 'type': <DataType.INT64: 5>, 'is_primary': True, 'auto_id': False}, {'name': 'float', 'description': '', 'type': <DataType.FLOAT: 10>}......  (api_request.py:37)

[2024-07-29T11:30:09.033Z] [2024-07-29 11:25:24 - DEBUG - ci_test]: (api_request)  : [Collection.flush] args: [], kwargs: {'timeout': 180} (api_request.py:62)

[2024-07-29T11:30:09.033Z] [2024-07-29 11:25:25 - DEBUG - ci_test]: (api_response) : None  (api_request.py:37)

[2024-07-29T11:30:09.033Z] [2024-07-29 11:25:25 - DEBUG - ci_test]: (api_request)  : [Collection.compact] args: [180], kwargs: {} (api_request.py:62)

[2024-07-29T11:30:09.033Z] [2024-07-29 11:25:25 - DEBUG - ci_test]: (api_response) : None  (api_request.py:37)

[2024-07-29T11:30:09.033Z] [2024-07-29 11:26:02 - DEBUG - ci_test]: (api_request)  : [Collection.flush] args: [], kwargs: {'timeout': 180} (api_request.py:62)

[2024-07-29T11:30:09.033Z] [2024-07-29 11:26:03 - DEBUG - ci_test]: (api_response) : None  (api_request.py:37)

[2024-07-29T11:30:09.033Z] [2024-07-29 11:26:03 - INFO - ci_test]: assert create collection: 0.015234708786010742, init_entities: 162000 (test_all_collections_after_chaos.py:49)

[2024-07-29T11:30:09.033Z] [2024-07-29 11:26:07 - DEBUG - ci_test]: (api_request)  : [Collection.insert] args: [[[-3000, -2999, -2998, -2997, -2996, -2995, -2994, -2993, -2992, -2991, -2990, -2989, -2988, -2987, -2986, -2985, -2984, -2983, -2982, -2981, -2980, -2979, -2978, -2977, -2976, -2975, -2974, -2973, -2972, -2971, -2970, -2969, -2968, -2967, -2966, -2965, -2964, -2963, -2962, -2961, -2960, -2959, -29......, kwargs: {'timeout': 180} (api_request.py:62)

[2024-07-29T11:30:09.033Z] [2024-07-29 11:26:08 - DEBUG - ci_test]: (api_response) : (insert count: 2000, delete count: 0, upsert count: 0, timestamp: 451478124849266690, success count: 2000, err count: 0, cost: 0)  (api_request.py:37)

[2024-07-29T11:30:09.033Z] [2024-07-29 11:26:08 - INFO - ci_test]: assert insert: 2.2203476428985596 (test_all_collections_after_chaos.py:57)

[2024-07-29T11:30:09.033Z] [2024-07-29 11:26:08 - DEBUG - ci_test]: (api_request)  : [Collection.flush] args: [], kwargs: {'timeout': 180} (api_request.py:62)

[2024-07-29T11:30:09.033Z] [2024-07-29 11:26:11 - DEBUG - ci_test]: (api_response) : None  (api_request.py:37)

[2024-07-29T11:30:09.033Z] [2024-07-29 11:26:11 - DEBUG - ci_test]: (api_request)  : [Collection.flush] args: [], kwargs: {'timeout': 180} (api_request.py:62)

[2024-07-29T11:30:09.033Z] [2024-07-29 11:26:12 - DEBUG - ci_test]: (api_response) : None  (api_request.py:37)

[2024-07-29T11:30:09.033Z] [2024-07-29 11:26:12 - INFO - ci_test]: assert flush: 3.0466232299804688, entities: 164000 (test_all_collections_after_chaos.py:67)

[2024-07-29T11:30:09.033Z] [2024-07-29 11:26:12 - INFO - ci_test]: index info: [{'collection': 'Checker__39o1EG8v', 'field': 'image_emb', 'index_name': 'image_emb', 'index_param': {'index_type': 'HNSW', 'metric_type': 'L2', 'params': {'M': 48, 'efConstruction': 500}}}, {'collection': 'Checker__39o1EG8v', 'field': 'text_emb', 'index_name': 'text_emb', 'index_param': {'index_type': 'HNSW', 'metric_type': 'L2', 'params': {'M': 48, 'efConstruction': 500}}}, {'collection': 'Checker__39o1EG8v', 'field': 'voice_emb', 'index_name': 'voice_emb', 'index_param': {'index_type': 'HNSW', 'metric_type': 'L2', 'params': {'M': 48, 'efConstruction': 500}}}, {'collection': 'Checker__39o1EG8v', 'field': 'int64', 'index_name': 'int64', 'index_param': {'index_type': 'INVERTED'}}, {'collection': 'Checker__39o1EG8v', 'field': 'float', 'index_name': 'float', 'index_param': {'index_type': 'INVERTED'}}, {'collection': 'Checker__39o1EG8v', 'field': 'varchar', 'index_name': 'varchar', 'index_param': {'index_type': 'INVERTED'}}, {'collection': 'Checker__39o1EG8v', 'field': 'float_vector', 'index_name': 'float_vector', 'index_param': {'index_type': 'HNSW', 'metric_type': 'L2', 'params': {'M': 48, 'efConstruction': 500}}}] (test_all_collections_after_chaos.py:71)

[2024-07-29T11:30:09.034Z] [2024-07-29 11:26:12 - INFO - ci_test]: index info: [{'collection': 'Checker__39o1EG8v', 'field': 'voice_emb', 'index_name': 'voice_emb', 'index_param': {'index_type': 'HNSW', 'metric_type': 'L2', 'params': {'M': 48, 'efConstruction': 500}}}, {'collection': 'Checker__39o1EG8v', 'field': 'int64', 'index_name': 'int64', 'index_param': {'index_type': 'INVERTED'}}, {'collection': 'Checker__39o1EG8v', 'field': 'float', 'index_name': 'float', 'index_param': {'index_type': 'INVERTED'}}, {'collection': 'Checker__39o1EG8v', 'field': 'varchar', 'index_name': 'varchar', 'index_param': {'index_type': 'INVERTED'}}, {'collection': 'Checker__39o1EG8v', 'field': 'float_vector', 'index_name': 'float_vector', 'index_param': {'index_type': 'HNSW', 'metric_type': 'L2', 'params': {'M': 48, 'efConstruction': 500}}}, {'collection': 'Checker__39o1EG8v', 'field': 'image_emb', 'index_name': 'image_emb', 'index_param': {'index_type': 'HNSW', 'metric_type': 'L2', 'params': {'M': 48, 'efConstruction': 500}}}, {'collection': 'Checker__39o1EG8v', 'field': 'text_emb', 'index_name': 'text_emb', 'index_param': {'index_type': 'HNSW', 'metric_type': 'L2', 'params': {'M': 48, 'efConstruction': 500}}}] (test_all_collections_after_chaos.py:86)

[2024-07-29T11:30:09.034Z] [2024-07-29 11:26:12 - DEBUG - ci_test]: (api_request)  : [Collection.load] args: [None, 1, 180], kwargs: {} (api_request.py:62)

[2024-07-29T11:30:09.034Z] [2024-07-29 11:29:12 - ERROR - pymilvus.decorators]: RPC error: [wait_for_loading_collection], <MilvusException: (code=1, message=wait for loading collection timeout, collection: Checker__39o1EG8v)>, <Time:{'RPC start': '2024-07-29 11:26:12.292044', 'RPC error': '2024-07-29 11:29:12.295765'}> (decorators.py:146)

[2024-07-29T11:30:09.034Z] [2024-07-29 11:29:12 - WARNING - pymilvus.decorators]: Retry timeout: 180s (decorators.py:106)

[2024-07-29T11:30:09.034Z] [2024-07-29 11:29:12 - ERROR - pymilvus.decorators]: RPC error: [load_collection], <MilvusException: (code=1, message=Retry timeout: 180s, message=wait for loading collection timeout, collection: Checker__39o1EG8v)>, <Time:{'RPC start': '2024-07-29 11:26:12.262751', 'RPC error': '2024-07-29 11:29:12.296696'}> (decorators.py:146)

[2024-07-29T11:30:09.034Z] [2024-07-29 11:29:12 - ERROR - ci_test]: Traceback (most recent call last):

[2024-07-29T11:30:09.034Z]   File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 32, in inner_wrapper

[2024-07-29T11:30:09.034Z]     res = func(*args, **_kwargs)

[2024-07-29T11:30:09.034Z]   File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 63, in api_request

[2024-07-29T11:30:09.034Z]     return func(*arg, **kwargs)

[2024-07-29T11:30:09.034Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/orm/collection.py", line 424, in load

[2024-07-29T11:30:09.034Z]     conn.load_collection(

[2024-07-29T11:30:09.034Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 147, in handler

[2024-07-29T11:30:09.034Z]     raise e from e

[2024-07-29T11:30:09.034Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 143, in handler

[2024-07-29T11:30:09.034Z]     return func(*args, **kwargs)

[2024-07-29T11:30:09.034Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 182, in handler

[2024-07-29T11:30:09.034Z]     return func(self, *args, **kwargs)

[2024-07-29T11:30:09.034Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 107, in handler

[2024-07-29T11:30:09.034Z]     raise MilvusException(

[2024-07-29T11:30:09.034Z] pymilvus.exceptions.MilvusException: <MilvusException: (code=1, message=Retry timeout: 180s, message=wait for loading collection timeout, collection: Checker__39o1EG8v)>

[2024-07-29T11:30:09.034Z]  (api_request.py:45)

[2024-07-29T11:30:09.034Z] [2024-07-29 11:29:12 - ERROR - ci_test]: (api_response) : <MilvusException: (code=1, message=Retry timeout: 180s, message=wait for loading collection timeout, collection: Checker__39o1EG8v)> (api_request.py:46)

Expected Behavior

No response

Steps To Reproduce

No response

Milvus Log

failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/chaos-test-for-release-cron/detail/chaos-test-for-release-cron/14944/pipeline

log: artifacts-querycoord-pod-kill-14944-server-logs.tar.gz

cluster:4am ns:chaos-testing pod info:

 + kubectl get pods -o wide
 + grep querycoord-pod-kill-14944
 querycoord-pod-kill-14944-etcd-0                                  1/1     Running            0               36m     10.104.24.27    4am-node29   <none>           <none>
 querycoord-pod-kill-14944-etcd-1                                  1/1     Running            0               36m     10.104.26.125   4am-node32   <none>           <none>
 querycoord-pod-kill-14944-etcd-2                                  1/1     Running            0               36m     10.104.18.17    4am-node25   <none>           <none>
 querycoord-pod-kill-14944-milvus-datacoord-7bfd858964-sj79n       1/1     Running            4               36m     10.104.6.195    4am-node13   <none>           <none>
 querycoord-pod-kill-14944-milvus-datanode-77679bcc66-fn6cf        1/1     Running            4 (35m ago)     36m     10.104.32.231   4am-node39   <none>           <none>
 querycoord-pod-kill-14944-milvus-datanode-77679bcc66-hpfkg        1/1     Running            4 (34m ago)     36m     10.104.30.108   4am-node38   <none>           <none>
 querycoord-pod-kill-14944-milvus-indexcoord-67bf65c7f9-2cxtp      1/1     Running            0               36m     10.104.6.197    4am-node13   <none>           <none>
 querycoord-pod-kill-14944-milvus-indexnode-5ff5467655-66l9f       1/1     Running            3 (35m ago)     36m     10.104.33.98    4am-node36   <none>           <none>
 querycoord-pod-kill-14944-milvus-indexnode-5ff5467655-7l5mn       1/1     Running            3 (35m ago)     36m     10.104.23.179   4am-node27   <none>           <none>
 querycoord-pod-kill-14944-milvus-indexnode-5ff5467655-czgxl       1/1     Running            4 (35m ago)     36m     10.104.6.196    4am-node13   <none>           <none>
 querycoord-pod-kill-14944-milvus-proxy-7c4fdd6cf5-psmjc           1/1     Running            3 (35m ago)     36m     10.104.4.32     4am-node11   <none>           <none>
 querycoord-pod-kill-14944-milvus-querycoord-6bbd7d6bd8-q6gdq      1/1     Running            0               14m     10.104.17.108   4am-node23   <none>           <none>
 querycoord-pod-kill-14944-milvus-querynode-686cfb56d9-8bh9g       1/1     Running            3 (35m ago)     36m     10.104.13.9     4am-node16   <none>           <none>
 querycoord-pod-kill-14944-milvus-querynode-686cfb56d9-d2q2f       1/1     Running            4 (35m ago)     36m     10.104.16.164   4am-node21   <none>           <none>
 querycoord-pod-kill-14944-milvus-querynode-686cfb56d9-pfdlp       1/1     Running            4 (35m ago)     36m     10.104.34.170   4am-node37   <none>           <none>
 querycoord-pod-kill-14944-milvus-rootcoord-8d8b55f7b-hbhmm        1/1     Running            3 (35m ago)     36m     10.104.6.193    4am-node13   <none>           <none>
 querycoord-pod-kill-14944-minio-0                                 1/1     Running            0               36m     10.104.26.121   4am-node32   <none>           <none>
 querycoord-pod-kill-14944-minio-1                                 1/1     Running            0               36m     10.104.20.13    4am-node22   <none>           <none>
 querycoord-pod-kill-14944-minio-2                                 1/1     Running            0               36m     10.104.24.28    4am-node29   <none>           <none>
 querycoord-pod-kill-14944-minio-3                                 1/1     Running            0               36m     10.104.18.16    4am-node25   <none>           <none>
 querycoord-pod-kill-14944-pulsar-bookie-0                         1/1     Running            0               36m     10.104.20.12    4am-node22   <none>           <none>
 querycoord-pod-kill-14944-pulsar-bookie-1                         1/1     Running            0               36m     10.104.26.126   4am-node32   <none>           <none>
 querycoord-pod-kill-14944-pulsar-bookie-init-wl4wf                0/1     Completed          0               36m     10.104.32.229   4am-node39   <none>           <none>
 querycoord-pod-kill-14944-pulsar-broker-0                         1/1     Running            0               36m     10.104.24.22    4am-node29   <none>           <none>
 querycoord-pod-kill-14944-pulsar-proxy-0                          1/1     Running            0               36m     10.104.26.115   4am-node32   <none>           <none>
 querycoord-pod-kill-14944-pulsar-pulsar-init-gmjgr                0/1     Completed          0               36m     10.104.32.232   4am-node39   <none>           <none>
 querycoord-pod-kill-14944-pulsar-recovery-0                       1/1     Running            0               36m     10.104.32.230   4am-node39   <none>           <none>
 querycoord-pod-kill-14944-pulsar-zookeeper-0                      1/1     Running            0               36m     10.104.26.122   4am-node32   <none>           <none>

Anything else?

No response

zhuwenxing commented 1 month ago

/assign @weiliu1031 PTAL

zhuwenxing commented 1 month ago

not a stable reproduced issue