milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0

[Bug]: Load collection failed with error `collection 438138738384120298 has not been loaded to memory or load failed` after cluster reinstall and upgrade #21299

Closed zhuwenxing closed 1 year ago

zhuwenxing commented 1 year ago

Is there an existing issue for this?

Environment

- Milvus version: 2.2.0-20221216-1aa7a9a8
- Deployment mode(standalone or cluster): cluster
- SDK version(e.g. pymilvus v2.0.0rc2): pymilvus==2.3.0.dev21
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

An error is raised when calling the load API:

[2022-12-18T13:12:29.618Z]         delete_expr = f"{ct.default_int64_field_name} in [0,1,2,3,4,5,6,7,8,9]"

[2022-12-18T13:12:29.618Z]         collection_w.delete(expr=delete_expr)

[2022-12-18T13:12:29.618Z]     

[2022-12-18T13:12:29.618Z]         # search and query

[2022-12-18T13:12:29.618Z]         collection_w.search(vectors_to_search[:default_nq], default_search_field,

[2022-12-18T13:12:29.618Z]                             search_params, default_limit,

[2022-12-18T13:12:29.618Z]                             default_search_exp,

[2022-12-18T13:12:29.618Z]                             output_fields=[ct.default_int64_field_name],

[2022-12-18T13:12:29.618Z]                             check_task=CheckTasks.check_search_results,

[2022-12-18T13:12:29.618Z]                             check_items={"nq": default_nq,

[2022-12-18T13:12:29.618Z]                                          "limit": default_limit})

[2022-12-18T13:12:29.618Z]         collection_w.query(default_term_expr, output_fields=[ct.default_int64_field_name],

[2022-12-18T13:12:29.618Z]                            check_task=CheckTasks.check_query_not_empty)

[2022-12-18T13:12:29.618Z]     

[2022-12-18T13:12:29.618Z]         # drop index if exist

[2022-12-18T13:12:29.618Z]         if len(index_names) > 0:

[2022-12-18T13:12:29.618Z]             for index_name in index_names:

[2022-12-18T13:12:29.618Z]                 collection_w.release()

[2022-12-18T13:12:29.618Z]                 collection_w.drop_index(index_name=index_name)

[2022-12-18T13:12:29.618Z]             default_index_param = gen_index_param(vector_index_type)

[2022-12-18T13:12:29.618Z]             self.create_index(collection_w, default_index_field, default_index_param)

[2022-12-18T13:12:29.618Z]     

[2022-12-18T13:12:29.618Z] >           collection_w.load()
[2022-12-18T13:12:29.621Z] <name>: deploy_test_index_type_BIN_IVF_FLAT_is_compacted_is_compacted_segment_status_all_is_string_indexed_is_string_indexed_replica_number_0_is_deleted_is_deleted_data_size_3000

[2022-12-18T13:12:29.621Z] <partitions>: [{"name": "_default", "collection_name": "deploy_test_index_type_BIN_IVF_FLAT_i......  (api_request.py:31)

[2022-12-18T13:12:29.621Z] [2022-12-18 13:01:37 - DEBUG - ci_test]: (api_request)  : [Collection.flush] args: [], kwargs: {'timeout': 120} (api_request.py:56)

[2022-12-18T13:12:29.621Z] [2022-12-18 13:01:40 - DEBUG - ci_test]: (api_response) : None  (api_request.py:31)

[2022-12-18T13:12:29.621Z] [2022-12-18 13:01:40 - INFO - ci_test]: inserted 3000 data into collection deploy_test_index_type_BIN_IVF_FLAT_is_compacted_is_compacted_segment_status_all_is_string_indexed_is_string_indexed_replica_number_0_is_deleted_is_deleted_data_size_3000 (common_func.py:740)

[2022-12-18T13:12:29.621Z] [2022-12-18 13:01:40 - DEBUG - ci_test]: (api_request)  : [Collection.insert] args: [      int64   float varchar                                      binary_vector

[2022-12-18T13:12:29.621Z] 0         0     0.0       0   b'\xab\x82!\x8c\n\r\xcaM\x1e\xdf\x9fz\xb0\x8aBh'

[2022-12-18T13:12:29.621Z] 1         1     1.0       1     b'\xcb\x81\xaaO\xa5O\xc7D\xb0V@>]\x17\x11\xa4'

[2022-12-18T13:12:29.621Z] 2         2     2.0       2        b'$;\xa2\xe6*\xf4\xa7!-\x05......, kwargs: {'timeout': 120} (api_request.py:56)

[2022-12-18T13:12:29.621Z] [2022-12-18 13:01:40 - DEBUG - ci_test]: (api_response) : (insert count: 3000, delete count: 0, upsert count: 0, timestamp: 438139224271618052, success count: 3000, err count: 0)  (api_request.py:31)

[2022-12-18T13:12:29.621Z] [2022-12-18 13:01:52 - INFO - ci_test]: index info: [] (test_action_second_deployment.py:58)

[2022-12-18T13:12:29.621Z] [2022-12-18 13:11:59 - ERROR - pymilvus.decorators]: RPC error: [get_loading_progress], <MilvusException: (code=1, message=collection 438138892335681167 has not been loaded to memory or load failed)>, <Time:{'RPC start': '2022-12-18 13:11:59.558729', 'RPC error': '2022-12-18 13:11:59.560082'}> (decorators.py:108)

[2022-12-18T13:12:29.621Z] [2022-12-18 13:11:59 - ERROR - pymilvus.decorators]: RPC error: [wait_for_loading_collection], <MilvusException: (code=1, message=collection 438138892335681167 has not been loaded to memory or load failed)>, <Time:{'RPC start': '2022-12-18 13:01:56.655247', 'RPC error': '2022-12-18 13:11:59.560273'}> (decorators.py:108)

[2022-12-18T13:12:29.621Z] [2022-12-18 13:11:59 - ERROR - pymilvus.decorators]: RPC error: [load_collection], <MilvusException: (code=1, message=collection 438138892335681167 has not been loaded to memory or load failed)>, <Time:{'RPC start': '2022-12-18 13:01:56.644749', 'RPC error': '2022-12-18 13:11:59.560328'}> (decorators.py:108)

[get_env_variable] failed to get environment variables : 'CI_LOG_PATH', use default path : /tmp/ci_logs

[2022-12-18T13:12:29.621Z] 

[2022-12-18T13:12:29.621Z] ------------- generated html file: file:///tmp/ci_logs/report.html -------------

[2022-12-18T13:12:29.621Z] =========================== short test summary info ============================

[2022-12-18T13:12:29.621Z] FAILED testcases/test_action_second_deployment.py::TestActionSecondDeployment::test_check[deploy_test_index_type_BIN_IVF_FLAT_is_compacted_is_compacted_segment_status_all_is_string_indexed_is_string_indexed_replica_number_0_is_deleted_is_deleted_data_size_3000] - pymilvus.exceptions.MilvusException: <MilvusException: (code=1, message=collection 438138892335681167 has not been loaded to memory or load failed)>

[2022-12-18T13:12:29.621Z] =================== 1 failed, 49 passed in 702.87s (0:11:42) ===================

script returned exit code 1
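
To see how long the client waited before `load()` gave up, the `RPC start` and `RPC error` timestamps in the pymilvus error lines above can be subtracted. The following is a minimal sketch using only the standard library; the regular expression and the helper name `rpc_elapsed_seconds` are illustrative assumptions, and the sample line is the `load_collection` error from the log with its prefixes dropped.

```python
import re
from datetime import datetime

# The load_collection error line from the log above, without the CI and logger prefixes.
LINE = (
    "RPC error: [load_collection], <MilvusException: (code=1, message=collection "
    "438138892335681167 has not been loaded to memory or load failed)>, "
    "<Time:{'RPC start': '2022-12-18 13:01:56.644749', "
    "'RPC error': '2022-12-18 13:11:59.560328'}>"
)

# Assumed pattern for this log format; it is not part of pymilvus.
PATTERN = re.compile(
    r"RPC error: \[(?P<rpc>\w+)\].*?"
    r"'RPC start': '(?P<start>[^']+)', 'RPC error': '(?P<error>[^']+)'"
)


def rpc_elapsed_seconds(line):
    """Return (rpc_name, seconds between 'RPC start' and 'RPC error'), or None."""
    m = PATTERN.search(line)
    if m is None:
        return None
    fmt = "%Y-%m-%d %H:%M:%S.%f"
    start = datetime.strptime(m["start"], fmt)
    error = datetime.strptime(m["error"], fmt)
    return m["rpc"], (error - start).total_seconds()


rpc, elapsed = rpc_elapsed_seconds(LINE)
print(f"{rpc} failed after {elapsed:.1f}s")
```

For the line above this reports roughly 603 seconds, so the load call waited about ten minutes before failing, while the final `get_loading_progress` call in the same log returned its error within a few milliseconds.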

Expected Behavior

No response

Steps To Reproduce

No response

Milvus Log

- milvus mode: cluster
- deploy task: upgrade
- old image tag: v2.2.0
- new image tag: 2.2.0-20221216-1aa7a9a8
- failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/deploy_test_kafka_for_release_cron/detail/deploy_test_kafka_for_release_cron/102/pipeline

log:

- artifacts-kafka-cluster-upgrade-102-server-second-deployment-logs.tar.gz
- artifacts-kafka-cluster-upgrade-102-server-first-deployment-logs.tar.gz
- artifacts-kafka-cluster-upgrade-102-pytest-logs.tar.gz

- milvus mode: cluster
- deploy task: reinstall
- old image tag: v2.2.0
- new image tag: 2.2.0-20221216-1aa7a9a8


[2022-12-18T13:10:20.766Z] =========================== short test summary info ============================

[2022-12-18T13:10:20.766Z] FAILED testcases/test_action_second_deployment.py::TestActionSecondDeployment::test_check[deploy_test_index_type_BIN_IVF_FLAT_is_compacted_not_compacted_segment_status_all_is_string_indexed_not_string_indexed_replica_number_1_is_deleted_is_deleted_data_size_3000] - pymilvus.exceptions.MilvusException: <MilvusException: (code=1, message=collection 438138738385528495 has not been loaded to memory or load failed)>

[2022-12-18T13:10:20.766Z] FAILED testcases/test_action_second_deployment.py::TestActionSecondDeployment::test_check[deploy_test_index_type_BIN_IVF_FLAT_is_compacted_is_compacted_segment_status_all_is_string_indexed_not_string_indexed_replica_number_2_is_deleted_is_deleted_data_size_3000] - pymilvus.exceptions.MilvusException: <MilvusException: (code=1, message=collection 438138738385527781 has not been loaded to memory or load failed)>

[2022-12-18T13:10:20.766Z] FAILED testcases/test_action_second_deployment.py::TestActionSecondDeployment::test_check[deploy_test_index_type_HNSW_is_compacted_not_compacted_segment_status_only_growing_is_string_indexed_is_string_indexed_replica_number_2_is_deleted_is_deleted_data_size_3000] - pymilvus.exceptions.MilvusException: <MilvusException: (code=1, message=collection 438138738384120298 has not been loaded to memory or load failed)>

[2022-12-18T13:10:20.766Z] ================== 3 failed, 47 passed in 1127.29s (0:18:47) ===================

failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/deploy_test_kafka_for_release_cron/detail/deploy_test_kafka_for_release_cron/100/pipeline

log:

- artifacts-kafka-cluster-reinstall-100-server-second-deployment-logs.tar.gz
- artifacts-kafka-cluster-reinstall-100-server-first-deployment-logs.tar.gz
- artifacts-kafka-cluster-reinstall-100-pytest-logs.tar.gz
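
When comparing runs, it can help to pull the test parameters and the failing collection IDs out of the `FAILED` lines of the pytest short summary. The sketch below does this with a regular expression over the three lines from the reinstall run (CI timestamps dropped). Reading the test ID as index type, segment status, and replica number is an assumption based on the naming in the test IDs, and `parse_failures` is a hypothetical helper, not part of the test suite.

```python
import re

# The three FAILED lines from the reinstall summary above, without the CI timestamps.
SUMMARY = """\
FAILED testcases/test_action_second_deployment.py::TestActionSecondDeployment::test_check[deploy_test_index_type_BIN_IVF_FLAT_is_compacted_not_compacted_segment_status_all_is_string_indexed_not_string_indexed_replica_number_1_is_deleted_is_deleted_data_size_3000] - pymilvus.exceptions.MilvusException: <MilvusException: (code=1, message=collection 438138738385528495 has not been loaded to memory or load failed)>
FAILED testcases/test_action_second_deployment.py::TestActionSecondDeployment::test_check[deploy_test_index_type_BIN_IVF_FLAT_is_compacted_is_compacted_segment_status_all_is_string_indexed_not_string_indexed_replica_number_2_is_deleted_is_deleted_data_size_3000] - pymilvus.exceptions.MilvusException: <MilvusException: (code=1, message=collection 438138738385527781 has not been loaded to memory or load failed)>
FAILED testcases/test_action_second_deployment.py::TestActionSecondDeployment::test_check[deploy_test_index_type_HNSW_is_compacted_not_compacted_segment_status_only_growing_is_string_indexed_is_string_indexed_replica_number_2_is_deleted_is_deleted_data_size_3000] - pymilvus.exceptions.MilvusException: <MilvusException: (code=1, message=collection 438138738384120298 has not been loaded to memory or load failed)>
"""

# Assumed layout of the parametrized test ID; only load failures are matched.
CASE = re.compile(
    r"FAILED \S+\[deploy_test_index_type_(?P<index>\w+?)_is_compacted_.*?"
    r"_segment_status_(?P<segments>\w+?)_is_string_indexed_.*?"
    r"_replica_number_(?P<replicas>\d+)_.*?"
    r"message=collection (?P<collection>\d+) has not been loaded"
)


def parse_failures(text):
    """Return one dict per FAILED line that reports a load failure."""
    return [m.groupdict() for m in CASE.finditer(text)]


failures = parse_failures(SUMMARY)
for failure in failures:
    print(failure)
```

Lines that fail for another reason, such as the `AssertionError: assert 0 == 2` cases in the later run, do not match this pattern and are skipped.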

Anything else?

No response

yanliang567 commented 1 year ago

/assign @jiaoew1991
/unassign

congqixia commented 1 year ago

@zhuwenxing From the log, the upgrade target commit was 1aa7a9a8. Could you please do another run with ae5259c (v2.2.1) as the upgrade target? We need to verify whether the problem occurs with each of these commits.

zhuwenxing commented 1 year ago

It works well when upgrading from the v2.2.0 image to the v2.2.1 image.

zhuwenxing commented 1 year ago

- milvus mode: cluster
- deploy task: reinstall
- old image tag: v2.2.0
- new image tag: 2.2.0-20221219-69b7eeb7

testcases/test_action_second_deployment.py::TestActionSecondDeployment::test_check[deploy_test_index_type_HNSW_is_compacted_is_compacted_segment_status_all_is_string_indexed_not_string_indexed_replica_number_2_is_deleted_is_deleted_data_size_3000] - AssertionError: assert 0 == 2

[2022-12-19T13:05:52.530Z]  +  where 2 = int('2')

[2022-12-19T13:05:52.530Z] FAILED testcases/test_action_second_deployment.py::TestActionSecondDeployment::test_check[deploy_test_index_type_BIN_IVF_FLAT_is_compacted_is_compacted_segment_status_all_is_string_indexed_is_string_indexed_replica_number_2_is_deleted_is_deleted_data_size_3000] - AssertionError: assert 0 == 2

[2022-12-19T13:05:52.530Z]  +  where 2 = int('2')

[2022-12-19T13:05:52.530Z] FAILED testcases/test_action_second_deployment.py::TestActionSecondDeployment::test_check[deploy_test_index_type_BIN_IVF_FLAT_is_compacted_not_compacted_segment_status_all_is_string_indexed_is_string_indexed_replica_number_1_is_deleted_is_deleted_data_size_3000] - pymilvus.exceptions.MilvusException: <MilvusException: (code=1, message=collection 438161394323657777 has not been loaded to memory or load failed)>

[2022-12-19T13:05:52.530Z] FAILED testcases/test_action_second_deployment.py::TestActionSecondDeployment::test_check[deploy_test_index_type_BIN_IVF_FLAT_is_compacted_not_compacted_segment_status_all_is_string_indexed_is_string_indexed_replica_number_0_is_deleted_is_deleted_data_size_3000] - pymilvus.exceptions.MilvusException: <MilvusException: (code=1, message=collection 438161394323656608 has not been loaded to memory or load failed)>

[2022-12-19T13:05:52.530Z] =================== 4 failed, 46 passed in 837.96s (0:13:57) ===================

failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/deploy_test_kafka_for_release_cron/detail/deploy_test_kafka_for_release_cron/109/pipeline

log:

- artifacts-kafka-cluster-reinstall-109-server-second-deployment-logs.tar.gz
- artifacts-kafka-cluster-reinstall-109-server-first-deployment-logs.tar.gz

yah01 commented 1 year ago

The leader view is never corrected when it differs from the distribution, so the collection observer can't see the loaded segments.

/assign @weiliu1031

We need to cherry-pick the fix #20478 to 2.2.
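
The mechanism described in this comment can be illustrated with a toy model. This is only a sketch under stated assumptions: the function names, the set-based data structures, and the percentage formula are hypothetical and are not taken from Milvus's QueryCoord code or from #20478. It shows why an observer that computes load progress from a stale leader view never reaches 100%, even though every segment is actually loaded.

```python
# Toy model only: names and logic are illustrative assumptions, not Milvus code.


def load_percentage(target_segments, leader_view):
    """Share of target segments that the observer can see in the leader view."""
    if not target_segments:
        return 100
    seen = len(target_segments & leader_view)
    return seen * 100 // len(target_segments)


def sync_leader_view(leader_view, distribution):
    """Hypothetical repair step: make the leader view match what the nodes hold."""
    return set(distribution)


target = {1, 2, 3}        # segments the collection needs
distribution = {1, 2, 3}  # segments actually loaded on the query nodes
leader_view = {1, 2}      # stale view: segment 3 is loaded but not recorded

stuck = load_percentage(target, leader_view)
fixed = load_percentage(target, sync_leader_view(leader_view, distribution))
print(stuck, fixed)
```

In this model the progress stays below 100 until the view is synced with the distribution, which matches the symptom in the logs: the client keeps polling the loading progress until it gives up with "has not been loaded to memory or load failed".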

weiliu1031 commented 1 year ago

@zhuwenxing please verify this

zhuwenxing commented 1 year ago

/assign

weiliu1031 commented 1 year ago

/assign

Any progress?

zhuwenxing commented 1 year ago

Not reproduced in 2.2.2-20221222-0fdc1a04