milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0
30.4k stars 2.92k forks source link

[Bug]: Refresh load failed with error `call query coordinator LoadCollection: collection not fully loaded` in ci case #37166

Open zhuwenxing opened 2 weeks ago

zhuwenxing commented 2 weeks ago

Is there an existing issue for this?

Environment

- Milvus version:master
- Deployment mode(standalone or cluster):
- MQ type(rocksmq, pulsar or kafka):    
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

After Import, execute Refresh Load, Refresh Load failed

[pytest : test] __________ TestCreateImportJob.test_job_e2e[False-False-False-2-3000] __________

[pytest : test] [gw1] linux -- Python 3.8.17 /usr/local/bin/python3

[pytest : test] [gw1] linux -- Python 3.8.17 /usr/local/bin/python3[gw1] linux -- Python 3.8.17 /usr/local/bin/python3

[pytest : test] 

[pytest : test] self = <test_jobs_operation.TestCreateImportJob object at 0x7f2c804995e0>

[pytest : test] insert_num = 3000, import_task_num = 2, auto_id = False

[pytest : test] is_partition_key = False, enable_dynamic_field = False

[pytest : test] 

[pytest : test]     @pytest.mark.parametrize("insert_num", [3000])

[pytest : test]     @pytest.mark.parametrize("import_task_num", [2])

[pytest : test]     @pytest.mark.parametrize("auto_id", [True, False])

[pytest : test]     @pytest.mark.parametrize("is_partition_key", [True, False])

[pytest : test]     @pytest.mark.parametrize("enable_dynamic_field", [True, False])

[pytest : test]     def test_job_e2e(self, insert_num, import_task_num, auto_id, is_partition_key, enable_dynamic_field):

[pytest : test]         # create collection

[pytest : test]         name = gen_collection_name()

[pytest : test]         dim = 128

[pytest : test]         payload = {

[pytest : test]             "collectionName": name,

[pytest : test]             "schema": {

[pytest : test]                 "autoId": auto_id,

[pytest : test]                 "enableDynamicField": enable_dynamic_field,

[pytest : test]                 "fields": [

[pytest : test]                     {"fieldName": "book_id", "dataType": "Int64", "isPrimary": True, "elementTypeParams": {}},

[pytest : test]                     {"fieldName": "word_count", "dataType": "Int64", "isPartitionKey": is_partition_key, "elementTypeParams": {}},

[pytest : test]                     {"fieldName": "book_describe", "dataType": "VarChar", "elementTypeParams": {"max_length": "256"}},

[pytest : test]                     {"fieldName": "book_intro", "dataType": "FloatVector", "elementTypeParams": {"dim": f"{dim}"}}

[pytest : test]                 ]

[pytest : test]             },

[pytest : test]             "indexParams": [{"fieldName": "book_intro", "indexName": "book_intro_vector", "metricType": "L2"}]

[pytest : test]         }

[pytest : test]         rsp = self.collection_client.collection_create(payload)

[pytest : test]     

[pytest : test]         # upload file to storage

[pytest : test]         data = []

[pytest : test]         for i in range(insert_num):

[pytest : test]             tmp = {

[pytest : test]                 "word_count": i,

[pytest : test]                 "book_describe": f"book_{i}",

[pytest : test]                 "book_intro": [np.float32(random.random()) for _ in range(dim)]

[pytest : test]             }

[pytest : test]             if not auto_id:

[pytest : test]                 tmp["book_id"] = i

[pytest : test]             if enable_dynamic_field:

[pytest : test]                 tmp.update({f"dynamic_field_{i}": i})

[pytest : test]             data.append(tmp)

[pytest : test]         # dump data to file

[pytest : test]         file_name = f"bulk_insert_data_{uuid4()}.json"

[pytest : test]         file_path = f"/tmp/{file_name}"

[pytest : test]         with open(file_path, "w") as f:

[pytest : test]             json.dump(data, f, cls=NumpyEncoder)

[pytest : test]         # upload file to minio storage

[pytest : test]         self.storage_client.upload_file(file_path, file_name)

[pytest : test]     

[pytest : test]         # create import job

[pytest : test]         payload = {

[pytest : test]             "collectionName": name,

[pytest : test]             "files": [[file_name]],

[pytest : test]         }

[pytest : test]         for i in range(import_task_num):

[pytest : test]             rsp = self.import_job_client.create_import_jobs(payload)

[pytest : test]         # list import job

[pytest : test]         payload = {

[pytest : test]             "collectionName": name,

[pytest : test]         }

[pytest : test]         rsp = self.import_job_client.list_import_jobs(payload)

[pytest : test]     

[pytest : test]         # get import job progress

[pytest : test]         for task in rsp['data']["records"]:

[pytest : test]             task_id = task['jobId']

[pytest : test]             finished = False

[pytest : test]             t0 = time.time()

[pytest : test]     

[pytest : test]             while not finished:

[pytest : test]                 rsp = self.import_job_client.get_import_job_progress(task_id)

[pytest : test]                 if rsp['data']['state'] == "Completed":

[pytest : test]                     finished = True

[pytest : test]                 time.sleep(5)

[pytest : test]                 if time.time() - t0 > IMPORT_TIMEOUT:

[pytest : test]                     assert False, "import job timeout"

[pytest : test]         c = Collection(name)

[pytest : test] >       c.load(_refresh=True)

[pytest : test] 

[pytest : test] testcases/test_jobs_operation.py:103: 

[pytest : test] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

[pytest : test] /usr/local/lib/python3.8/site-packages/pymilvus/orm/collection.py:429: in load

[pytest : test]     conn.load_collection(

[pytest : test] /usr/local/lib/python3.8/site-packages/pymilvus/decorators.py:141: in handler

[pytest : test]     raise e from e

[pytest : test] /usr/local/lib/python3.8/site-packages/pymilvus/decorators.py:137: in handler

[pytest : test]     return func(*args, **kwargs)

[pytest : test] /usr/local/lib/python3.8/site-packages/pymilvus/decorators.py:176: in handler

[pytest : test]     return func(self, *args, **kwargs)

[pytest : test] /usr/local/lib/python3.8/site-packages/pymilvus/decorators.py:116: in handler

[pytest : test]     raise e from e

[pytest : test] /usr/local/lib/python3.8/site-packages/pymilvus/decorators.py:86: in handler

[pytest : test]     return func(*args, **kwargs)

[pytest : test] /usr/local/lib/python3.8/site-packages/pymilvus/client/grpc_handler.py:1171: in load_collection

[pytest : test]     check_status(response)

[pytest : test] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

[pytest : test] 

[pytest : test] status = error_code: UnexpectedError

[pytest : test] reason: "call query coordinator LoadCollection: collection not fully loaded: collection no...l query coordinator LoadCollection: collection not fully loaded: collection not loaded[collection=453468002260052459]"

[pytest : test] 

[pytest : test] 

[pytest : test]     def check_status(status: Status):

[pytest : test]         if status.code != 0 or status.error_code != 0:

[pytest : test] >           raise MilvusException(status.code, status.reason, status.error_code)

[pytest : test] E           pymilvus.exceptions.MilvusException: <MilvusException: (code=65535, message=call query coordinator LoadCollection: collection not fully loaded: collection not loaded[collection=453468002260052459])>

[pytest : test] 

[pytest : test] /usr/local/lib/python3.8/site-packages/pymilvus/client/utils.py:63: MilvusException

[pytest : test] ------------------------------ Captured log call -------------------------------

[pytest : test] [2024-10-25 08:27:42 - DEBUG - ci_test]: 

[pytest : test] method: post, 

[pytest : test] url: http://md-37148-1-py-pr-milvus.milvus-ci:19530/v2/vectordb/collections/create, 

[pytest : test] cost time: 2.218313694000244, 

[pytest : test] header: {'Content-Type': 'application/json', 'Authorization': 'Bearer None', 'RequestId': 'ff338b9a-92aa-11ef-8800-4e9e04d42779'}, 

[pytest : test] payload: {

[pytest : test]     "collectionName": "test_collection_2024_10_25_08_27_39_551139EaRekPPo",

[pytest : test]     "schema": {

[pytest : test]         "autoId": false,

[pytest : test]         "enableDynamicField": false,

[pytest : test]         "fields": [

[pytest : test]             {

[pytest : test]                 "fieldName": "book_id",

[pytest : test]                 "dataType": "Int64",

[pytest : test]                 "isPrimary": true,

[pytest : test]                 "elementTypeParams": {}

[pytest : test]             },

[pytest : test]             {

[pytest : test]                 "fieldName": "word_count",

[pytest : test]                 "dataType": "Int64",

[pytest : test]                 "isPartitionKey": false,

[pytest : test]                 "elementTypeParams": {}

[pytest : test]             },

[pytest : test]             {

[pytest : test]                 "fieldName": "book_describe",

[pytest : test]                 "dataType": "VarChar",

[pytest : test]                 "elementTypeParams": {

[pytest : test]                     "max_length": "256"

[pytest : test]                 }

[pytest : test]             },

[pytest : test]             {

[pytest : test]                 "fieldName": "book_intro",

[pytest : test]                 "dataType": "FloatVector",

[pytest : test]                 "elementTypeParams": {

[pytest : test]                     "dim": "128"

[pytest : test]                 }

[pytest : test]             }

[pytest : test]         ]

[pytest : test]     },

[pytest : test]     "indexParams": [

[pytest : test]         {

[pytest : test]             "fieldName": "book_intro",

[pytest : test]             "indexName": "book_intro_vector",

[pytest : test]             "metricType": "L2"

[pytest : test]         }

[pytest : test]     ],

[pytest : test]     "params": {

[pytest : test]         "consistencyLevel": "Strong"

[pytest : test]     }

[pytest : test] }, 

[pytest : test] response: {"code":0,"data":{}} (milvus.py:77)

[pytest : test] [2024-10-25 08:27:45 - DEBUG - ci_test]: 

[pytest : test] method: post, 

[pytest : test] url: http://md-37148-1-py-pr-milvus.milvus-ci:19530/v2/vectordb/jobs/import/create, 

[pytest : test] cost time: 0.009305953979492188, 

[pytest : test] header: {'Content-Type': 'application/json', 'Authorization': 'Bearer None', 'RequestId': 'ff338b9a-92aa-11ef-8800-4e9e04d42779'}, 

[pytest : test] payload: {

[pytest : test]     "collectionName": "test_collection_2024_10_25_08_27_39_551139EaRekPPo",

[pytest : test]     "files": [

[pytest : test]         [

[pytest : test]             "bulk_insert_data_b14dc38d-b36a-41f2-95f2-0bd4931d52a0.json"

[pytest : test]         ]

[pytest : test]     ],

[pytest : test]     "dbName": "default"

[pytest : test] }, 

[pytest : test] response: {"code":0,"data":{"jobId":"453468002260053157"}} (milvus.py:77)

[pytest : test] [2024-10-25 08:27:45 - DEBUG - ci_test]: 

[pytest : test] method: post, 

[pytest : test] url: http://md-37148-1-py-pr-milvus.milvus-ci:19530/v2/vectordb/jobs/import/create, 

[pytest : test] cost time: 0.010378837585449219, 

[pytest : test] header: {'Content-Type': 'application/json', 'Authorization': 'Bearer None', 'RequestId': 'ff338b9a-92aa-11ef-8800-4e9e04d42779'}, 

[pytest : test] payload: {

[pytest : test]     "collectionName": "test_collection_2024_10_25_08_27_39_551139EaRekPPo",

[pytest : test]     "files": [

[pytest : test]         [

[pytest : test]             "bulk_insert_data_b14dc38d-b36a-41f2-95f2-0bd4931d52a0.json"

[pytest : test]         ]

[pytest : test]     ],

[pytest : test]     "dbName": "default"

[pytest : test] }, 

[pytest : test] response: {"code":0,"data":{"jobId":"453468002260053160"}} (milvus.py:77)

[pytest : test] [2024-10-25 08:27:45 - DEBUG - ci_test]: 

[pytest : test] method: post, 

[pytest : test] url: http://md-37148-1-py-pr-milvus.milvus-ci:19530/v2/vectordb/jobs/import/list, 

[pytest : test] cost time: 0.006852626800537109, 

[pytest : test] header: {'Content-Type': 'application/json', 'Authorization': 'Bearer None', 'RequestId': 'ff338b9a-92aa-11ef-8800-4e9e04d42779'}, 

[pytest : test] payload: {

[pytest : test]     "collectionName": "test_collection_2024_10_25_08_27_39_551139EaRekPPo",

[pytest : test]     "dbName": "default"

[pytest : test] }, 

[pytest : test] response: {"code":0,"data":{"records":[{"collectionName":"test_collection_2024_10_25_08_27_39_551139EaRekPPo","jobId":"453468002260053160","progress":0,"state":"Pending"},{"collectionName":"test_collection_2024_10_25_08_27_39_551139EaRekPPo","jobId":"453468002260053157","progress":0,"state":"Pending"}]}} (milvus.py:77)

[pytest : test] [2024-10-25 08:27:45 - DEBUG - ci_test]: 

[pytest : test] method: post, 

[pytest : test] url: http://md-37148-1-py-pr-milvus.milvus-ci:19530/v2/vectordb/jobs/import/get_progress, 

[pytest : test] cost time: 0.006412982940673828, 

[pytest : test] header: {'Content-Type': 'application/json', 'Authorization': 'Bearer None', 'RequestId': 'ff338b9a-92aa-11ef-8800-4e9e04d42779'}, 

[pytest : test] payload: {

[pytest : test]     "dbName": "default",

[pytest : test]     "jobID": "453468002260053160"

[pytest : test] }, 

[pytest : test] response: {"code":0,"data":{"collectionName":"test_collection_2024_10_25_08_27_39_551139EaRekPPo","completeTime":"","details":[],"fileSize":0,"importedRows":0,"jobId":"453468002260053160","progress":0,"state":"Pending","totalRows":0}} (milvus.py:77)

[pytest : test] [2024-10-25 08:27:50 - DEBUG - ci_test]: 

[pytest : test] method: post, 

[pytest : test] url: http://md-37148-1-py-pr-milvus.milvus-ci:19530/v2/vectordb/jobs/import/get_progress, 

[pytest : test] cost time: 0.006389141082763672, 

[pytest : test] header: {'Content-Type': 'application/json', 'Authorization': 'Bearer None', 'RequestId': 'ff338b9a-92aa-11ef-8800-4e9e04d42779'}, 

[pytest : test] payload: {

[pytest : test]     "dbName": "default",

[pytest : test]     "jobID": "453468002260053160"

[pytest : test] }, 

[pytest : test] response: {"code":0,"data":{"collectionName":"test_collection_2024_10_25_08_27_39_551139EaRekPPo","completeTime":"","details":[],"fileSize":0,"importedRows":0,"jobId":"453468002260053160","progress":10,"state":"Importing","totalRows":0}} (milvus.py:77)

[pytest : test] [2024-10-25 08:27:55 - DEBUG - ci_test]: 

[pytest : test] method: post, 

[pytest : test] url: http://md-37148-1-py-pr-milvus.milvus-ci:19530/v2/vectordb/jobs/import/get_progress, 

[pytest : test] cost time: 0.005665779113769531, 

[pytest : test] header: {'Content-Type': 'application/json', 'Authorization': 'Bearer None', 'RequestId': 'ff338b9a-92aa-11ef-8800-4e9e04d42779'}, 

[pytest : test] payload: {

[pytest : test]     "dbName": "default",

[pytest : test]     "jobID": "453468002260053160"

[pytest : test] }, 

[pytest : test] response: {"code":0,"data":{"collectionName":"test_collection_2024_10_25_08_27_39_551139EaRekPPo","completeTime":"","details":[{"completeTime":"2024-10-25T08:27:54Z","fileName":"[bulk_insert_data_b14dc38d-b36a-...34610,"importedRows":3000,"progress":100,"state":"Completed","totalRows":3000}],"fileSize":8034610,"importedRows":3000,"jobId":"453468002260053160","progress":70,"state":"Importing","totalRows":3000}} (milvus.py:77)

[pytest : test] [2024-10-25 08:28:00 - DEBUG - ci_test]: 

[pytest : test] method: post, 

[pytest : test] url: http://md-37148-1-py-pr-milvus.milvus-ci:19530/v2/vectordb/jobs/import/get_progress, 

[pytest : test] cost time: 0.005645036697387695, 

[pytest : test] header: {'Content-Type': 'application/json', 'Authorization': 'Bearer None', 'RequestId': 'ff338b9a-92aa-11ef-8800-4e9e04d42779'}, 

[pytest : test] payload: {

[pytest : test]     "dbName": "default",

[pytest : test]     "jobID": "453468002260053160"

[pytest : test] }, 

[pytest : test] response: {"code":0,"data":{"collectionName":"test_collection_2024_10_25_08_27_39_551139EaRekPPo","completeTime":"","details":[{"completeTime":"2024-10-25T08:27:54Z","fileName":"[bulk_insert_data_b14dc38d-b36a-...34610,"importedRows":3000,"progress":100,"state":"Completed","totalRows":3000}],"fileSize":8034610,"importedRows":3000,"jobId":"453468002260053160","progress":80,"state":"Importing","totalRows":3000}} (milvus.py:77)

[pytest : test] [2024-10-25 08:28:05 - DEBUG - ci_test]: 

[pytest : test] method: post, 

[pytest : test] url: http://md-37148-1-py-pr-milvus.milvus-ci:19530/v2/vectordb/jobs/import/get_progress, 

[pytest : test] cost time: 0.00798344612121582, 

[pytest : test] header: {'Content-Type': 'application/json', 'Authorization': 'Bearer None', 'RequestId': 'ff338b9a-92aa-11ef-8800-4e9e04d42779'}, 

[pytest : test] payload: {

[pytest : test]     "dbName": "default",

[pytest : test]     "jobID": "453468002260053160"

[pytest : test] }, 

[pytest : test] response: {"code":0,"data":{"collectionName":"test_collection_2024_10_25_08_27_39_551139EaRekPPo","completeTime":"2024-10-25T08:28:02Z","details":[{"completeTime":"2024-10-25T08:27:54Z","fileName":"[bulk_insert...4610,"importedRows":3000,"progress":100,"state":"Completed","totalRows":3000}],"fileSize":8034610,"importedRows":3000,"jobId":"453468002260053160","progress":100,"state":"Completed","totalRows":3000}} (milvus.py:77)

[pytest : test] [2024-10-25 08:28:10 - DEBUG - ci_test]: 

[pytest : test] method: post, 

[pytest : test] url: http://md-37148-1-py-pr-milvus.milvus-ci:19530/v2/vectordb/jobs/import/get_progress, 

[pytest : test] cost time: 0.0065708160400390625, 

[pytest : test] header: {'Content-Type': 'application/json', 'Authorization': 'Bearer None', 'RequestId': 'ff338b9a-92aa-11ef-8800-4e9e04d42779'}, 

[pytest : test] payload: {

[pytest : test]     "dbName": "default",

[pytest : test]     "jobID": "453468002260053157"

[pytest : test] }, 

[pytest : test] response: {"code":0,"data":{"collectionName":"test_collection_2024_10_25_08_27_39_551139EaRekPPo","completeTime":"2024-10-25T08:28:02Z","details":[{"completeTime":"2024-10-25T08:27:54Z","fileName":"[bulk_insert...4610,"importedRows":3000,"progress":100,"state":"Completed","totalRows":3000}],"fileSize":8034610,"importedRows":3000,"jobId":"453468002260053157","progress":100,"state":"Completed","totalRows":3000}} (milvus.py:77)

[pytest : test] [2024-10-25 08:28:15 - ERROR - pymilvus.decorators]: RPC error: [load_collection], <MilvusException: (code=65535, message=call query coordinator LoadCollection: collection not fully loaded: collection not loaded[collection=453468002260052459])>, <Time:{'RPC start': '2024-10-25 08:28:15.288311', 'RPC error': '2024-10-25 08:28:15.290628'}> (decorators.py:140)

Expected Behavior

No response

Steps To Reproduce

No response

Milvus Log

failed ci job: https://jenkins.milvus.io:18080/blue/organizations/jenkins/Milvus%20HA%20CI/detail/PR-37148/1/pipeline

Anything else?

No response

xiaofan-luan commented 2 weeks ago

/assign @weiliu1031

zhuwenxing commented 2 weeks ago

failed ci job: https://jenkins.milvus.io:18080/blue/organizations/jenkins/Milvus%20HA%20CI/detail/PR-37148/4/pipeline

This issue currently has a relatively high reproduction probability. @weiliu1031

xiaofan-luan commented 2 weeks ago

/assign @bigsheeper

zhuwenxing commented 1 week ago

failed ci job: https://jenkins.milvus.io:18080/blue/organizations/jenkins/Milvus%20HA%20CI/detail/PR-37148/7/pipeline

so set is as critical

zhuwenxing commented 1 week ago

failed ci job: https://jenkins.milvus.io:18080/blue/organizations/jenkins/Milvus%20HA%20CI/detail/PR-37119/13/pipeline

bigsheeper commented 1 week ago

RESTful load operations or quick setup create collection operations don’t wait for loading to finish. We need to add a get load state check and wait for loading to complete before importing. image Please help modify all related cases.@zhuwenxing

bigsheeper commented 1 week ago

/assign @zhuwenxing

zhuwenxing commented 1 week ago
[pytest : test]         c = Collection(name)

[pytest : test] >       c.load(_refresh=True)

the failed step is c.load(_refresh=True), so it should not be related to restful api?

Do you mean that the call of c.load(refresh=True) needs to wait until the load before has completed?

bigsheeper commented 1 week ago

Do you mean that the call of c.load(refresh=True) needs to wait until the load before has completed?

Yes. So the test cases need to be updated to wait for loading to complete. You can add a timeout to the wait process.

BTW, there's also an issue in the server. The import process continues for several dozen seconds, and it's problematic that the collection hasn't completed loading in that time. This is related to the issue https://github.com/milvus-io/milvus/issues/37395.

bigsheeper commented 1 week ago

After test cases done, many collections weren't dropped, occupying the pool of the target observer scheduler. This prevented newly loaded collections from updating the current target, which in turn caused load slowly and time out. image

xiaofan-luan commented 5 days ago

/assign @zhuwenxing could you verify this issue

zhuwenxing commented 4 days ago

@xiaofan-luan So can this PR https://github.com/milvus-io/milvus/pull/37433 solve this issue? Do we need to wait for the https://github.com/milvus-io/milvus/pull/37454 to merge?

bigsheeper commented 4 days ago

@xiaofan-luan So can this PR #37433 solve this issue? Do we need to wait for the #37454 to merge?

I think we can validate this after https://github.com/milvus-io/milvus/pull/37454 merged.

Actually multiple PRs (https://github.com/milvus-io/milvus/pull/37433, https://github.com/milvus-io/milvus/pull/37454, https://github.com/milvus-io/milvus/pull/37513) can help alleviate this issue. Additionally, the improvements to the CI environment and process by @yellow-shine are also beneficial.

zhuwenxing commented 22 hours ago

@bigsheeper What is the expected load completion time for an empty collection after fixing, and is it okay to set the timeout to 5s? Currently, in the CI case, it is set to 60s to ensure it passes without issue. If I want to verify that your PR truly speeds up the load, I plan to set the timeout to 5s.