milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0

[Bug]: Load failed reporting `resource group node not enough` (open all mmap) #36584

Closed NicoYuan1986 closed 1 month ago

NicoYuan1986 commented 2 months ago

Is there an existing issue for this?

Environment

- Milvus version: master-20240927-7c2cb8c5-amd64
- Deployment mode (standalone or cluster): cluster
- MQ type (rocksmq, pulsar or kafka): pulsar
- SDK version (e.g. pymilvus v2.0.0rc2):
- OS (Ubuntu or CentOS):
- CPU/Memory:
- GPU:
- Others:

Current Behavior

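`Collection.load` failed with `resource group node not enough` on a cluster deployment with all mmap options enabled. The query coordinator reported 0 available nodes in `__default_resource_group` where 1 was expected: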

[2024-09-27T08:46:26.494Z] [2024-09-27 07:03:55 - DEBUG - ci_test]: (api_request)  : [Collection.load] args: [None, 1, 180], kwargs: {} (api_request.py:62)
[2024-09-27T08:46:26.494Z] [2024-09-27 07:03:55 - ERROR - pymilvus.decorators]: RPC error: [load_collection], <MilvusException: (code=65535, message=call query coordinator LoadCollection: failed to spawn replica for collection: resource group node not enough[rg=__default_resource_group][currentNodeNum=0][expectedNodeNum=1])>, <Time:{'RPC start': '2024-09-27 07:03:55.644998', 'RPC error': '2024-09-27 07:03:55.687226'}> (decorators.py:140)
[2024-09-27T08:46:26.494Z] [2024-09-27 07:03:55 - ERROR - ci_test]: Traceback (most recent call last):
[2024-09-27T08:46:26.494Z]   File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 32, in inner_wrapper
[2024-09-27T08:46:26.494Z]     res = func(*args, **_kwargs)
[2024-09-27T08:46:26.494Z]   File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 63, in api_request
[2024-09-27T08:46:26.494Z]     return func(*arg, **kwargs)
[2024-09-27T08:46:26.494Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/orm/collection.py", line 429, in load
[2024-09-27T08:46:26.494Z]     conn.load_collection(
[2024-09-27T08:46:26.494Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 141, in handler
[2024-09-27T08:46:26.494Z]     raise e from e
[2024-09-27T08:46:26.494Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 137, in handler
[2024-09-27T08:46:26.494Z]     return func(*args, **kwargs)
[2024-09-27T08:46:26.494Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 176, in handler
[2024-09-27T08:46:26.494Z]     return func(self, *args, **kwargs)
[2024-09-27T08:46:26.494Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 116, in handler
[2024-09-27T08:46:26.494Z]     raise e from e
[2024-09-27T08:46:26.494Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 86, in handler
[2024-09-27T08:46:26.494Z]     return func(*args, **kwargs)
[2024-09-27T08:46:26.494Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 1166, in load_collection
[2024-09-27T08:46:26.494Z]     check_status(response)
[2024-09-27T08:46:26.494Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/utils.py", line 63, in check_status
[2024-09-27T08:46:26.494Z]     raise MilvusException(status.code, status.reason, status.error_code)
[2024-09-27T08:46:26.494Z] pymilvus.exceptions.MilvusException: <MilvusException: (code=65535, message=call query coordinator LoadCollection: failed to spawn replica for collection: resource group node not enough[rg=__default_resource_group][currentNodeNum=0][expectedNodeNum=1])>
[2024-09-27T08:46:26.494Z]  (api_request.py:45)
[2024-09-27T08:46:26.494Z] [2024-09-27 07:03:55 - ERROR - ci_test]: (api_response) : <MilvusException: (code=65535, message=call query coordinator LoadCollection: failed to spawn replica for collection: resource group node not enough[rg=__default_resource_group][currentNodeNum=0][expectedNodeNum=1])> (api_request.py:46)
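
For triage, the bracketed fields in the error message above can be pulled out programmatically. The snippet below is a minimal sketch: `parse_resource_group_error` is a hypothetical helper written for this issue, not a pymilvus API, and it assumes the `[key=value]` layout seen in this log stays the same.

```python
import re

# Hypothetical helper (not part of pymilvus): extracts the [key=value]
# fields that the query coordinator appends to the error message.
FIELD_RE = re.compile(r"\[(\w+)=([^\]]*)\]")


def parse_resource_group_error(message: str) -> dict:
    """Return the bracketed fields of the message as a dict."""
    fields = dict(FIELD_RE.findall(message))
    # The node counts are numeric; convert them when present.
    for key in ("currentNodeNum", "expectedNodeNum"):
        if key in fields:
            fields[key] = int(fields[key])
    return fields


# Message copied from the log in this report.
msg = (
    "call query coordinator LoadCollection: failed to spawn replica for "
    "collection: resource group node not enough"
    "[rg=__default_resource_group][currentNodeNum=0][expectedNodeNum=1]"
)
info = parse_resource_group_error(msg)
missing = info["expectedNodeNum"] - info["currentNodeNum"]
print(info["rg"], "is short by", missing, "node(s)")
```

Here the default resource group had no query node registered at the moment of the load request, even though the replica count requested was 1.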

Expected Behavior

The load succeeds and the test case passes.

Steps To Reproduce

No response

Milvus Log

  1. link: https://qa-jenkins.milvus.io/blue/organizations/jenkins/existing-milvus/detail/existing-milvus/8/pipeline
  2. failed case: test_query_auto_id_collection
  3. pod:
    kubectl get pods |grep func-mmap-cluster
    func-mmap-cluster-bujnf-etcd-0                                    1/1     Running                           0                 4h3m
    func-mmap-cluster-bujnf-etcd-1                                    1/1     Running                           0                 4h3m
    func-mmap-cluster-bujnf-etcd-2                                    1/1     Running                           0                 4h3m
    func-mmap-cluster-bujnf-milvus-datacoord-676db69475-gpmt7         1/1     Running                           2 (4h2m ago)      4h3m
    func-mmap-cluster-bujnf-milvus-datanode-7b7565494f-4bbsc          1/1     Running                           3 (4h1m ago)      4h3m
    func-mmap-cluster-bujnf-milvus-indexcoord-d87578575-rxsfz         1/1     Running                           0                 4h3m
    func-mmap-cluster-bujnf-milvus-indexnode-7687684558-khmlk         1/1     Running                           3 (4h1m ago)      4h3m
    func-mmap-cluster-bujnf-milvus-proxy-ff5667d77-79f4t              1/1     Running                           3 (4h1m ago)      4h3m
    func-mmap-cluster-bujnf-milvus-querycoord-77d6f47b7b-mnptb        1/1     Running                           2 (4h1m ago)      4h3m
    func-mmap-cluster-bujnf-milvus-querynode-579bc89b4-qlpjc          1/1     Running                           5 (3h11m ago)     4h3m
    func-mmap-cluster-bujnf-milvus-rootcoord-c95777fbd-skxgz          1/1     Running                           3 (4h1m ago)      4h3m
    func-mmap-cluster-bujnf-minio-0                                   1/1     Running                           0                 4h3m
    func-mmap-cluster-bujnf-minio-1                                   1/1     Running                           0                 4h3m
    func-mmap-cluster-bujnf-minio-2                                   1/1     Running                           0                 4h3m
    func-mmap-cluster-bujnf-minio-3                                   1/1     Running                           0                 4h3m
    func-mmap-cluster-bujnf-pulsar-bookie-0                           1/1     Running                           0                 4h3m
    func-mmap-cluster-bujnf-pulsar-bookie-1                           1/1     Running                           0                 4h3m
    func-mmap-cluster-bujnf-pulsar-bookie-2                           1/1     Running                           0                 4h3m
    func-mmap-cluster-bujnf-pulsar-bookie-init-8dvgv                  0/1     Completed                         0                 4h3m
    func-mmap-cluster-bujnf-pulsar-broker-0                           1/1     Running                           0                 4h3m
    func-mmap-cluster-bujnf-pulsar-proxy-0                            1/1     Running                           0                 4h3m
    func-mmap-cluster-bujnf-pulsar-pulsar-init-2kf5r                  0/1     Completed                         0                 4h3m
    func-mmap-cluster-bujnf-pulsar-recovery-0                         1/1     Running                           0                 4h3m
    func-mmap-cluster-bujnf-pulsar-zookeeper-0                        1/1     Running                           0                 4h3m
    func-mmap-cluster-bujnf-pulsar-zookeeper-1                        1/1     Running                           0                 4h2m
    func-mmap-cluster-bujnf-pulsar-zookeeper-2                        1/1     Running                           0                 4h1m
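
The pod listing above shows several Milvus components with non-zero restart counts, the querynode among them. Whether those restarts relate to the failure is not established in this report; the sketch below only makes the counts easier to scan. `restarted_pods` is a hypothetical helper that assumes the default `kubectl get pods` column layout (NAME, READY, STATUS, RESTARTS, AGE) shown here.

```python
import re

# Matches one data row of default `kubectl get pods` output, e.g.
#   name   1/1   Running   5 (3h11m ago)   4h3m
ROW_RE = re.compile(
    r"^(?P<name>\S+)\s+\d+/\d+\s+(?P<status>\S+)\s+"
    r"(?P<restarts>\d+)(?:\s+\((?P<last>[^)]+) ago\))?\s+(?P<age>\S+)$"
)


def restarted_pods(listing: str) -> list:
    """Return (name, restarts, last_restart) for pods with restarts > 0,
    sorted by restart count, highest first."""
    rows = []
    for line in listing.splitlines():
        match = ROW_RE.match(line.strip())
        if match and int(match["restarts"]) > 0:
            rows.append((match["name"], int(match["restarts"]), match["last"]))
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Three rows copied from the listing in this report.
sample = """
func-mmap-cluster-bujnf-etcd-0                                    1/1     Running                           0                 4h3m
func-mmap-cluster-bujnf-milvus-querycoord-77d6f47b7b-mnptb        1/1     Running                           2 (4h1m ago)      4h3m
func-mmap-cluster-bujnf-milvus-querynode-579bc89b4-qlpjc          1/1     Running                           5 (3h11m ago)     4h3m
"""
restarted = restarted_pods(sample)
for name, count, last in restarted:
    print(f"{name}: {count} restarts, last {last} ago")
```

Sorting by restart count puts the querynode first for this listing, which is the component that serves nodes to `__default_resource_group`.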

Anything else?

No response

xiaofan-luan commented 2 months ago

/assign @weiliu1031

yanliang567 commented 2 months ago

/unassign

yanliang567 commented 1 month ago

@weiliu1031 @NicoYuan1986 any updates on this issue?

NicoYuan1986 commented 1 month ago

This has not reproduced recently.