milvus-io / milvus-sdk-java

Java SDK for Milvus.
https://milvus.io
Apache License 2.0

Loading collection console log keeps looping with errors in milvus-sdk-java 2.4.0 #880

Open CSi-CJ opened 4 months ago

CSi-CJ commented 4 months ago

Problem Description

When calling the loadCollection method after creating a collection with milvus-sdk-java 2.4.0, the MilvusServiceClient keeps retrying the load in a loop and logging errors. Oddly, milvus-attu shows that the collection has been loaded, yet the console log keeps looping with errors.

Error Log

2024-04-26T15:31:33.347+08:00 RID-b04b984f-fb0f-44d6-9ca7-780db24edb53  WARN 34720 --- [nio-8443-exec-1] i.m.client.AbstractMilvusGrpcClient      : Retry(6) with interval 2430ms. Reason: CANCELLED: Failed to read message.
2024-04-26T15:31:35.806+08:00 RID-b04b984f-fb0f-44d6-9ca7-780db24edb53 ERROR 34720 --- [nio-8443-exec-1] i.m.client.AbstractMilvusGrpcClient      : LoadCollectionRequest collectionName:Entity_100000001_Multi_Vector_3cf4e5916b0549b7ab79d6c0b71be4ce RPC failed! Exception:{}

io.grpc.StatusRuntimeException: CANCELLED: Failed to read message.
    at io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:275) ~[grpc-stub-1.57.2.jar:1.57.2]
    at io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:256) ~[grpc-stub-1.57.2.jar:1.57.2]
    at io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:169) ~[grpc-stub-1.57.2.jar:1.57.2]
    at io.milvus.grpc.MilvusServiceGrpc$MilvusServiceBlockingStub.showCollections(MilvusServiceGrpc.java:4073) ~[milvus-sdk-java-2.4.0.jar:na]
    at io.milvus.client.AbstractMilvusGrpcClient.waitForLoadingCollection(AbstractMilvusGrpcClient.java:94) ~[milvus-sdk-java-2.4.0.jar:na]
    at io.milvus.client.AbstractMilvusGrpcClient.loadCollection(AbstractMilvusGrpcClient.java:565) ~[milvus-sdk-java-2.4.0.jar:na]
    at io.milvus.client.MilvusServiceClient.lambda$loadCollection$8(MilvusServiceClient.java:454) ~[milvus-sdk-java-2.4.0.jar:na]
    at io.milvus.client.MilvusServiceClient.retry(MilvusServiceClient.java:290) ~[milvus-sdk-java-2.4.0.jar:na]
    at io.milvus.client.MilvusServiceClient.loadCollection(MilvusServiceClient.java:454) ~[milvus-sdk-java-2.4.0.jar:na]
    at com.ot.ais.service.search.data.impl.MilvusDatabaseServiceImpl.loadCollection(MilvusDatabaseServiceImpl.java:250) ~[classes/:na]
    at com.ot.ais.service.search.data.impl.MilvusDatabaseServiceImpl.createIndexesAndLoadCollection(MilvusDatabaseServiceImpl.java:151) ~[classes/:na]
    at com.ot.ais.service.search.data.impl.MilvusDatabaseServiceImpl.createCollection(MilvusDatabaseServiceImpl.java:131) ~[classes/:na]

Environment

Steps to Reproduce

  1. Define the method loadCollection:
    public R<RpcStatus> loadCollection(String collectionName) {
        return milvusServiceClient.loadCollection(
          LoadCollectionParam.newBuilder()
            .withCollectionName(collectionName)
            .build()
        );
    }
  2. Invoke loadCollection().

Expected Behavior

The collection should be loaded successfully without looping errors.

Additional Information

Hope someone can help me resolve this issue.

Additionally, I noticed that MilvusServiceClient has a default retry mechanism for almost every database interaction, configured with private int maxRetryTimes = 75. Why is the retry count set to 75? Is there a specific reason behind this number?


yhmo commented 4 months ago

The retry mechanism is consistent with the Milvus Python SDK and works as designed: https://github.com/milvus-io/pymilvus/blob/1081c49fcc21039300fec22e7b19805be8f198f0/pymilvus/decorators.py#L42

loadCollection() calls showCollections() to check the loading progress. It seems the showCollections() call failed at the RPC level.

"CANCELLED: Failed to read message" is a gRPC error; it indicates that the connection is broken or closed.
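
To make the retry behaviour discussed above more concrete, here is a minimal sketch of a capped exponential backoff schedule in plain Java. It is not the SDK's code: the class BackoffSketch and its constants are hypothetical. The initial interval (10 ms) and multiplier (3) are assumptions chosen because they reproduce the "Retry(6) with interval 2430ms" line in the log above; the 3000 ms cap is a further assumption. Only the retry count of 75 comes from this thread.

```java
/** Hypothetical sketch of a capped exponential backoff schedule (not SDK code). */
final class BackoffSketch {
    // Assumed values: they reproduce "Retry(6) with interval 2430ms" from the log,
    // but they are not read from the SDK source.
    static final long INITIAL_BACKOFF_MS = 10;
    static final long MULTIPLIER = 3;
    static final long MAX_BACKOFF_MS = 3000; // assumed cap
    static final int MAX_RETRY_TIMES = 75;   // value quoted in the issue

    /** Interval waited before retry number {@code attempt} (1-based). */
    static long intervalMs(int attempt) {
        long interval = INITIAL_BACKOFF_MS;
        for (int i = 1; i < attempt; i++) {
            // Grow geometrically, but never beyond the cap.
            interval = Math.min(interval * MULTIPLIER, MAX_BACKOFF_MS);
        }
        return interval;
    }

    /** Sum of all intervals if every one of the retries is used. */
    static long totalWaitMs() {
        long total = 0;
        for (int attempt = 1; attempt <= MAX_RETRY_TIMES; attempt++) {
            total += intervalMs(attempt);
        }
        return total;
    }

    public static void main(String[] args) {
        // Print the first few intervals and the worst-case total wait.
        for (int attempt = 1; attempt <= 8; attempt++) {
            System.out.println("Retry(" + attempt + ") interval " + intervalMs(attempt) + "ms");
        }
        System.out.println("Total wait over " + MAX_RETRY_TIMES + " retries: " + totalWaitMs() + "ms");
    }
}
```

Under these assumed values the sleeps alone add up to about 3.5 minutes before the client gives up, which would explain why a broken connection looks like an endless loop in the console.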

CSi-CJ commented 4 months ago

Yeah, it seems the gRPC connection has crashed. I launched the Milvus standalone cluster in a local Ubuntu environment (the infrastructure is shown in the attached screenshot). I think I have almost found the problem: my Milvus Helm chart installation failed, and the query-node pod cannot be found. Reinstalling the Milvus cluster may make it work normally. Please help me confirm whether the cluster status is correct.

yhmo commented 4 months ago

The querycoord failed to initialize. We need the full log to know what the error is.