milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0

[Bug]: Milvus Volume getting corrupted? #28634

Closed: genzerstech closed this issue 9 months ago

genzerstech commented 11 months ago

Is there an existing issue for this?

  • [x] I have searched the existing issues

Environment

- Milvus version: 2.3.1 and 2.3.3
- Deployment mode(standalone or cluster): standalone
- MQ type(rocksmq, pulsar or kafka):   Not using MQ
- SDK version(e.g. pymilvus v2.0.0rc2): pymilvus 2.3.3
- OS(Ubuntu or CentOS): 20.04.1-Ubuntu
- CPU/Memory: 432 GB Memory, 64 vCPU - basically Azure Standard_NC64as_T4_v3
- GPU: 4 t4 GPU with  16GB memory each
- Others: na

Current Behavior

We are using Milvus to store embeddings of files. All of a sudden, after a machine reboot (we are using SPOT machines), the volume appears to be corrupted and none of the collections will load.

[screenshot attached]

I unloaded all collections and tried to load just one, but it either takes forever or never loads at all, staying stuck at 0%, even though the number of records per collection is not that large. We plan to load millions of such records across hundreds of collections in production runs.

Some of the errors I can see when I run 'docker-compose logs':

milvus-standalone | [2023/11/22 01:08:25.878 +00:00] [ERROR] [sessionutil/session_util.go:468] ["retry func failed"] ["retry time"=4] [error="function CompareAndSwap error for compare is false for key: rootcoord"] [stack="github.com/milvus-io/milvus/internal/util/sessionutil.(Session).registerService\n\t/go/src/github.com/milvus-io/milvus/internal/util/sessionutil/session_util.go:468\ngithub.com/milvus-io/milvus/internal/util/sessionutil.(Session).Register\n\t/go/src/github.com/milvus-io/milvus/internal/util/sessionutil/session_util.go:288\ngithub.com/milvus-io/milvus/internal/rootcoord.(Core).Register\n\t/go/src/github.com/milvus-io/milvus/internal/rootcoord/root_coord.go:271\ngithub.com/milvus-io/milvus/internal/distributed/rootcoord.(Server).start\n\t/go/src/github.com/milvus-io/milvus/internal/distributed/rootcoord/service.go:292\ngithub.com/milvus-io/milvus/internal/distributed/rootcoord.(Server).Run\n\t/go/src/github.com/milvus-io/milvus/internal/distributed/rootcoord/service.go:153\ngithub.com/milvus-io/milvus/cmd/components.(RootCoord).Run\n\t/go/src/github.com/milvus-io/milvus/cmd/components/root_coord.go:52\ngithub.com/milvus-io/milvus/cmd/roles.runComponent[...].func1\n\t/go/src/github.com/milvus-io/milvus/cmd/roles/roles.go:113"]

milvus-standalone | [2023/11/22 01:09:30.049 +00:00] [WARN] [querycoordv2/services.go:821] ["failed to get replica info"] [collectionID=445764201851805440] [replica=445805914659225601] [error="failed to get channels, collection not loaded: collection=445764201851805440: collection not found"] [errorVerbose="failed to get channels, collection not loaded: collection=445764201851805440: collection not found\n(1) attached stack trace\n -- stack trace:\n | github.com/milvus-io/milvus/pkg/util/merr.WrapErrCollectionNotFound\n | \t/go/src/github.com/milvus-io/milvus/pkg/util/merr/utils.go:374\n | [...repeated from below...]\nWraps: (2) failed to get channels, collection not loaded\nWraps: (3) attached stack trace\n -- stack trace:\n | github.com/milvus-io/milvus/pkg/util/merr.wrapWithField\n | \t/go/src/github.com/milvus-io/milvus/pkg/util/merr/utils.go:760\n | github.com/milvus-io/milvus/pkg/util/merr.WrapErrCollectionNotFound\n | \t/go/src/github.com/milvus-io/milvus/pkg/util/merr/utils.go:372\n | github.com/milvus-io/milvus/internal/querycoordv2.(Server).fillReplicaInfo\n | \t/go/src/github.com/milvus-io/milvus/internal/querycoordv2/handlers.go:320\n | github.com/milvus-io/milvus/internal/querycoordv2.(Server).GetReplicas\n | \t/go/src/github.com/milvus-io/milvus/internal/querycoordv2/services.go:819\n | github.com/milvus-io/milvus/internal/distributed/querycoord.(Server).GetReplicas\n | \t/go/src/github.com/milvus-io/milvus/internal/distributed/querycoord/service.go:385\n | github.com/milvus-io/milvus/internal/proto/querypb._QueryCoord_GetReplicas_Handler.func1\n | \t/go/src/github.com/milvus-io/milvus/internal/proto/querypb/query_coord.pb.go:5560\n | github.com/milvus-io/milvus/pkg/util/interceptor.ServerIDValidationUnaryServerInterceptor.func1\n | \t/go/src/github.com/milvus-io/milvus/pkg/util/interceptor/server_id_interceptor.go:54\n | github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1\n | 
\t/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:25\n | github.com/milvus-io/milvus/pkg/util/interceptor.ClusterValidationUnaryServerInterceptor.func1\n | \t/go/src/github.com/milvus-io/milvus/pkg/util/interceptor/cluster_interceptor.go:48\n | github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1\n | \t/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:25\n | github.com/milvus-io/milvus/pkg/util/logutil.UnaryTraceLoggerInterceptor\n | \t/go/src/github.com/milvus-io/milvus/pkg/util/logutil/grpc_interceptor.go:23\n | github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1\n | \t/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:25\n | go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc.UnaryServerInterceptor.func1\n | \t/go/pkg/mod/go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc@v0.38.0/interceptor.go:342\n | github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1\n | \t/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:25\n | github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1\n | \t/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:34\n | github.com/milvus-io/milvus/internal/proto/querypb._QueryCoord_GetReplicas_Handler\n | \t/go/src/github.com/milvus-io/milvus/internal/proto/querypb/query_coord.pb.go:5562\n | google.golang.org/grpc.(Server).processUnaryRPC\n | \t/go/pkg/mod/google.golang.org/grpc@v1.54.0/server.go:1345\n | google.golang.org/grpc.(Server).handleStream\n | \t/go/pkg/mod/google.golang.org/grpc@v1.54.0/server.go:1722\n | google.golang.org/grpc.(Server).serveStreams.func1.2\n | \t/go/pkg/mod/google.golang.org/grpc@v1.54.0/server.go:966\n | runtime.goexit\n | \t/usr/local/go/src/runtime/asm_amd64.s:1598\nWraps: (4) collection=445764201851805440\nWraps: (5) collection not found\nError types: (1) withstack.withStack (2) 
errutil.withPrefix (3) withstack.withStack (4) errutil.withPrefix (5) merr.milvusError"]

milvus-standalone | [2023/11/22 01:09:25.941 +00:00] [WARN] [rootcoord/quota_center.go:742] ["failed to get collection rate limit config"] [collectionID=445728297739264245] [error="collection=445728297739264245: collection not found"] [errorVerbose="collection=445728297739264245: collection not found\n(1) attached stack trace\n -- stack trace:\n | github.com/milvus-io/milvus/pkg/util/merr.wrapWithField\n | \t/go/src/github.com/milvus-io/milvus/pkg/util/merr/utils.go:760\n | github.com/milvus-io/milvus/pkg/util/merr.WrapErrCollectionNotFound\n | \t/go/src/github.com/milvus-io/milvus/pkg/util/merr/utils.go:372\n | github.com/milvus-io/milvus/internal/rootcoord.(MetaTable).getLatestCollectionByIDInternal\n | \t/go/src/github.com/milvus-io/milvus/internal/rootcoord/meta_table.go:503\n | github.com/milvus-io/milvus/internal/rootcoord.(MetaTable).getCollectionByIDInternal\n | \t/go/src/github.com/milvus-io/milvus/internal/rootcoord/meta_table.go:511\n | github.com/milvus-io/milvus/internal/rootcoord.(MetaTable).GetCollectionByID\n | \t/go/src/github.com/milvus-io/milvus/internal/rootcoord/meta_table.go:595\n | github.com/milvus-io/milvus/internal/rootcoord.(QuotaCenter).getCollectionLimitConfig\n | \t/go/src/github.com/milvus-io/milvus/internal/rootcoord/quota_center.go:740\n | github.com/milvus-io/milvus/internal/rootcoord.(QuotaCenter).checkDiskQuota\n | \t/go/src/github.com/milvus-io/milvus/internal/rootcoord/quota_center.go:769\n | github.com/milvus-io/milvus/internal/rootcoord.(QuotaCenter).calculateWriteRates\n | \t/go/src/github.com/milvus-io/milvus/internal/rootcoord/quota_center.go:431\n | github.com/milvus-io/milvus/internal/rootcoord.(QuotaCenter).calculateRates\n | \t/go/src/github.com/milvus-io/milvus/internal/rootcoord/quota_center.go:680\n | github.com/milvus-io/milvus/internal/rootcoord.(QuotaCenter).run\n | \t/go/src/github.com/milvus-io/milvus/internal/rootcoord/quota_center.go:149\n | runtime.goexit\n | 
\t/usr/local/go/src/runtime/asm_amd64.s:1598\nWraps: (2) collection=445728297739264245\nWraps: (3) collection not found\nError types: (1) withstack.withStack (2) errutil.withPrefix (3) merr.milvusError"]

milvus-standalone | [2023/11/22 01:09:25.042 +00:00] [WARN] [querycoordv2/services.go:821] ["failed to get replica info"] [collectionID=445764201851805440] [replica=445805914659225601] [error="failed to get channels, collection not loaded: collection=445764201851805440: collection not found"] [errorVerbose="failed to get channels, collection not loaded: collection=445764201851805440: collection not found\n(1) attached stack trace\n -- stack trace:\n | github.com/milvus-io/milvus/pkg/util/merr.WrapErrCollectionNotFound\n | \t/go/src/github.com/milvus-io/milvus/pkg/util/merr/utils.go:374\n | [...repeated from below...]\nWraps: (2) failed to get channels, collection not loaded\nWraps: (3) attached stack trace\n -- stack trace:\n | github.com/milvus-io/milvus/pkg/util/merr.wrapWithField\n | \t/go/src/github.com/milvus-io/milvus/pkg/util/merr/utils.go:760\n | github.com/milvus-io/milvus/pkg/util/merr.WrapErrCollectionNotFound\n | \t/go/src/github.com/milvus-io/milvus/pkg/util/merr/utils.go:372\n | github.com/milvus-io/milvus/internal/querycoordv2.(Server).fillReplicaInfo\n | \t/go/src/github.com/milvus-io/milvus/internal/querycoordv2/handlers.go:320\n | github.com/milvus-io/milvus/internal/querycoordv2.(Server).GetReplicas\n | \t/go/src/github.com/milvus-io/milvus/internal/querycoordv2/services.go:819\n | github.com/milvus-io/milvus/internal/distributed/querycoord.(Server).GetReplicas\n | \t/go/src/github.com/milvus-io/milvus/internal/distributed/querycoord/service.go:385\n | github.com/milvus-io/milvus/internal/proto/querypb._QueryCoord_GetReplicas_Handler.func1\n | \t/go/src/github.com/milvus-io/milvus/internal/proto/querypb/query_coord.pb.go:5560\n | github.com/milvus-io/milvus/pkg/util/interceptor.ServerIDValidationUnaryServerInterceptor.func1\n | \t/go/src/github.com/milvus-io/milvus/pkg/util/interceptor/server_id_interceptor.go:54\n | github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1\n | 
\t/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:25\n | github.com/milvus-io/milvus/pkg/util/interceptor.ClusterValidationUnaryServerInterceptor.func1\n | \t/go/src/github.com/milvus-io/milvus/pkg/util/interceptor/cluster_interceptor.go:48\n | github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1\n | \t/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:25\n | github.com/milvus-io/milvus/pkg/util/logutil.UnaryTraceLoggerInterceptor\n | \t/go/src/github.com/milvus-io/milvus/pkg/util/logutil/grpc_interceptor.go:23\n | github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1\n | \t/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:25\n | go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc.UnaryServerInterceptor.func1\n | \t/go/pkg/mod/go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc@v0.38.0/interceptor.go:342\n | github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1\n | \t/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:25\n | github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1\n | \t/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:34\n | github.com/milvus-io/milvus/internal/proto/querypb._QueryCoord_GetReplicas_Handler\n | \t/go/src/github.com/milvus-io/milvus/internal/proto/querypb/query_coord.pb.go:5562\n | google.golang.org/grpc.(Server).processUnaryRPC\n | \t/go/pkg/mod/google.golang.org/grpc@v1.54.0/server.go:1345\n | google.golang.org/grpc.(Server).handleStream\n | \t/go/pkg/mod/google.golang.org/grpc@v1.54.0/server.go:1722\n | google.golang.org/grpc.(Server).serveStreams.func1.2\n | \t/go/pkg/mod/google.golang.org/grpc@v1.54.0/server.go:966\n | runtime.goexit\n | \t/usr/local/go/src/runtime/asm_amd64.s:1598\nWraps: (4) collection=445764201851805440\nWraps: (5) collection not found\nError types: (1) withstack.withStack (2) 
errutil.withPrefix (3) withstack.withStack (4) errutil.withPrefix (5) merr.milvusError"]
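For anyone triaging a dump like the one above, a small stdlib-only helper (an illustration for filtering exported logs, not part of Milvus) can pull out the ERROR/WARN lines and their messages:

```python
import re

# Pattern for Milvus log lines as they appear in `docker-compose logs` output:
# [2023/11/22 01:08:25.878 +00:00] [ERROR] [sessionutil/session_util.go:468] ["retry func failed"] ...
LOG_RE = re.compile(
    r'\[(?P<ts>[^\]]+)\] \[(?P<level>ERROR|WARN|INFO|DEBUG)\] '
    r'\[(?P<loc>[^\]]+)\] \["(?P<msg>[^"]*)"\]'
)

def extract_errors(lines, levels=("ERROR", "WARN")):
    """Return (timestamp, level, location, message) tuples for matching lines."""
    hits = []
    for line in lines:
        m = LOG_RE.search(line)
        if m and m.group("level") in levels:
            hits.append((m.group("ts"), m.group("level"),
                         m.group("loc"), m.group("msg")))
    return hits
```

Running this over an exported milvus.log gives a compact list of warnings and errors to attach to a report, instead of the full wrapped stack traces.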

When I try to peek at the data using Attu, I get the error shown in the attached screenshot.

[screenshot attached]

Expected Behavior

Collections should load properly. This is the third time this has happened to us, and each time we end up deleting the volumes directory and starting afresh. That is tolerable while we are still in development, but for production loads this would be an impossible situation to recover from.

Steps To Reproduce

Happens randomly. I was using Milvus 2.3.1 and even upgraded to 2.3.3, but the issue is the same.

Milvus Log

No response

Anything else?

No response

yanliang567 commented 11 months ago

@genzerstech Please do NOT use SPOT instances for Milvus, especially for querynodes and coordinators. Please provide the full Milvus logs for investigation. Where is the Milvus volume located?

/assign @genzerstech /unassign

xiaofan-luan commented 11 months ago

> (quotes the original issue report in full)

Could you offer the full logs of your environment? All the logs you shared point to a failed load, but that does not seem to be the root cause.

  1. Has this cluster been running successfully for a while?
  2. If it worked previously, what happened? Did it reboot? Did you upgrade? What kind of operations did you perform?
  3. What object store, stream storage, and meta store are you using? Could you offer your config?

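For reference, in the stock Milvus standalone docker-compose file the stores asked about in question 3 live in local bind mounts next to the compose file. A trimmed excerpt (from the v2.3.x-era default file; exact paths may differ between releases) looks like:

```yaml
# Excerpt (assumed from the default Milvus standalone docker-compose.yml):
# etcd holds the meta store, MinIO the object store, and the standalone
# container keeps its local data (including rocksmq) under /var/lib/milvus.
services:
  etcd:
    volumes:
      - ${DOCKER_VOLUME_DIRECTORY:-.}/volumes/etcd:/etcd
  minio:
    volumes:
      - ${DOCKER_VOLUME_DIRECTORY:-.}/volumes/minio:/minio_data
  standalone:
    volumes:
      - ${DOCKER_VOLUME_DIRECTORY:-.}/volumes/milvus:/var/lib/milvus
```

If these directories sit on ephemeral SPOT-instance disks, an eviction or hard reboot can leave etcd or MinIO data in an inconsistent state, which would match the symptoms reported here.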
209ye commented 10 months ago

I'm having the same issue: the Docker container restarts automatically after a power failure, but it won't load an index that has already been built. Concretely, when I use Attu to load a collection with already-built indexes, the progress stays stuck at 0% and after a while Attu automatically stops loading. I am using the default docker-compose configuration for development, with IVF_SQ8 and IVF_PQ as indexes.

  1. Has this cluster been running successfully for a while? After the reboot completes, a few collections can be loaded successfully, but I have not run retrievals against them and am not sure they work correctly.

  2. If it worked previously, what happened? Did it reboot? Did you upgrade? The machine shut down without running docker compose down, and afterwards the indexes would not load. I also tried upgrading milvus-standalone from v2.3.2 to v2.3.3 in the middle of this, but it did not help.

  3. What object store, stream storage, and meta store are you using? I am using the default configuration from the Milvus docker-compose file.
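One way to confirm a stuck load from the client side is to poll the load progress and flag a flat progress curve. This is only a sketch, assuming pymilvus 2.3.x and a Milvus server on localhost; the polling function is not executed here, and the stuck-detection heuristic is my own, not a Milvus API:

```python
import time

def looks_stuck(samples, window=3):
    """Heuristic: a load looks stuck if the last `window` progress
    percentages are all identical and below 100."""
    if len(samples) < window:
        return False
    tail = samples[-window:]
    return len(set(tail)) == 1 and tail[0] < 100

def poll_until_loaded(collection_name, interval_s=10, window=3):
    """Poll Milvus for load progress; requires pymilvus and a reachable
    server, so this function is a sketch and is not run here."""
    from pymilvus import connections, utility  # assumed: pymilvus 2.3.x
    connections.connect(host="localhost", port="19530")
    samples = []
    while True:
        info = utility.loading_progress(collection_name)
        pct = int(str(info["loading_progress"]).rstrip("%"))
        samples.append(pct)
        if pct >= 100:
            return True   # fully loaded
        if looks_stuck(samples, window):
            return False  # progress flat-lined below 100%
        time.sleep(interval_s)
```

A `False` result here would match the "stuck at 0%" symptom seen in Attu and is worth capturing alongside the exported logs when reporting.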

yanliang567 commented 10 months ago

@209ye could you please attach the full milvus logs? For Milvus installed with docker-compose, you can use docker-compose logs > milvus.log to export the logs.

stale[bot] commented 9 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.