milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0

[Bug]: Can not load collection after restarting #35630

Closed tanvlt closed 3 weeks ago

tanvlt commented 3 months ago

Is there an existing issue for this?

Environment

- Milvus version: 2.4.5
- Deployment mode(standalone or cluster): standalone
- MQ type(rocksmq, pulsar or kafka): rocksmq
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): Kubernetes 
- CPU/Memory: 4/16
- GPU: None
- Others: Deployed on Azure AKS, using Azure Blob Storage

Current Behavior

After a restart, the collections suddenly cannot be loaded.

image

Expected Behavior

Steps To Reproduce

Cannot reproduce. I created another Milvus instance on the same AKS cluster with the same configuration, and it still works after restarting.
I have restarted that instance many times.

Milvus Log

milvus-s3-pilot-etcd-0.log pod_log.zip

Anything else?

No response

xiaofan-luan commented 3 months ago

It seems you have too many channels, and some of the load tasks timed out.

Probably caused by this bug: https://github.com/milvus-io/milvus/issues/35008
Could you upgrade to 2.4.9 and retry?

tanvlt commented 3 months ago

Hi @xiaofan-luan, I upgraded to 2.4.9 and tried to start again, but it did not help; it still cannot start up.

image

There are a lot of logs like the ones below:

[2024/08/23 03:57:16.150 +00:00] [INFO] [balance/utils.go:115] ["create channel task"] [collection=451222158446365459] [replica=451337904531177490] [channel=by-dev-rootcoord-dml_8_451222158446365459v0] [from=-1] [to=54]
[2024/08/23 03:57:16.150 +00:00] [INFO] [balance/utils.go:115] ["create channel task"] [collection=451021127533543049] [replica=451222161215193148] [channel=by-dev-rootcoord-dml_0_451021127533543049v0] [from=-1] [to=54]
[2024/08/23 03:57:16.150 +00:00] [INFO] [balance/utils.go:115] ["create channel task"] [collection=451639263225372070] [replica=451639267229434001] [channel=by-dev-rootcoord-dml_12_451639263225372070v0] [from=-1] [to=54]
[2024/08/23 03:57:16.150 +00:00] [INFO] [balance/utils.go:115] ["create channel task"] [collection=451708688926640217] [replica=451708692028194863] [channel=by-dev-rootcoord-dml_4_451708688926640217v0] [from=-1] [to=54]
[2024/08/23 03:57:16.150 +00:00] [INFO] [balance/utils.go:115] ["create channel task"] [collection=451021127535038140] [replica=451310628132618244] [channel=by-dev-rootcoord-dml_12_451021127535038140v0] [from=-1] [to=54]
[2024/08/23 03:57:16.150 +00:00] [INFO] [balance/utils.go:115] ["create channel task"] [collection=451708688918054141] [replica=451708692028194847] [channel=by-dev-rootcoord-dml_0_451708688918054141v0] [from=-1] [to=54]
[2024/08/23 03:57:16.150 +00:00] [INFO] [balance/utils.go:115] ["create channel task"] [collection=451639263220524381] [replica=451639267229433866] [channel=by-dev-rootcoord-dml_7_451639263220524381v0] [from=-1] [to=54]
[2024/08/23 03:57:16.150 +00:00] [INFO] [balance/utils.go:115] ["create channel task"] [collection=451639263225110322] [replica=451639267229433976] [channel=by-dev-rootcoord-dml_13_451639263225110322v0] [from=-1] [to=54]
[2024/08/23 03:57:16.186 +00:00] [WARN] [rootcoord/quota_center.go:315] ["quotaCenter collect metrics failed"] [error="collection not found[collection=451437057771314021]"]
[2024/08/23 03:57:16.229 +00:00] [INFO] [task/scheduler.go:643] ["processed tasks"] [nodeID=54] [toProcessNum=289] [committedNum=0] [toRemoveNum=0]
[2024/08/23 03:57:16.229 +00:00] [INFO] [task/scheduler.go:649] ["process tasks related to node done"] [nodeID=54] [processingTaskNum=289] [waitingTaskNum=0] [segmentTaskNum=0] [channelTaskNum=289]
[2024/08/23 03:57:16.262 +00:00] [INFO] [msgstream/mq_msgstream.go:939] ["skip msg"] [source=27] [type=TimeTick] [size=17] [position=<nil>]
[2024/08/23 03:57:16.262 +00:00] [INFO] [msgstream/mq_msgstream.go:939] ["skip msg"] [source=27] [type=TimeTick] [size=17] [position=<nil>]
[2024/08/23 03:57:16.262 +00:00] [INFO] [msgstream/mq_msgstream.go:939] ["skip msg"] [source=27] [type=TimeTick] [size=17] [position=<nil>]
[2024/08/23 03:57:16.262 +00:00] [INFO] [msgstream/mq_msgstream.go:939] ["skip msg"] [source=27] [type=TimeTick] [size=17] [position=<nil>]
[2024/08/23 03:57:16.262 +00:00] [INFO] [msgstream/mq_msgstream.go:939] ["skip msg"] [source=27] [type=TimeTick] [size=17] [position=<nil>]
[2024/08/23 03:57:16.262 +00:00] [INFO] [msgstream/mq_msgstream.go:939] ["skip msg"] [source=27] [type=TimeTick] [size=17] [position=<nil>]
[2024/08/23 03:57:16.262 +00:00] [INFO] [msgstream/mq_msgstream.go:939] ["skip msg"] [source=27] [type=TimeTick] [size=17] [position=<nil>]
[2024/08/23 03:57:16.262 +00:00] [INFO] [msgstream/mq_msgstream.go:939] ["skip msg"] [source=27] [type=TimeTick] [size=17] [position=<nil>]
[2024/08/23 03:57:16.262 +00:00] [INFO] [msgstream/mq_msgstream.go:939] ["skip msg"] [source=27] [type=TimeTick] [size=17] [position=<nil>]
[2024/08/23 03:57:16.262 +00:00] [INFO] [msgstream/mq_msgstream.go:939] ["skip msg"] [source=27] [type=TimeTick] [size=17] [position=<nil>]
[2024/08/23 03:57:16.262 +00:00] [INFO] [msgstream/mq_msgstream.go:939] ["skip msg"] [source=27] [type=TimeTick] [size=17] [position=<nil>]
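
To gauge how many channels are waiting and whether the scheduler is committing any of them, the log lines above can be summarized with a small script. This is a minimal sketch, not part of Milvus: the helper names `parse_fields` and `summarize` are hypothetical, and it assumes every log line carries `[key=value]` fields in the same shape as the excerpt.

```python
import re

# Matches [key=value] fields such as [channel=by-dev-rootcoord-dml_8_...v0].
FIELD_RE = re.compile(r"\[(\w+)=([^\]]*)\]")

# Three lines copied from the excerpt above, used as sample input.
SAMPLE = [
    '[2024/08/23 03:57:16.150 +00:00] [INFO] [balance/utils.go:115] ["create channel task"] [collection=451222158446365459] [replica=451337904531177490] [channel=by-dev-rootcoord-dml_8_451222158446365459v0] [from=-1] [to=54]',
    '[2024/08/23 03:57:16.150 +00:00] [INFO] [balance/utils.go:115] ["create channel task"] [collection=451021127533543049] [replica=451222161215193148] [channel=by-dev-rootcoord-dml_0_451021127533543049v0] [from=-1] [to=54]',
    '[2024/08/23 03:57:16.229 +00:00] [INFO] [task/scheduler.go:643] ["processed tasks"] [nodeID=54] [toProcessNum=289] [committedNum=0] [toRemoveNum=0]',
]


def parse_fields(line):
    """Return the [key=value] fields of one log line as a dict of strings."""
    return dict(FIELD_RE.findall(line))


def summarize(lines):
    """Count distinct channels and collections that got a channel task,
    and keep the counters from the most recent "processed tasks" line."""
    channels, collections = set(), set()
    scheduler = {}
    for line in lines:
        fields = parse_fields(line)
        if '"create channel task"' in line:
            channels.add(fields.get("channel"))
            collections.add(fields.get("collection"))
        elif '"processed tasks"' in line:
            scheduler = fields
    return {
        "channels": len(channels),
        "collections": len(collections),
        "scheduler": scheduler,
    }


if __name__ == "__main__":
    print(summarize(SAMPLE))
```

In the excerpt, `toProcessNum=289` together with `committedNum=0` is the pattern to watch: many channel tasks queued for node 54 and none committed in that scheduling round.
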

xiaofan-luan commented 3 months ago

I think this is just because there are many channels. Waiting for all the channels to load might take some time.

Check the log for "Stop timer for ToWatch operation succeeded" to see whether any new channels are succeeding.
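
One way to run this check is to count the matching log lines per minute and see whether the count keeps growing. The sketch below is an illustration only: the function name `watch_successes_per_minute` is hypothetical, and it assumes each matching line starts with the same `[YYYY/MM/DD HH:MM:SS...]` timestamp as the log lines posted earlier in this thread.

```python
import re
from collections import Counter

# Message text quoted in the comment above.
MARKER = "Stop timer for ToWatch operation succeeded"

# Captures the timestamp down to the minute, e.g. "2024/08/23 03:57".
MINUTE_RE = re.compile(r"^\[(\d{4}/\d{2}/\d{2} \d{2}:\d{2})")


def watch_successes_per_minute(lines):
    """Count lines containing MARKER, grouped by the minute in their timestamp.

    Lines without a recognizable timestamp are counted under "unknown".
    """
    per_minute = Counter()
    for line in lines:
        if MARKER not in line:
            continue
        match = MINUTE_RE.match(line)
        per_minute[match.group(1) if match else "unknown"] += 1
    return per_minute
```

For example, `watch_successes_per_minute(open("milvus.log"))` (the file name is a placeholder) returns a `Counter` keyed by minute. If recent minutes have no entries while channel tasks are still pending, the load is likely stuck rather than slow.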

yanliang567 commented 3 months ago

/assign @congqixia /unassign

tanvlt commented 3 months ago

Hi @xiaofan-luan, unfortunately I waited a long time, but it still did not start completely.

stale[bot] commented 2 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.