milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0

[Bug]: lots of operations failed because of context canceled #36837

Open Infinite666 opened 1 month ago

Infinite666 commented 1 month ago

Is there an existing issue for this?

Environment

- Milvus version: 2.3.5
- Deployment mode(standalone or cluster): standalone (using docker-compose)
- MQ type(rocksmq, pulsar or kafka): default
- SDK version(e.g. pymilvus v2.0.0rc2): milvus-sdk-java v2.3.1
- OS(Ubuntu or CentOS): Ubuntu
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

The first time I started Milvus with docker-compose, the Milvus container exited, so I restarted it at around 2024/10/14 02:43:08.377. After restarting Milvus and confirming that the Docker status was healthy, I used Attu to connect to Milvus, and it was very slow to show the collections. We tried to search with the SDK and got "SearchRequest RPC failed!"; the connection was refused by Milvus: "Caused by: java.net.ConnectException: Connection refused". After about 5-10 minutes the service seemed to recover. I checked the Milvus logs after 02:43:08 and saw many operations failing because of "context canceled", such as:

[2024/10/14 02:44:53.781 +00:00] [ERROR] [meta/coordinator_broker.go:238] ["failed to fetch index meta"] [collection=453079480318231109] [error="context canceled"] [stack="github.com/milvus-io/milvus/internal/querycoordv2/meta.(*CoordinatorBroker).DescribeIndex\n\t/go/src/github.com/milvus-io/milvus/internal/querycoordv2/meta/coordinator_broker.go:238\ngithub.com/milvus-io/milvus/internal/querycoordv2/task.(*Executor).subDmChannel\n\t/go/src/github.com/milvus-io/milvus/internal/querycoordv2/task/executor.go:354\ngithub.com/milvus-io/milvus/internal/querycoordv2/task.(*Executor).executeDmChannelAction\n\t/go/src/github.com/milvus-io/milvus/internal/querycoordv2/task/executor.go:315\ngithub.com/milvus-io/milvus/internal/querycoordv2/task.(*Executor).Execute.func1\n\t/go/src/github.com/milvus-io/milvus/internal/querycoordv2/task/executor.go:110"]

[2024/10/14 02:45:11.356 +00:00] [ERROR] [retry/retry.go:46] ["retry func failed"] ["retry time"=0] [error="rpc error: code = DeadlineExceeded desc = context deadline exceeded"] [stack="github.com/milvus-io/milvus/pkg/util/retry.Do\n\t/go/src/github.com/milvus-io/milvus/pkg/util/retry/retry.go:46\ngithub.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).call\n\t/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:462\ngithub.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).Call\n\t/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:548\ngithub.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).ReCall\n\t/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:564\ngithub.com/milvus-io/milvus/internal/distributed/datacoord/client.wrapGrpcCall[...]\n\t/go/src/github.com/milvus-io/milvus/internal/distributed/datacoord/client/client.go:101\ngithub.com/milvus-io/milvus/internal/distributed/datacoord/client.(*Client).DescribeIndex\n\t/go/src/github.com/milvus-io/milvus/internal/distributed/datacoord/client/client.go:587\ngithub.com/milvus-io/milvus/internal/querycoordv2/meta.(*CoordinatorBroker).DescribeIndex\n\t/go/src/github.com/milvus-io/milvus/internal/querycoordv2/meta/coordinator_broker.go:233\ngithub.com/milvus-io/milvus/internal/querycoordv2/observers.(*TargetObserver).sync\n\t/go/src/github.com/milvus-io/milvus/internal/querycoordv2/observers/target_observer.go:384\ngithub.com/milvus-io/milvus/internal/querycoordv2/observers.(*TargetObserver).shouldUpdateCurrentTarget\n\t/go/src/github.com/milvus-io/milvus/internal/querycoordv2/observers/target_observer.go:352\ngithub.com/milvus-io/milvus/internal/querycoordv2/observers.(*TargetObserver).check\n\t/go/src/github.com/milvus-io/milvus/internal/querycoordv2/observers/target_observer.go:187\ngithub.com/milvus-io/milvus/internal/querycoordv2/observers.(*taskDispatcher[...]).schedule.func1.1\n\t/go/src/github.com/milvus-io/milvus/internal/querycoordv2/observers/task_dispatcher.go:101\ngithub.com/milvus-io/milvus/pkg/util/conc.(*Pool[...]).Submit.func1\n\t/go/src/github.com/milvus-io/milvus/pkg/util/conc/pool.go:81\ngithub.com/panjf2000/ants/v2.(*goWorker).run.func1\n\t/go/pkg/mod/github.com/panjf2000/ants/v2@v2.7.2/worker.go:67"]

[2024/10/14 02:53:56.122 +00:00] [WARN] [querynodev2/handlers.go:176] ["failed to query on delegator"] [traceID=bb9c6f884fc9074e5f692c1853becf20] [msgID=453214057574170626] [collectionID=453079480317975900] [channel=by-dev-rootcoord-dml_1_453079480317975900v0] [scope=All] [error="context canceled"]

[2024/10/14 02:53:56.123 +00:00] [WARN] [proxy/task_query.go:420] ["fail to execute query"] [traceID=a156cefb542a3a3858523dc7472332a4] [collection=453079480317986640] [partitionIDs="[]"] [requestType=query] [error="attempt #0: failed to search/query delegator 4 for channel by-dev-rootcoord-dml_1_453079480317986640v1: fail to Query, QueryNode ID = 4, reason=context canceled: context canceled"]

So I have two questions:

  1. What caused Milvus to exit the first time?
  2. Why was the service not working during that period after the restart?

Expected Behavior

When all the Milvus containers are in a healthy status, the service should work properly.

Steps To Reproduce

No response

Milvus Log

Sorry, due to our company's policy, I can only upload the logs in multipart zips in my own repo: https://github.com/Infinite666/temp

Anything else?

No response

yanliang567 commented 1 month ago

@Infinite666 could you please attach the milvus logs for investigation? For Milvus installed with docker-compose, you can use docker-compose logs > milvus.log to export the logs.

/assign @Infinite666

xiaofan-luan commented 1 month ago

1. Please upgrade to 2.3.22. For the current log, the failures are due to slowness in describing indexes, which is most likely related to having too many collections.
2. Please provide all server-side logs for investigation.

Infinite666 commented 1 month ago

> @Infinite666 could you please attach the milvus logs for investigation? For Milvus installed with docker-compose, you can use docker-compose logs > milvus.log to export the logs.
>
> /assign @Infinite666

Sorry I can't upload the logs for now due to our company's policy, but I will try to find a way to upload the logs later.

We have 10 databases and 90 collections with about 7.5 million+ entities in total, and the Milvus volume on disk is about 54 GB. Are the collection count and data size too much for a v2.3.5 standalone Milvus? We also hit another error yesterday; here is the error logged by our service when searching with the Milvus Java SDK (I lost the corresponding Milvus log because I removed the old container):

[2024-10-13 17:19:43.110] [http-nio-12756-exec-222] [(ManagedChannelOrphanWrapper.java:159)] - *~*~*~ Channel ManagedChannelImpl{logId=146626, target=xxx:19530} was not shutdown properly!!! ~*~*~*
    Make sure to call shutdown()/shutdownNow() and wait until awaitTermination() returns true.
java.lang.RuntimeException: ManagedChannel allocation site

Two of our collections have more than 2 million entities each, and it seems we have run into a lot of problems since creating these two big collections. Is the data size too much?
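For what it's worth, that ManagedChannel warning usually means a gRPC channel was garbage-collected without being shut down, i.e. a MilvusServiceClient instance was dropped without calling close(). Below is a minimal sketch of the intended client lifecycle with milvus-sdk-java; the host and port are placeholders, and close() is expected to shut down the underlying channel:

```java
import io.milvus.client.MilvusServiceClient;
import io.milvus.param.ConnectParam;

public class MilvusClientLifecycle {
    public static void main(String[] args) {
        // Build one long-lived client instead of creating a new one per request.
        MilvusServiceClient client = new MilvusServiceClient(
                ConnectParam.newBuilder()
                        .withHost("localhost")   // placeholder host
                        .withPort(19530)
                        .build());
        try {
            // ... issue search/query calls here ...
        } finally {
            // Shut down the underlying gRPC ManagedChannel explicitly; otherwise the
            // "Channel ... was not shutdown properly" warning can appear at GC time.
            client.close();
        }
    }
}
```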

xiaofan-luan commented 1 month ago

2 million entities is not huge, as long as you are not flushing and deleting frequently.

We need the server logs for further investigation. For Milvus installed with docker-compose, you can use docker-compose logs > milvus.log to export the logs; for an embedded Docker deployment, use docker logs.

Infinite666 commented 1 month ago

Here is the server log. Sorry, due to our company's policy, I can only upload the logs as multipart zips in my own repo: https://github.com/Infinite666/temp. You can download them from there, thanks a lot.

yanliang567 commented 1 month ago

I think your etcd service is too slow. Please make sure it is running on SSD volumes:

milvus-etcd | {"level":"warn","ts":"2024-10-14T02:35:48.879Z","caller":"etcdserver/util.go:166","msg":"apply request took too long","took":"3.257843554s","expected-duration":"100ms","prefix":"","request":"header:<ID:7587882063776415505 > put:<key:\"by-dev/kv/gid/timestamp\" value_size:8 >","response":"size:6"}
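If you want to double-check this from the client side, here is a rough sketch that times a few small puts against the same etcd endpoint. The jetcd dependency and the endpoint address are assumptions, not something already in this setup; on SSD-backed etcd these puts typically finish in single-digit milliseconds, while the log above shows an apply taking over 3 seconds:

```java
import io.etcd.jetcd.ByteSequence;
import io.etcd.jetcd.Client;
import io.etcd.jetcd.KV;

import java.nio.charset.StandardCharsets;

public class EtcdLatencyProbe {
    public static void main(String[] args) throws Exception {
        // Endpoint is a placeholder; use the etcd address from your docker-compose file.
        Client client = Client.builder().endpoints("http://localhost:2379").build();
        try {
            KV kv = client.getKVClient();
            ByteSequence key = ByteSequence.from("latency-probe", StandardCharsets.UTF_8);
            ByteSequence value = ByteSequence.from("x", StandardCharsets.UTF_8);

            for (int i = 0; i < 10; i++) {
                long start = System.nanoTime();
                kv.put(key, value).get();  // blocks until the put has been applied
                long elapsedMs = (System.nanoTime() - start) / 1_000_000;
                System.out.println("put #" + i + " took " + elapsedMs + " ms");
            }
        } finally {
            client.close();
        }
    }
}
```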

Infinite666 commented 1 month ago

I just checked our machine, and the volume directory is indeed on an HDD, so is this the root cause of the problem? Why did the service become available again after some time? Does this situation only happen for a short period right after a restart? We also ran into this kind of error before (at that time the Milvus service had already been running for a couple of days); is it also because of the HDD? (screenshot attached: milvus-err)

yanliang567 commented 1 month ago

Yes, please try switching to SSD volumes. If Milvus loses its heartbeat connection with etcd, it fails.
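For context on that heartbeat: Milvus components register themselves in etcd under a lease with a short TTL and must keep renewing it; if etcd is too slow to apply the renewal (for example because its data sits on an HDD), the lease expires, the registration key vanishes, and the rest of the cluster treats the node as offline. Below is a minimal jetcd sketch of that lease/keepalive pattern; the TTL, key name, and endpoint are illustrative assumptions, not Milvus's actual values:

```java
import io.etcd.jetcd.ByteSequence;
import io.etcd.jetcd.Client;
import io.etcd.jetcd.options.PutOption;

import java.nio.charset.StandardCharsets;

public class SessionLeaseSketch {
    public static void main(String[] args) throws Exception {
        Client client = Client.builder().endpoints("http://localhost:2379").build();
        try {
            // Register a "session" key bound to a 10-second lease (TTL is illustrative).
            long leaseId = client.getLeaseClient().grant(10).get().getID();
            ByteSequence key = ByteSequence.from("session/querynode-1", StandardCharsets.UTF_8);
            ByteSequence value = ByteSequence.from("alive", StandardCharsets.UTF_8);
            client.getKVClient()
                  .put(key, value, PutOption.newBuilder().withLeaseId(leaseId).build())
                  .get();

            // The owner must renew the lease well within the TTL. If etcd's disk is so slow
            // that renewals cannot be applied in time, the lease expires and the key
            // disappears, and other components then treat this node as gone.
            for (int i = 0; i < 5; i++) {
                client.getLeaseClient().keepAliveOnce(leaseId).get();
                Thread.sleep(3000);
            }

            boolean stillRegistered = !client.getKVClient().get(key).get().getKvs().isEmpty();
            System.out.println("session key still present: " + stillRegistered);
        } finally {
            client.close();
        }
    }
}
```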

Infinite666 commented 1 month ago

Thanks for the reply; we will try using SSD volumes to see if that gets rid of this problem. Also, I understand that an HDD can slow etcd down, but I wonder why that leads to losing the heartbeat between etcd and Milvus?

stale[bot] commented 2 weeks ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.