milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0

[Bug]: Service exception #34995

Closed hotwinters closed 1 month ago

hotwinters commented 2 months ago

Is there an existing issue for this?

Environment

- Milvus version: 2.3.16
- Deployment mode(standalone or cluster): standalone
- MQ type(rocksmq, pulsar or kafka):
- SDK version(e.g. pymilvus v2.0.0rc2): Go SDK
- OS(Ubuntu or CentOS): CentOS 7
- CPU/Memory: 8 cores / 32 GB
- GPU:
- Others:

Current Behavior

Every few seconds, Milvus's CPU usage spikes to about 5 cores and then drops back. The SDK shows high latency when creating and deleting collections. I restarted my Go SDK server and the latency persisted; only after restarting Milvus did functionality return to normal.
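
For reference, this is roughly how the create/drop latency can be measured from the Go SDK. It is a minimal sketch assuming milvus-sdk-go v2; the server address, collection name, and schema are placeholders, not taken from the report.

```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/milvus-io/milvus-sdk-go/v2/client"
	"github.com/milvus-io/milvus-sdk-go/v2/entity"
)

func main() {
	ctx := context.Background()
	c, err := client.NewGrpcClient(ctx, "localhost:19530") // assumed address
	if err != nil {
		log.Fatal(err)
	}
	defer c.Close()

	// Placeholder schema: an auto-ID primary key plus one vector field.
	schema := &entity.Schema{
		CollectionName: "latency_probe",
		Fields: []*entity.Field{
			{Name: "id", DataType: entity.FieldTypeInt64, PrimaryKey: true, AutoID: true},
			{Name: "vec", DataType: entity.FieldTypeFloatVector, TypeParams: map[string]string{"dim": "128"}},
		},
	}

	// Time one create/drop round-trip; during the reported incident
	// these calls would show the multi-second latency.
	start := time.Now()
	if err := c.CreateCollection(ctx, schema, 1); err != nil {
		log.Fatal(err)
	}
	log.Printf("CreateCollection took %v", time.Since(start))

	start = time.Now()
	if err := c.DropCollection(ctx, "latency_probe"); err != nil {
		log.Fatal(err)
	}
	log.Printf("DropCollection took %v", time.Since(start))
}
```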

Expected Behavior

No response

Steps To Reproduce

No response

Milvus Log

[2024/07/25 06:17:40.538 +00:00] [ERROR] [retry/retry.go:46] ["retry func failed"] ["retry time"=0] [error="stack trace: /go/src/github.com/milvus-io/milvus/pkg/tracer/stack_trace.go:51 github.com/milvus-io/milvus/pkg/tracer.StackTrace\n/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:553 github.com/milvus-io/milvus/internal/util/grpcclient.(ClientBase[...]).Call\n/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:567 github.com/milvus-io/milvus/internal/util/grpcclient.(ClientBase[...]).ReCall\n/go/src/github.com/milvus-io/milvus/internal/distributed/datacoord/client/client.go:104 github.com/milvus-io/milvus/internal/distributed/datacoord/client.wrapGrpcCall[...]\n/go/src/github.com/milvus-io/milvus/internal/distributed/datacoord/client/client.go:336 github.com/milvus-io/milvus/internal/distributed/datacoord/client.(Client).GetRecoveryInfoV2\n/go/src/github.com/milvus-io/milvus/internal/querycoordv2/meta/coordinator_broker.go:145 github.com/milvus-io/milvus/internal/querycoordv2/meta.(CoordinatorBroker).GetRecoveryInfoV2\n/go/src/github.com/milvus-io/milvus/internal/querycoordv2/meta/target_manager.go:209 github.com/milvus-io/milvus/internal/querycoordv2/meta.(TargetManager).PullNextTargetV2.func1\n/go/src/github.com/milvus-io/milvus/pkg/util/retry/retry.go:44 github.com/milvus-io/milvus/pkg/util/retry.Do\n/go/src/github.com/milvus-io/milvus/internal/querycoordv2/meta/target_manager.go:238 github.com/milvus-io/milvus/internal/querycoordv2/meta.(TargetManager).PullNextTargetV2\n/go/src/github.com/milvus-io/milvus/internal/querycoordv2/meta/target_manager.go:119 github.com/milvus-io/milvus/internal/querycoordv2/meta.(TargetManager).UpdateCollectionNextTarget: rpc error: code = DeadlineExceeded desc = context deadline exceeded"] [errorVerbose="stack trace: /go/src/github.com/milvus-io/milvus/pkg/tracer/stack_trace.go:51 github.com/milvus-io/milvus/pkg/tracer.StackTrace: rpc error: code = DeadlineExceeded desc = context deadline exceeded\n(1) attached stack trace\n -- stack trace:\n | github.com/milvus-io/milvus/internal/util/grpcclient.(ClientBase[...]).Call\n | \t/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:553\n | github.com/milvus-io/milvus/internal/util/grpcclient.(ClientBase[...]).ReCall\n | \t/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:567\n | github.com/milvus-io/milvus/internal/distributed/datacoord/client.wrapGrpcCall[...]\n | \t/go/src/github.com/milvus-io/milvus/internal/distributed/datacoord/client/client.go:104\n | github.com/milvus-io/milvus/internal/distributed/datacoord/client.(Client).GetRecoveryInfoV2\n | \t/go/src/github.com/milvus-io/milvus/internal/distributed/datacoord/client/client.go:336\n | github.com/milvus-io/milvus/internal/querycoordv2/meta.(CoordinatorBroker).GetRecoveryInfoV2\n | \t/go/src/github.com/milvus-io/milvus/internal/querycoordv2/meta/coordinator_broker.go:145\n | github.com/milvus-io/milvus/internal/querycoordv2/meta.(TargetManager).PullNextTargetV2.func1\n | \t/go/src/github.com/milvus-io/milvus/internal/querycoordv2/meta/target_manager.go:209\n | github.com/milvus-io/milvus/pkg/util/retry.Do\n | \t/go/src/github.com/milvus-io/milvus/pkg/util/retry/retry.go:44\n | github.com/milvus-io/milvus/internal/querycoordv2/meta.(TargetManager).PullNextTargetV2\n | \t/go/src/github.com/milvus-io/milvus/internal/querycoordv2/meta/target_manager.go:238\n | github.com/milvus-io/milvus/internal/querycoordv2/meta.(TargetManager).UpdateCollectionNextTarget\n | 
\t/go/src/github.com/milvus-io/milvus/internal/querycoordv2/meta/target_manager.go:119\n | github.com/milvus-io/milvus/internal/querycoordv2/observers.(TargetObserver).updateNextTarget\n | \t/go/src/github.com/milvus-io/milvus/internal/querycoordv2/observers/target_observer.go:281\n | github.com/milvus-io/milvus/internal/querycoordv2/observers.(TargetObserver).check\n | \t/go/src/github.com/milvus-io/milvus/internal/querycoordv2/observers/target_observer.go:201\n | github.com/milvus-io/milvus/internal/querycoordv2/observers.(taskDispatcher[...]).schedule.func1.1\n | \t/go/src/github.com/milvus-io/milvus/internal/querycoordv2/observers/task_dispatcher.go:101\n | github.com/milvus-io/milvus/pkg/util/conc.(Pool[...]).Submit.func1\n | \t/go/src/github.com/milvus-io/milvus/pkg/util/conc/pool.go:81\n | github.com/panjf2000/ants/v2.(goWorker).run.func1\n | \t/go/pkg/mod/github.com/panjf2000/ants/v2@v2.7.2/worker.go:67\n | runtime.goexit\n | \t/usr/local/go/src/runtime/asm_amd64.s:1598\nWraps: (2) stack trace: /go/src/github.com/milvus-io/milvus/pkg/tracer/stack_trace.go:51 github.com/milvus-io/milvus/pkg/tracer.StackTrace\n | /go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:553 github.com/milvus-io/milvus/internal/util/grpcclient.(ClientBase[...]).Call\n | /go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:567 github.com/milvus-io/milvus/internal/util/grpcclient.(ClientBase[...]).ReCall\n | /go/src/github.com/milvus-io/milvus/internal/distributed/datacoord/client/client.go:104 github.com/milvus-io/milvus/internal/distributed/datacoord/client.wrapGrpcCall[...]\n | /go/src/github.com/milvus-io/milvus/internal/distributed/datacoord/client/client.go:336 github.com/milvus-io/milvus/internal/distributed/datacoord/client.(Client).GetRecoveryInfoV2\n | /go/src/github.com/milvus-io/milvus/internal/querycoordv2/meta/coordinator_broker.go:145 github.com/milvus-io/milvus/internal/querycoordv2/meta.(CoordinatorBroker).GetRecoveryInfoV2\n | /go/src/github.com/milvus-io/milvus/internal/querycoordv2/meta/target_manager.go:209 github.com/milvus-io/milvus/internal/querycoordv2/meta.(TargetManager).PullNextTargetV2.func1\n | /go/src/github.com/milvus-io/milvus/pkg/util/retry/retry.go:44 github.com/milvus-io/milvus/pkg/util/retry.Do\n | /go/src/github.com/milvus-io/milvus/internal/querycoordv2/meta/target_manager.go:238 github.com/milvus-io/milvus/internal/querycoordv2/meta.(TargetManager).PullNextTargetV2\n | /go/src/github.com/milvus-io/milvus/internal/querycoordv2/meta/target_manager.go:119 github.com/milvus-io/milvus/internal/querycoordv2/meta.(TargetManager).UpdateCollectionNextTarget\nWraps: (3) rpc error: code = DeadlineExceeded desc = context deadline exceeded\nError types: (1) withstack.withStack (2) errutil.withPrefix (3) status.Error"] 
[stack="github.com/milvus-io/milvus/pkg/util/retry.Do\n\t/go/src/github.com/milvus-io/milvus/pkg/util/retry/retry.go:46\ngithub.com/milvus-io/milvus/internal/querycoordv2/meta.(TargetManager).PullNextTargetV2\n\t/go/src/github.com/milvus-io/milvus/internal/querycoordv2/meta/target_manager.go:238\ngithub.com/milvus-io/milvus/internal/querycoordv2/meta.(TargetManager).UpdateCollectionNextTarget\n\t/go/src/github.com/milvus-io/milvus/internal/querycoordv2/meta/target_manager.go:119\ngithub.com/milvus-io/milvus/internal/querycoordv2/observers.(TargetObserver).updateNextTarget\n\t/go/src/github.com/milvus-io/milvus/internal/querycoordv2/observers/target_observer.go:281\ngithub.com/milvus-io/milvus/internal/querycoordv2/observers.(TargetObserver).check\n\t/go/src/github.com/milvus-io/milvus/internal/querycoordv2/observers/target_observer.go:201\ngithub.com/milvus-io/milvus/internal/querycoordv2/observers.(taskDispatcher[...]).schedule.func1.1\n\t/go/src/github.com/milvus-io/milvus/internal/querycoordv2/observers/task_dispatcher.go:101\ngithub.com/milvus-io/milvus/pkg/util/conc.(Pool[...]).Submit.func1\n\t/go/src/github.com/milvus-io/milvus/pkg/util/conc/pool.go:81\ngithub.com/panjf2000/ants/v2.(goWorker).run.func1\n\t/go/pkg/mod/github.com/panjf2000/ants/v2@v2.7.2/worker.go:67"]

Anything else?

No response

yanliang567 commented 2 months ago

@hotwinters quick questions:

  1. By how many seconds did each request's latency increase?
  2. What did you do before the latency increased? What are the steps to reproduce?
  3. The error above is a timeout, so please provide the full Milvus logs so we can investigate. Could you please refer to this doc to export the whole Milvus logs? For Milvus installed with docker-compose, you can use `docker-compose logs > milvus.log` to export them. /unassign

hotwinters commented 2 months ago

log.zip @yanliang567 This is a compressed RAR file that exports only a portion of the logs. I'm not sure whether it contains the information you need, but I noticed that during the exception period it takes about 8 to 20 seconds to create and load a new collection.

xiaofan-luan commented 2 months ago

I don't have a tool to uncompress RAR files.

How many collections currently exist in your cluster?

xiaofan-luan commented 2 months ago

/assign @smellthemoon

Could you help with this?

hotwinters commented 2 months ago

log.zip @xiaofan-luan @smellthemoon I converted it to a zip file and re-uploaded it. I use standalone, with about 130 collections in memory; I delete 60 temporary collections every 20 minutes and then create another 60.
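
For reference, the churn described above looks roughly like the sketch below (milvus-sdk-go v2 assumed; the `tmp_col_%d` naming scheme and the schema are hypothetical illustrations of the described workload, not the actual code):

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/milvus-io/milvus-sdk-go/v2/client"
	"github.com/milvus-io/milvus-sdk-go/v2/entity"
)

// churn mimics the reported workload: every 20 minutes, drop 60
// temporary collections and recreate them with the same schema.
func churn(ctx context.Context, c client.Client, schema *entity.Schema) {
	ticker := time.NewTicker(20 * time.Minute)
	defer ticker.Stop()
	for range ticker.C {
		for i := 0; i < 60; i++ {
			name := fmt.Sprintf("tmp_col_%d", i) // hypothetical naming scheme
			_ = c.DropCollection(ctx, name)      // ignore "not found" on the first round
			s := *schema                         // shallow copy so each collection gets its own name
			s.CollectionName = name
			if err := c.CreateCollection(ctx, &s, 1); err != nil {
				log.Printf("create %s failed: %v", name, err)
			}
		}
	}
}

func main() {
	ctx := context.Background()
	c, err := client.NewGrpcClient(ctx, "localhost:19530") // assumed address
	if err != nil {
		log.Fatal(err)
	}
	defer c.Close()

	// Placeholder schema: auto-ID primary key plus one vector field.
	schema := &entity.Schema{
		Fields: []*entity.Field{
			{Name: "id", DataType: entity.FieldTypeInt64, PrimaryKey: true, AutoID: true},
			{Name: "vec", DataType: entity.FieldTypeFloatVector, TypeParams: map[string]string{"dim": "128"}},
		},
	}
	churn(ctx, c, schema)
}
```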

hotwinters commented 2 months ago

log-cpu-100.zip Today I found that the CPU usage has stayed pinned at 100%; this is the latest log.

hotwinters commented 2 months ago

Why does the CPU stay at 100% once there are too many collections?

smellthemoon commented 2 months ago

> Why does the CPU stay at 100% once there are too many collections?

After restarting, Milvus recovers its metadata and reloads the previous data. But because collections were created so frequently, too many tasks piled up in the task pool on the querynode side: latency increased, CPU usage kept rising, and eventually Milvus restarted.
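
For intuition only: the stack trace above shows the querynode scheduling this work through a bounded goroutine pool (`conc.Pool` over `ants/v2`). The toy sketch below, with a made-up pool size and task duration, shows how a fixed-size pool backpressures callers once submissions outpace the workers, which matches the latency climb described here:

```go
package main

import (
	"log"
	"time"

	"github.com/panjf2000/ants/v2"
)

func main() {
	// Far fewer workers than submitted tasks, loosely analogous to the
	// querynode task pool under constant collection churn.
	pool, err := ants.NewPool(4)
	if err != nil {
		log.Fatal(err)
	}
	defer pool.Release()

	for i := 0; i < 40; i++ {
		i := i
		submitted := time.Now()
		// By default pool.Submit blocks while all workers are busy, so
		// once the pool saturates, every caller is backpressured and
		// end-to-end latency climbs.
		if err := pool.Submit(func() {
			log.Printf("task %d waited %v before running", i, time.Since(submitted))
			time.Sleep(500 * time.Millisecond) // simulated per-task work
		}); err != nil {
			log.Printf("submit %d failed: %v", i, err)
		}
	}
	time.Sleep(6 * time.Second) // crude wait for remaining tasks to finish
}
```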

smellthemoon commented 2 months ago

Frequent creation and deletion of collections can cause problems and side effects like this. If necessary, you can run Milvus in cluster mode and add machines to improve performance. @hotwinters

stale[bot] commented 1 month ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.