
[Bug]: There is always a periodic slow response when requesting milvus. #26076

Closed: Richard-lrg closed this issue 1 year ago

Richard-lrg commented 1 year ago

Is there an existing issue for this?

Environment

- Milvus version: 2.2.11
- Deployment mode (standalone or cluster): cluster
- MQ type (rocksmq, pulsar, or kafka):
- SDK version (e.g. pymilvus v2.0.0rc2):
- OS (Ubuntu or CentOS):
- CPU/Memory:
- GPU:
- Others:

Current Behavior

We migrated our service to the Milvus cluster deployment last Friday, but today we found that requests to Milvus periodically respond very slowly.

DEADLINE_EXCEEDED: deadline exceeded after 19.999962569s
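For context, this error is a client-side gRPC deadline (roughly 20 s), not a server-side message. As a minimal sketch, assuming a pymilvus client, the per-request timeout could be raised while the slowdown is investigated; the host, collection name, vector field, dimension, and search params below are placeholders rather than values from this deployment:

```python
from pymilvus import connections, Collection

# Connect to the Milvus proxy (host/port are placeholders).
connections.connect(alias="default", host="milvus-proxy", port="19530")

collection = Collection("xxx")  # collection name is redacted in the logs below

# `timeout` is a per-request deadline in seconds; raising it past the
# default keeps requests from failing with DEADLINE_EXCEEDED while the
# periodic slow responses are diagnosed.
results = collection.search(
    data=[[0.1] * 128],        # placeholder query vector, dim=128 assumed
    anns_field="embedding",    # placeholder vector field name
    param={"metric_type": "L2", "params": {"nprobe": 16}},
    limit=10,
    timeout=60.0,
)
```

A longer timeout only masks the symptom, of course; the real question is what makes these requests periodically slow.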

So I checked the logs and found many entries like the following on the milvus-proxy:

[2023/08/01 13:28:32.332 +00:00] [WARN] [proxy/task_search.go:439] ["first search failed, updating shardleader caches and retry search"] [traceID=546290382f5b1ad7] [msgId=443258370085617665] [error="All attempts results:\nattempt #1:context canceled\n"]
[2023/08/01 13:28:32.332 +00:00] [INFO] [proxy/meta_cache.go:836] ["clearing shard cache for collection"] [collectionName=xxx]
[2023/08/01 13:28:32.332 +00:00] [WARN] [retry/retry.go:44] ["retry func failed"] ["retry time"=0] [error="All attempts results:\nattempt #1:context canceled\n"]
[2023/08/01 13:28:32.332 +00:00] [WARN] [proxy/task_scheduler.go:473] ["Failed to execute task: "] [error="fail to search on all shard leaders, err=All attempts results:\nattempt #1:All attempts results:\nattempt #1:context canceled\n\nattempt #2:context canceled\n"] [traceID=546290382f5b1ad7]

"expire all shard leader cache" Such logs are very frequent, why is this happening? Is the periodic slow response caused by the cache being freed and then reloaded.

Expected Behavior

No response

Steps To Reproduce

No response

Milvus Log

[2023/08/01 13:28:32.332 +00:00] [WARN] [proxy/task_search.go:439] ["first search failed, updating shardleader caches and retry search"] [traceID=546290382f5b1ad7] [msgId=443258370085617665] [error="All attempts results:\nattempt #1:context canceled\n"]
[2023/08/01 13:28:32.332 +00:00] [INFO] [proxy/meta_cache.go:836] ["clearing shard cache for collection"] [collectionName=xxx]
[2023/08/01 13:28:32.332 +00:00] [WARN] [retry/retry.go:44] ["retry func failed"] ["retry time"=0] [error="All attempts results:\nattempt #1:context canceled\n"]
[2023/08/01 13:28:32.332 +00:00] [WARN] [proxy/task_scheduler.go:473] ["Failed to execute task: "] [error="fail to search on all shard leaders, err=All attempts results:\nattempt #1:All attempts results:\nattempt #1:context canceled\n\nattempt #2:context canceled\n"] [traceID=546290382f5b1ad7]

Anything else?

No response

yanliang567 commented 1 year ago

@Cactus-L quick questions:

  1. During the slow response periods, what requests are running against Milvus? Any insert or delete requests?
  2. Is your Milvus running on exclusive hosts? This helps us understand whether there was any resource contention at the time.
  3. Do you happen to have any screenshots of the Milvus metrics in Grafana? They help us see what was happening in the proxy, querynode, and runtime.
  4. Could you please refer to this doc to export the full Milvus logs for investigation? /assign @Cactus-L
Richard-lrg commented 1 year ago

Answering @yanliang567's questions above:

  1. During the slow response periods, query, insert, and delete requests are all being sent to Milvus.
  2. Yes, my Milvus is running on exclusive hosts.
  3. I haven't set up Grafana monitoring yet; I will set it up, thanks for the suggestion.
  4. I will try it.
yanliang567 commented 1 year ago

If insert/delete requests are being sent to Milvus during those periods, this is expected: newly inserted data stays in growing segments and is searched by brute force until it is sealed and indexed. Please feel free to let us know if there are any updates.
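As a rough illustration of that point, below is a minimal pymilvus sketch (connection details and the collection name are placeholders) that seals the growing segments and waits for index building, so that less data has to be served by brute-force scan on subsequent searches:

```python
from pymilvus import connections, Collection, utility

connections.connect(alias="default", host="milvus-proxy", port="19530")
collection = Collection("xxx")  # placeholder collection name

# Seal the growing segments so freshly inserted rows become eligible for
# indexing instead of being brute-force scanned at search time.
collection.flush()

# Block until index building over the newly sealed segments has finished.
utility.wait_for_index_building_complete("xxx")
```

Frequent manual flushes create many small segments, so treat this as a diagnostic aid rather than a fix; batching inserts is the usual recommendation.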

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.