milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0

[Bug]: Query performance not stable #18537

Closed: psc0606 closed this issue 1 year ago

psc0606 commented 2 years ago

Is there an existing issue for this?

Environment

- Milvus version: 2.1
- Deployment mode(standalone or cluster): cluster
- SDK version(e.g. pymilvus v2.0.0rc2): pymilvus 2.1, java-sdk-2.1.0-beta4
- OS(Ubuntu or CentOS): CentOS
- CPU/Memory: proxy (3 pods, 1 core and 8 GB per pod), all coords (root, data, index) (1 pod, 4 cores and 8 GB), datanode (10 pods, 2 cores and 8 GB per pod), indexnode (20 pods, 4 cores and 4 GB per pod), querynode (30 pods, 1 core and 8 GB per pod)
- GPU: None
- Others: None

Current Behavior

We have three collections, each with about 52 million (5200w) entities. Each entity has two fields: a 128-dimensional float vector field and an int64 id field. We load each collection with the default 2 shards and the default replica number of 1; the index is IVF_FLAT. The search load is very low: about 10 QPS, with request parameters topK: 1000, metric_type: L2, params: {"nprobe": 64}. Insert and delete traffic is also low (<10 QPS).
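For a sense of scale, here is a back-of-the-envelope estimate of the raw data size described above. It is only a sketch under stated assumptions: "5200w" is read as 52 million, vector components are taken to be 4-byte float32 values, and IVF_FLAT index overhead, growing segments, and per-process overhead are ignored. Under those assumptions each collection holds about 25 GiB of raw data.

```python
# Rough size of the raw data described in this report.
# Assumptions (not confirmed in the thread): "5200w" = 52 million entities,
# float32 vector components, no index or runtime overhead counted.

ENTITIES_PER_COLLECTION = 52_000_000
DIM = 128
BYTES_PER_FLOAT32 = 4
BYTES_PER_INT64 = 8
COLLECTIONS = 3
QUERYNODES = 30          # from the Environment section
GIB = 1024 ** 3


def raw_bytes_per_collection(entities: int, dim: int) -> int:
    """Bytes for one float vector plus one int64 id per entity."""
    return entities * (dim * BYTES_PER_FLOAT32 + BYTES_PER_INT64)


per_collection = raw_bytes_per_collection(ENTITIES_PER_COLLECTION, DIM)
per_collection_gib = per_collection / GIB
total_gib = per_collection * COLLECTIONS / GIB
# Even spread across querynodes with a single replica (an idealization).
per_node_gib = total_gib / QUERYNODES

print(f"per collection: {per_collection_gib:.1f} GiB")
print(f"all collections: {total_gib:.1f} GiB")
print(f"per querynode (even spread, 1 replica): {per_node_gib:.1f} GiB")
```

The even-spread figure is an idealization; the actual segment distribution across querynodes may be less uniform.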

But query latency is unstable: P99 is about 400~600 ms, while a normal query takes only 40~60 ms.
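To illustrate how a small share of slow requests can dominate P99 while the typical latency stays low, here is a minimal sketch using a nearest-rank percentile. The sample latencies are hypothetical and chosen only to resemble the ranges reported above; they are not measurements from this cluster.

```python
import math
import statistics


def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples <= it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical data: 97 fast requests (50 ms) and 3 slow ones (500 ms).
latencies_ms = [50.0] * 97 + [500.0] * 3

median_ms = statistics.median(latencies_ms)
p99_ms = percentile(latencies_ms, 99)

print(f"median: {median_ms} ms, P99: {p99_ms} ms")
```

In this made-up sample, 3% of requests being slow is enough to push P99 to the slow value while the median remains at the fast one.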

Here are our Milvus monitoring dashboards:

[12 monitoring dashboard screenshots]

From the querynode metrics, I can see that query latency is very high:

[2 querynode latency screenshots]

What is happening to my querynodes? Only a few querynodes have very high query latency. Eventually, at 23:41, the cluster became unresponsive, and the Milvus cluster then appears to have crashed.

Expected Behavior

I expect relatively stable and relatively balanced query latency.

Steps To Reproduce

No response

Milvus Log

milvus-2022-08-04_23.log

Anything else?

No response

yanliang567 commented 2 years ago

@psc0606 thank you for the issue. How many vectors (nq) did you use in each search request? Could you please refer to this script to export the complete Milvus logs for investigation?

/assign @czs007 /unassign

psc0606 commented 2 years ago

Only one vector in each search request.
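For reference, the parameters reported in this thread (nq = 1, topK 1000, L2, nprobe 64) can be assembled into search arguments roughly as follows. This is a sketch, not the reporter's actual code: the field name `embedding` and the helpers `build_search_kwargs` and `search_once` are hypothetical, and `collection` is assumed to be an already loaded pymilvus `Collection`, whose `search` method accepts `data`, `anns_field`, `param`, and `limit`.

```python
import random

DIM = 128  # vector dimension stated in the report


def build_search_kwargs(query_vector, anns_field="embedding"):
    """Assemble search arguments; "embedding" is a placeholder field name."""
    return {
        "data": [query_vector],  # a single query vector, so nq = 1
        "anns_field": anns_field,
        "param": {"metric_type": "L2", "params": {"nprobe": 64}},
        "limit": 1000,  # topK
    }


def search_once(collection, query_vector):
    """Run one search on an object exposing a pymilvus-style search() method."""
    return collection.search(**build_search_kwargs(query_vector))


# Random query vector, for illustration only.
query = [random.random() for _ in range(DIM)]
kwargs = build_search_kwargs(query)
print(kwargs["param"], kwargs["limit"], len(kwargs["data"]))
```

Keeping the argument construction separate from the call makes it easy to log or time the exact request that was sent.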

psc0606 commented 2 years ago

Update: added more logs. @yanliang567 I have already uploaded the complete Milvus logs.

psc0606 commented 2 years ago

Another interesting thing: when I load the collection with a replica number of 2, the query P99 drops to only half of what it was before. But adding more replicas does not improve the query latency any further.

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.

yanliang567 commented 2 years ago

@czs007 any updates?

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.