[Bug]: Query performance not stable

psc0606 commented 2 years ago

Is there an existing issue for this?

[X] I have searched the existing issues

Environment

- Milvus version: 2.1
- Deployment mode(standalone or cluster): cluster
- SDK version(e.g. pymilvus v2.0.0rc2): pymilvus 2.1, java-sdk-2.1.0-beta4
- OS(Ubuntu or CentOS): CentOS
- CPU/Memory: proxy(3 pods, 1core and 8G per pod), all coord(root, data, index) (1pod, 4core and 8G), datanode(10 pods, 2core and 8G per pod), indexnode(20 pods, 4core and 4G per pod), querynode(30pods, 1core and 8G per pod)
- GPU: None
- Others: None

Current Behavior

We have three collections, each with about 52 million entities. Each entity has two fields: a 128-dimensional float vector field and an int64 id field. We load each collection with the default 2 shards and the default replicaNumber of 1, using an IVF_FLAT index. The query load is very low (about 10 QPS; my query request params are topK: 1000, metric_type: L2, params: {"nprobe": 64}). Insert and delete requests are also infrequent (<10 QPS).
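For reference, the search request described above could be written with pymilvus roughly as follows. This is only a sketch: the vector field name `embedding` is hypothetical (the issue does not name the field), and `collection` is assumed to be an already loaded `pymilvus.Collection`.

```python
from typing import Any, Dict, List

VECTOR_FIELD = "embedding"  # hypothetical name; the issue does not state the field name
DIM = 128                   # vector dimension stated in the issue


def build_search_kwargs(query_vector: List[float]) -> Dict[str, Any]:
    """Build keyword arguments for Collection.search() matching the reported request."""
    if len(query_vector) != DIM:
        raise ValueError(f"expected a {DIM}-dim vector, got {len(query_vector)}")
    return {
        "data": [query_vector],      # a single query vector per request
        "anns_field": VECTOR_FIELD,
        "param": {"metric_type": "L2", "params": {"nprobe": 64}},
        "limit": 1000,               # topK
    }


def run_search(collection: Any, query_vector: List[float]) -> Any:
    """Send one search; `collection` is assumed to be a loaded pymilvus Collection."""
    return collection.search(**build_search_kwargs(query_vector))
```

Keeping the parameters in one helper makes it easier to vary `nprobe` or `limit` when comparing latency between runs.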

But query latency is unstable: P99 is about 400 to 600 ms, while a normal query takes only 40 to 60 ms.
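To reproduce numbers like these from the client side, a minimal sketch is shown below. It times repeated calls and computes a nearest-rank percentile. `search_fn` is a hypothetical zero-argument callable that would wrap the actual search call; client-side figures include network time, so they will not match the querynode metrics exactly.

```python
import math
import time
from typing import Callable, List


def measure_latencies_ms(search_fn: Callable[[], object], requests: int) -> List[float]:
    """Call search_fn `requests` times and return each call's latency in milliseconds."""
    latencies = []
    for _ in range(requests):
        start = time.perf_counter()
        search_fn()
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100.0))
    return ordered[rank - 1]
```

Comparing `percentile(latencies, 50)` with `percentile(latencies, 99)` over the same run shows how far the tail is from the typical request.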

We have Milvus monitoring:

From querynode, I can see query latency is very high:

What is happening to my querynodes? Only a few querynodes have rather high query latency. Finally, at 23:41, our cluster became unresponsive, and the Milvus cluster then seems to have crashed.

Expected Behavior

I expect relatively stable query latency, and query time that is relatively balanced across querynodes.

Steps To Reproduce

No response

Milvus Log

milvus-2022-08-04_23.log

Anything else?

No response

yanliang567 commented 2 years ago

@psc0606 thank you for the issue. How many vectors (nq) did you use in each search request? Could you please refer to this script to export the whole Milvus logs for investigation?

/assign @czs007 /unassign

psc0606 commented 2 years ago

Only one vector in each search request.

psc0606 commented 2 years ago

Update: added more logs. @yanliang567 I have already uploaded the whole Milvus log.

psc0606 commented 2 years ago

Another interesting thing: when I load the collection with a replicaNumber of 2, the query P99 drops to about half of what it was before. But when I increase replicaNumber further, the query latency does not improve any more.
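For anyone repeating this comparison, changing the replica count with pymilvus could look like the sketch below. It assumes `collection` is a `pymilvus.Collection` and that the collection has to be released before it is loaded again with a different replica number; the helper name `reload_with_replicas` is made up for illustration.

```python
from typing import Any


def reload_with_replicas(collection: Any, replica_number: int) -> None:
    """Release the collection, then load it again with the requested replica number."""
    if replica_number < 1:
        raise ValueError("replica_number must be at least 1")
    collection.release()                            # drop the currently loaded replicas
    collection.load(replica_number=replica_number)  # load with the new replica count
```

Running the same query workload after each reload (for example with 1, 2, and 3 replicas) gives P99 values that can be compared directly.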

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.

yanliang567 commented 2 years ago

@czs007 any updates?

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.
