milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0

[Bug]: querynode memory keeps increasing even though the row count does not change much under insert and delete requests #24597

Closed Satyadev592 closed 1 year ago

Satyadev592 commented 1 year ago

Is there an existing issue for this?

Environment

- Milvus version: 2.2.3
- Deployment mode(standalone or cluster): Cluster (AWS, not on k8s)
- MQ type(rocksmq, pulsar or kafka): Pulsar
- SDK version(e.g. pymilvus v2.0.0rc2): 2.2.3
- OS(Ubuntu or CentOS): Ubuntu
- CPU/Memory: 8/64
- GPU: -
- Others: -

Current Behavior

We are noticing an increase in query node RAM utilisation over time. Our day-level operations can be classified as follows:

- 300K inserts
- 300K deletes
- 1M updates (delete followed by insert)

Technically our num_entities is not changing over time, but our query node memory utilisation keeps increasing. Here's a graph illustrating our issue:

[graph: query node memory utilisation increasing over time]
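
For reference, here is a minimal pymilvus sketch of the delete-then-insert update pattern described above; the collection name, field layout, and ids are illustrative placeholders, not our actual schema:

```python
from pymilvus import Collection, connections

# Illustrative only: "my_collection" and the id/embedding fields are placeholders.
connections.connect(host="localhost", port="19530")
collection = Collection("my_collection")

def update(ids, vectors):
    # An "update" in our workload is a delete of the existing ids
    # followed by an insert of new rows carrying the same ids.
    collection.delete(f"id in {list(ids)}")
    collection.insert([list(ids), vectors])
    collection.flush()
```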

We have tried manual compaction every day to see if that solves the issue, but it does not. We are currently doing rolling restarts of the query nodes, which brings them back to around 35 GB of usage.
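
The daily manual compaction we trigger is essentially this (a sketch; the collection name is a placeholder):

```python
from pymilvus import Collection, connections

connections.connect(host="localhost", port="19530")
collection = Collection("my_collection")  # placeholder name

# Trigger compaction of sealed segments and block until it completes.
collection.compact()
collection.wait_for_compaction_completed()
print(collection.get_compaction_state())
```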

Expected Behavior

The memory utilisation for the query nodes should plateau for our use case as the number of entities/vectors is not increasing.
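
A small sketch of the check backing this claim, polling the entity count while memory grows (connection details and collection name are placeholders):

```python
import time

from pymilvus import Collection, connections

connections.connect(host="localhost", port="19530")
collection = Collection("my_collection")  # placeholder name

# The entity count stays roughly flat day over day for our workload,
# while query node memory keeps climbing.
while True:
    print(time.strftime("%F %T"), collection.num_entities)
    time.sleep(3600)
```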

Steps To Reproduce

No response

Milvus Log

No response

Anything else?

No response

yanliang567 commented 1 year ago

@Satyadev592 this sounds like a known issue with compaction and handoff in v2.2.3. Any chance you could run a test on v2.2.8? If it reproduces, could you please attach the logs of all the Milvus pods for investigation? /assign @Satyadev592

yanliang567 commented 1 year ago

We have a similar test scenario in house with the latest build: every day we insert and delete almost the same number of rows, so the total row count does not change much. We can see that the memory of the query node stays stable.

[graph: query node memory staying stable over time]

xiaofan-luan commented 1 year ago

> keeps stable

Try to trigger a compaction and see what's happening?

Using 2.2.8 might be a better choice.

yanliang567 commented 1 year ago

@Satyadev592 we have released v2.2.9, any chance to run a test on the latest release?

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.

Satyadev592 commented 1 year ago

We upgraded to 2.2.9 and thought this issue was behind us, but unfortunately that was short-lived. The issue where memory utilisation spikes up randomly has presented itself yet again. Attaching some graphs for reference.

The memory spikes seem to be correlated with the DDL requests ShowPartitions and DescribeCollection.

[screenshots: query node memory utilisation spikes alongside ShowPartitions/DescribeCollection request-rate graphs]
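
A sketch of how the same DDL traffic can be reproduced from pymilvus, assuming that constructing a Collection issues DescribeCollection and that reading its partitions issues ShowPartitions (the collection name is a placeholder):

```python
from pymilvus import Collection, connections

connections.connect(host="localhost", port="19530")

# Assumption: Collection(...) triggers DescribeCollection and
# .partitions triggers ShowPartitions; "my_collection" is a placeholder.
for _ in range(100):
    collection = Collection("my_collection")
    _ = collection.partitions
```
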
yanliang567 commented 1 year ago

@Satyadev592 Could you please attach an etcd backup for investigation? Check https://github.com/milvus-io/birdwatcher for details on how to back up etcd with birdwatcher.

Satyadev592 commented 1 year ago

Here's the backup as requested: Untitled.txt

xiaofan-luan commented 1 year ago

@Satyadev592 could you try a manual compaction and check whether the memory decreases?
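
For example, something like this sketch (assuming the default Prometheus metrics endpoint on port 9091 is enabled; the host name is a placeholder) can snapshot the query node memory gauges before and after compaction:

```python
import requests

# Assumption: the query node exposes Prometheus metrics at :9091/metrics.
# Run this before and after compaction and compare the memory-related gauges.
metrics = requests.get("http://querynode-host:9091/metrics", timeout=10).text
for line in metrics.splitlines():
    if "memory" in line.lower() and not line.startswith("#"):
        print(line)
```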