milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0

[Bug]: Memory leak of flow graphs in DataNode #19587

Closed: bigsheeper closed this issue 2 years ago

bigsheeper commented 2 years ago

Is there an existing issue for this?

Environment

- Milvus version: master (GitCommit:1919353)
- Deployment mode(standalone or cluster): both
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

After a collection is dropped, the flow graph for it in DataNode is never released.

import random
import time

from pymilvus import (
    Collection,
    CollectionSchema,
    DataType,
    FieldSchema,
    connections,
)

connections.connect()

# Collection schema: an int64 primary key, a few scalar fields, and one vector field.
dim = 128
int64_field = FieldSchema(name="int64", dtype=DataType.INT64, is_primary=True)
float_field = FieldSchema(name="float", dtype=DataType.FLOAT)
bool_field = FieldSchema(name="bool", dtype=DataType.BOOL)
string_field = FieldSchema(name="string", dtype=DataType.VARCHAR, max_length=65535)
float_vector = FieldSchema(name="float_vector", dtype=DataType.FLOAT_VECTOR, dim=dim)
schema = CollectionSchema(fields=[int64_field, float_field, bool_field, float_vector])

nb = 1
nq = 2
limit = 10
default_search_params = {"metric_type": "L2", "params": {"nprobe": 10}}
vectors = [[random.random() for _ in range(dim)] for _ in range(nb)]

# Create and immediately drop collections; the flow graphs created for them
# in DataNode are never released.
before = time.time()
for i in range(1):
    print("creating collection %d" % (i + 1))
    collection = Collection("test_collection_rate_limit_binbin_1_%d" % i, schema=schema)
    print("dropping collection %d" % (i + 1))
    collection.drop()

after = time.time()
during = after - before
print(during)

Expected Behavior

No response

Steps To Reproduce

No response

Milvus Log

No response

Anything else?

No response

bigsheeper commented 2 years ago

/unassign @yanliang567 /assign

bigsheeper commented 2 years ago

RootCoord sent CreateCollectionMsg and DropCollectionMsg to the DML msgstream in quick succession, and the message IDs of CreateCollectionMsg and DropCollectionMsg are both 0.


When DataNode seeks from the seekPosition, DropCollectionMsg is purged because its message ID duplicates that of CreateCollectionMsg. Therefore, DataNode never receives the drop message and the flow graph is never released.

https://github.com/milvus-io/milvus/blob/1919353f02d3deb79ea097f02e36371ea77bebeb/internal/mq/msgstream/mq_msgstream.go#L760-L769
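For illustration, here is a minimal, self-contained Go sketch of ID-based duplicate filtering after a seek; it is not the actual msgstream code linked above, just a simplified model of the check described here. With both DDL messages carrying ID 0, the drop message is the one that gets purged:

package main

import "fmt"

// msg is a stand-in for a msgstream message; only the ID matters here.
type msg struct {
    id   int64
    kind string
}

// dedupAfterSeek drops any message whose ID has already been seen,
// mimicking the duplicate check performed while consuming from the
// seek position.
func dedupAfterSeek(msgs []msg) []msg {
    seen := map[int64]bool{}
    kept := make([]msg, 0, len(msgs))
    for _, m := range msgs {
        if seen[m.id] { // duplicate ID: message is purged
            continue
        }
        seen[m.id] = true
        kept = append(kept, m)
    }
    return kept
}

func main() {
    // Both DDL messages were produced with message ID 0.
    stream := []msg{{0, "CreateCollectionMsg"}, {0, "DropCollectionMsg"}}
    fmt.Println(dedupAfterSeek(stream)) // only CreateCollectionMsg survives
}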

bigsheeper commented 2 years ago

Maybe related to #19492

bigsheeper commented 2 years ago

Possible solutions:

  1. assign a message ID to every DDL message when RootCoord produces it;
  2. add a zero-msgID check: only treat a message as a duplicate if msg.ID() != 0 && idset.Contain(msg.ID()) (sketched below);
  3. remove the duplicate check from msgstream entirely.

@xiaofan-luan @congqixia any suggestions?
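A rough sketch of option 2, reusing the msg type from the sketch above; the idset and msg.ID() names from the comment are echoed here as a plain map and field, so this is illustrative rather than the real msgstream API:

// dedupSkippingZeroID treats an ID of 0 as "no ID assigned" and never
// considers it a duplicate, so zero-ID DDL messages are not purged.
func dedupSkippingZeroID(msgs []msg) []msg {
    seen := map[int64]bool{}
    kept := make([]msg, 0, len(msgs))
    for _, m := range msgs {
        if m.id != 0 && seen[m.id] { // only non-zero IDs can collide
            continue
        }
        seen[m.id] = true
        kept = append(kept, m)
    }
    return kept
}

With this change, both CreateCollectionMsg and DropCollectionMsg (ID 0) would pass through, so DataNode would still see the drop message and release the flow graph.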

congqixia commented 2 years ago

@bigsheeper maybe msgID dedup should only apply to DML messages?

xiaofan-luan commented 2 years ago

We should actually remove all the messageID logic. What about using timestamps for deduplication? @bigsheeper @congqixia

bigsheeper commented 2 years ago

> We should actually remove all the messageID logic. What about using timestamps for deduplication? @bigsheeper @congqixia

That would be better.
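To make that direction concrete, a hedged sketch of timestamp-based filtering, assuming every message carries the TSO timestamp RootCoord assigned to it; the tsMsg type and the seekTs parameter are illustrative, not the actual msgstream types:

// tsMsg stands in for a message carrying its TSO timestamp, which is
// assumed here to be unique and monotonically increasing per channel.
type tsMsg struct {
    ts   uint64
    kind string
}

// filterBySeekTimestamp keeps only messages newer than the checkpoint
// timestamp, so no message-ID bookkeeping (and no zero-ID collision) is needed.
func filterBySeekTimestamp(msgs []tsMsg, seekTs uint64) []tsMsg {
    kept := make([]tsMsg, 0, len(msgs))
    for _, m := range msgs {
        if m.ts <= seekTs { // already consumed before the checkpoint
            continue
        }
        kept = append(kept, m)
    }
    return kept
}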