milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0

[Bug]: Memory usage of multi collections is beyond expected #35753

Open yhmo opened 2 months ago

yhmo commented 2 months ago

Is there an existing issue for this?

Environment

- Milvus version: 2.4.9
- Deployment mode(standalone or cluster): standalone
- MQ type(rocksmq, pulsar or kafka):    
- SDK version(e.g. pymilvus v2.0.0rc2): pymilvus 2.4.5
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

- Vector dimension: 512
- Index type: FLAT
- 100 collections, 5000 vectors per collection
- Raw data size = 512 × 4 × 5000 × 100 bytes ≈ 1 GB
- Observed memory usage: > 5 GB
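
A quick sketch of the raw-data arithmetic above (float32 vectors only, no index or metadata overhead assumed):

```python
# 512-dim float32 vectors: 4 bytes per component.
dim = 512
vectors_per_collection = 5000
collections = 100

raw_bytes = dim * 4 * vectors_per_collection * collections
print(f"raw data: {raw_bytes / 1024**3:.2f} GiB")  # ~0.95 GiB, i.e. roughly 1 GB
```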

Expected Behavior

The memory usage is 5x the raw data size, which is unreasonable. I would expect it to be no more than 2GB.

Steps To Reproduce

Run this script and observe the memory usage:

```python
import random

from pymilvus import (
    connections,
    FieldSchema, CollectionSchema, DataType,
    Collection,
    utility,
)

connections.connect(host="localhost", port=19530)
print(utility.get_server_version())

dim = 512

col_cnt = 100
for i in range(col_cnt):
    collection_name = "col_" + str(i)
    schema = CollectionSchema(fields=[
        FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
        FieldSchema(name="vector", dtype=DataType.FLOAT_VECTOR, dim=dim)
    ])

    utility.drop_collection(collection_name=collection_name)
    collection = Collection(name=collection_name, schema=schema)

    index_params = {
        'metric_type': "L2",
        'index_type': "FLAT",
        'params': {},
    }
    collection.create_index(field_name="vector", index_params=index_params)

    collection.load()
    print("Collection created", i)

batch = 5000
for i in range(100):
    data = [
        [[random.random() for _ in range(dim)] for _ in range(batch)]
    ]
    collection_name = "col_" + str(i%col_cnt)
    collection = Collection(collection_name)
    collection.insert(data=data)
    print("insert", i)

print("finish insert")


### Milvus Log

_No response_

### Anything else?

_No response_
yhmo commented 2 months ago

Standalone deployed by docker-compose; `docker stats` is used to observe memory usage.
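
For the record, a minimal polling sketch for these measurements; it assumes the standalone container is named `milvus-standalone` (adjust to your docker-compose setup) and simply shells out to `docker stats`:

```python
import subprocess
import time

CONTAINER = "milvus-standalone"  # assumed container name; adjust to your compose setup

# Print the container's memory usage every 10 seconds so both the peak
# and the post-stabilization values can be observed.
while True:
    result = subprocess.run(
        ["docker", "stats", "--no-stream", "--format", "{{.MemUsage}}", CONTAINER],
        capture_output=True, text=True, check=True,
    )
    print(time.strftime("%H:%M:%S"), result.stdout.strip())
    time.sleep(10)
```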

My test results in different versions:

| Milvus version | Memory usage |
|----------------|--------------|
| 2.3.13 | 5.9GB |
| 2.4.1 | 6.0GB |
| 2.4.6 | 5.0GB |
| 2.4.9 | 5.2GB |

xiaofan-luan commented 2 months ago

this might be related to chunk size. each segment has a maximum chunk size. tuning that parameter might work.

yhmo commented 2 months ago

> this might be related to chunk size. each segment has a maximum chunk size. tuning that parameter might work.

I only found `queryNode.segcore.chunkRows`; its default value is 128 (rows).

yanliang567 commented 2 months ago

@yhmo can you try to limit the memory to 2GB and see if milvus can still hold so many collections?
/assign @congqixia please keep an eye on this
/unassign

yhmo commented 2 months ago

> @yhmo can you try to limit the memory to 2GB and see if milvus can still hold so many collections?

I set the memory limit to 2GB. The memory quota limit is hit at insert batch No.77: <MilvusException: (code=9, message=quota exceeded[reason=memory quota exceeded, please allocate more resources])>

5000 × 77 × 512 × 4 bytes ≈ 751MB of data was inserted before the failure.
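
A self-contained sketch of how that failure surfaces on the client side (reusing the `col_` naming from the script above; the exception text is the one quoted here):

```python
import random
from pymilvus import Collection, MilvusException, connections

connections.connect(host="localhost", port=19530)

dim, batch = 512, 5000
data = [[[random.random() for _ in range(dim)] for _ in range(batch)]]

try:
    Collection("col_0").insert(data=data)
except MilvusException as e:
    # With the 2GB container limit, inserts eventually fail with
    # code=9, "memory quota exceeded".
    print("insert rejected:", e)
```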

xiaofan-luan commented 2 months ago

You get this memory consumption because:

  1. if the index is not built, the query node and data node each hold the growing data, so it is stored twice (see the sketch below).
  2. memory is consumed while the index is being built (with FLAT this should be less of an issue).
  3. chunks, as we mentioned. But it is still good to know if there are gaps we are not aware of.
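
A rough sketch of point 1 with the numbers from this issue (raw float32 payload only; chunk and allocator overhead ignored, so this is just the raw-data baseline):

```python
# Growing (unflushed) data is held by both the query path and the data path,
# so the ~1 GB raw payload alone accounts for roughly 2 GB before any other overhead.
raw_gib = 512 * 4 * 5000 * 100 / 1024**3   # ~0.95 GiB of float32 vectors
print(f"growing data held twice: ~{2 * raw_gib:.1f} GiB")
```
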
yhmo commented 2 months ago

The index type is FLAT, so no index node is involved; the interim index is built on the query node.

I added a line to flush each collection after insertion and reran the script. The memory usage is 1.44GB, which is expected. So the question is why the memory usage is so high while the data stays in growing segments.

```python
batch = 5000
for i in range(100):
    data = [
        [[random.random() for _ in range(dim)] for _ in range(batch)]
    ]
    collection_name = "col_" + str(i%col_cnt)
    collection = Collection(collection_name)
    collection.insert(data=data)
    print("insert", i)

    collection.flush()  # added: flush after each collection's insert so growing segments are sealed
```

xiaofan-luan commented 2 months ago

maybe you can disable interim index and check?

yhmo commented 2 months ago

> maybe you can disable interim index and check?

It seems it is not caused by the interim index. I just re-tested with the interim index disabled and enabled. The memory usage is 5.xGB after finishing insertion (without flush), and a few minutes later it fell to 2.5GB.

xiaofan-luan commented 2 months ago

Another thing to think about is the memory allocator parameters. Sometimes it's just temporary memory usage that the allocator hasn't released back to the operating system.

yanliang567 commented 2 months ago

/assign @cqy123456 please keep an eye on this, as you are developing the interim index

xiaofan-luan commented 2 months ago

> The index type is FLAT, so no index node is involved; the interim index is built on the query node.
>
> I added a line to flush each collection after insertion and reran the script. The memory usage is 1.44GB, which is expected. So the question is why the memory usage is so high while the data stays in growing segments.

I thought this could probably be because some of the data structures in a growing segment are different, for example the PK index (I guess that's the main reason). We might need jemalloc to check the memory usage.

cqy123456 commented 1 month ago

FLAT will not build an interim index.

cqy123456 commented 1 month ago

I used the above script to test locally, and the interim index had almost no impact on memory usage (index type = FLAT). The peak memory before the system stabilizes is about 5.2GB, and the memory after stabilization is 2.8GB. Using distributed Milvus for memory analysis:

  1. querynode peak mem: 2.3GB, caused by the OS not releasing memory in time; stable mem: 1.4GB
  2. datanode peak mem: 3.4GB, caused by compaction; stable mem: 920MB @yhmo
stale[bot] commented 3 days ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.