Closed elonzh closed 11 months ago
From the monitoring, I've seen the segment count and producer count increasing. It seems collections are being created and data ingested, so index building and compaction taking all of the CPUs makes sense. Could you verify that? Using IVF might help a little with the index-building load; otherwise I think you will need a larger instance for the writes.
Yes, every collection has an index:
index = {
    "index_type": "HNSW",
    "metric_type": "L2",
    "params": {"M": 8, "efConstruction": 64},
}
We create the index right after creating the collection, insert a batch of rows, and call flush after that.
According to your documentation:
By default, Milvus does not index a segment with less than 1,024 rows. To change this parameter, configure rootCoord.minSegmentSizeToEnableIndex in milvus.yaml.
Does index building still happen when the entity count is less than 1,024?
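The documented threshold can be sketched as a tiny helper (the helper name and logic are ours, assuming the default rootCoord.minSegmentSizeToEnableIndex of 1,024 from the quoted docs):

```python
# Hypothetical helper mirroring the documented rule: segments with fewer
# than rootCoord.minSegmentSizeToEnableIndex rows (default 1024) are not
# indexed and are searched by brute force instead.
MIN_SEGMENT_SIZE_TO_ENABLE_INDEX = 1024  # default from milvus.yaml

def will_build_index(num_rows: int,
                     min_rows: int = MIN_SEGMENT_SIZE_TO_ENABLE_INDEX) -> bool:
    """Return True if a sealed segment of num_rows would get an index built."""
    return num_rows >= min_rows

# Collections in this thread hold < 50 entities per PDF, so the HNSW index
# would never actually be built for them:
print(will_build_index(50))    # False
print(will_build_index(2048))  # True
```

If this holds, creating the HNSW index on these tiny collections costs metadata and scheduling overhead without ever producing an index.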
Can we just create a collection without an index (or with a FLAT index) and search with brute-force search?
I am curious why building indexes consumes so much CPU; only several hundred collections were created, and every collection has fewer than 50 entities. The CPU usage stays high even when no collection is being created.
What if you try one collection with all the data and do scalar filtering? Would that help in your scenario? Thousands of collections might bring trouble.
Thousands of collections might bring trouble. Like what?
Do you want to sync up quickly offline? It might be a bit easier to explain. Each collection has a message stream that updates its timetick every 100 ms, which brings extra overhead for the whole system if you have many collections. Also, I wouldn't recommend creating very small collections; Milvus only builds an index on segments with at least 1,024 entities. So you'd be better off with a FLAT index if you only need to search a few thousand entities.
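The timetick overhead can be put into rough numbers (a back-of-the-envelope sketch; the 100 ms interval is from the comment above, the helper is ours):

```python
# Rough estimate of timetick traffic: each collection's message stream
# publishes a timetick every 100 ms, i.e. 10 ticks per second per collection.
TICK_INTERVAL_MS = 100

def timetick_rate(num_collections: int, interval_ms: int = TICK_INTERVAL_MS) -> float:
    """Timetick messages per second across all collections."""
    return num_collections * (1000 / interval_ms)

print(timetick_rate(100))     # 1000.0 msgs/s for a few hundred collections
print(timetick_rate(10_000))  # 100000.0 msgs/s for a 10k-tenant setup
```

That constant background rate would explain CPU staying high even when no collection is being created or written to.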
If you are building a multi-tenant solution with 10k+ tenants, I think logical partitions are what you are looking for, and we are actually working on them: https://github.com/milvus-io/milvus/issues/23553
Following your issue: right now, if we create a single collection with many small segments, does a search with tenant filtering only load that tenant's segments? We just can't search before user data is flushed.
In other words, is a single collection for multi-tenancy cost efficient? We do not want to load everything just to search a single tenant's data.
I think you will still load everything into memory with one collection. But with the new logical partition design you won't need to search the whole collection; we will pick the right tenant to search.
If your tenant count is not huge, you might try partitions. But so far partitions cannot be dynamically loaded/released; you would have to release the whole collection and load it again.
Let me know your scenario, because I definitely want to improve Milvus for this multi-tenant use case.
We are trying to build a product like ChatPDF.
Every PDF file is split into chunks (usually fewer than 50 entities), and vector search is scoped to a single PDF file.
It seems Milvus is not the right solution for such a scenario.
As with traditional databases, you are not going to create a collection for 50 rows, right? You should have one collection with scalar fields pdfID and chunkID. When searching, you can do scalar filtering and retrieve the top-50 most similar PDFs or chunks. That's what you are looking for.
If you have 100M PDFs, are you going to create 100M collections? And how are you going to search? 100M concurrent searches across all collections?
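The suggested pattern (one collection, scalar fields pdfID/chunkID, filter then vector search) can be sketched in plain Python; the brute-force L2 loop below stands in for what Milvus does with a scalar-filter expression, and the data is made up for illustration:

```python
import math

# One "collection" holding chunks from every PDF, each row carrying
# scalar fields (pdf_id, chunk_id) alongside its embedding.
rows = [
    {"pdf_id": "a", "chunk_id": 0, "vec": [0.0, 0.0]},
    {"pdf_id": "a", "chunk_id": 1, "vec": [1.0, 1.0]},
    {"pdf_id": "b", "chunk_id": 0, "vec": [5.0, 5.0]},
]

def l2(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def search(query, pdf_id=None, top_k=2):
    """Scalar filtering first (like expr='pdf_id == ...'), then brute-force L2."""
    candidates = [r for r in rows if pdf_id is None or r["pdf_id"] == pdf_id]
    return sorted(candidates, key=lambda r: l2(query, r["vec"]))[:top_k]

hits = search([0.9, 0.9], pdf_id="a")
print([h["chunk_id"] for h in hits])  # [1, 0]
```

With this layout one search serves any tenant; no per-PDF collection (or per-PDF load) is needed.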
I didn't notice there was a limit on the number of collections; we are in the evaluation stage 🤣.
100m concurrent search on all collections?
I don't understand what you mean. Yes, filtering before a search is better for us. Maybe we should try products like pgvector, because brute-force search is fine for small vector datasets.
Even if you go with pgvector, you are not going to create one table per PDF, right? The right pattern is to create one collection for all documents, search across the whole collection, and find the most related PDFs. Am I correct?
If you take a look at how llama-index or langchain do it, you'll probably have a better idea of what I'm saying. If you put each PDF into a different collection, which collection are you going to pick for the search?
If you already know which PDF you want to search, then I guess pgvector, or whatever vector storage, works for you. You could even retrieve all the embeddings from a traditional database and do a brute-force search in memory. That shouldn't be a big deal.
The reason people want a vector DB is usually that they don't know which PDF is most similar; if you split the data into multiple collections, you don't know which collection to search.
Ok, I know what you mean now; we do in fact know how to find the collection for a PDF (by md5sum).
The conversation with you is very helpful! Thank you very much.
Anyway, we should improve the CPU usage with many collections in a cluster.
/assign @wayblink maybe with the timetick RPC implementation we can combine all the timetick RPCs into one call
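For the md5-based lookup mentioned above, a minimal sketch (the naming scheme is our assumption, stdlib only):

```python
import hashlib

def collection_name_for_pdf(pdf_bytes: bytes) -> str:
    """Derive a deterministic collection name from the PDF's md5sum,
    so the same file always maps to the same collection (hypothetical scheme)."""
    return "pdf_" + hashlib.md5(pdf_bytes).hexdigest()

name = collection_name_for_pdf(b"%PDF-1.4 example")
print(name)
```

The same digest would also work as a scalar pdf_id field in the single-collection layout, which avoids the per-collection overhead discussed in this thread.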
Sure, if you know which PDF to look up, things become much easier! Simply put all the data into MySQL or Postgres and retrieve the vectors from the traditional database. In Milvus we can also query with an expression, but if that's the only operation you are going to use, I would say a traditional database works better. A vector database is designed for ANN search: finding similarity among all embeddings.
/assign @wayblink /unassign
@xiaofan-luan Yes, I already implemented it as you said. Please review https://github.com/milvus-io/milvus/pull/23156
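The batching idea discussed above can be sketched in a few lines (illustration only, in Python; Milvus's actual implementation is in Go, see the PR linked above):

```python
# Instead of one timetick RPC per collection every 100 ms, buffer each
# collection's latest timetick and send a single combined payload per interval.
pending = {}  # collection_id -> latest timetick seen this interval

def report_timetick(collection_id: int, ts: int) -> None:
    """Keep only the newest timetick per collection instead of sending at once."""
    pending[collection_id] = max(ts, pending.get(collection_id, 0))

def flush_batch() -> dict:
    """One combined 'RPC' carrying every collection's latest tick."""
    batch = dict(pending)
    pending.clear()
    return batch

report_timetick(1, 100)
report_timetick(2, 105)
report_timetick(1, 110)
print(flush_batch())  # {1: 110, 2: 105}
```

One call per interval then scales with the tick frequency rather than with the collection count.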
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.
Is there an existing issue for this?
Environment
Current Behavior
I changed the milvus config and reset the data, still not working.
Expected Behavior
No response
Steps To Reproduce
Milvus Log
https://wormhole.app/onvd4#OJRkw87z5RA7pAWlu2VmbQ
Anything else?
No response