Closed elonzh closed 11 months ago
From the monitoring, I've seen the segment count and producer count increasing. It seems collections are being created and data ingested, so index building and compaction taking all of the CPUs makes sense. Could you verify that? Using IVF might help a little with the index-building load; otherwise I think you will need a larger instance for the writes.
Yes, every collection has an index:
index = {
    "index_type": "HNSW",
    "metric_type": "L2",
    "params": {"M": 8, "efConstruction": 64},
}
We create the index right after creating the collection, insert a batch of rows, and call flush after that.
According to your documentation:
By default, Milvus does not index a segment with less than 1,024 rows. To change this parameter, configure rootCoord.minSegmentSizeToEnableIndex in milvus.yaml.
Does index building still happen when the entity count is less than 1,024?
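The documented threshold can be sketched as a tiny helper (the helper name and logic are ours, assuming the default rootCoord.minSegmentSizeToEnableIndex of 1,024 from the quoted docs):

```python
# Hypothetical helper mirroring the documented rule: segments with fewer
# than rootCoord.minSegmentSizeToEnableIndex rows (default 1024) are not
# indexed and are searched by brute force instead.
MIN_SEGMENT_SIZE_TO_ENABLE_INDEX = 1024  # default from milvus.yaml

def will_build_index(num_rows: int,
                     min_rows: int = MIN_SEGMENT_SIZE_TO_ENABLE_INDEX) -> bool:
    """Return True if a sealed segment of num_rows would get an index built."""
    return num_rows >= min_rows

# Collections in this thread hold < 50 entities per PDF, so the HNSW index
# would never actually be built for them:
print(will_build_index(50))    # False
print(will_build_index(2048))  # True
```

If this holds, creating the HNSW index on these tiny collections costs metadata and scheduling overhead without ever producing an index.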
Can we just create a collection without an index (or with a FLAT index) and search with brute-force search?
I am curious why building indexes consumes so much CPU; only several hundred collections were created, and every collection has fewer than 50 entities. The CPU usage stays high even when no collection is being created.
What if you try one collection with all the data and do scalar filtering? Would that help in your scenario? Thousands of collections might bring trouble.
Thousands of collections might bring trouble. Like what?
Do you want to sync up quickly offline? It might be a bit easier to explain. Each collection has a message stream that updates its timetick every 100 ms, which brings extra overhead for the whole system if you have many collections. Also, I wouldn't recommend creating very small collections; Milvus only builds an index on segments with at least 1,024 entities. So you'd be better off with a FLAT index if you only need to search a few thousand entities.
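The timetick overhead can be put into rough numbers (a back-of-the-envelope sketch; the 100 ms interval is from the comment above, the helper is ours):

```python
# Rough estimate of timetick traffic: each collection's message stream
# publishes a timetick every 100 ms, i.e. 10 ticks per second per collection.
TICK_INTERVAL_MS = 100

def timetick_rate(num_collections: int, interval_ms: int = TICK_INTERVAL_MS) -> float:
    """Timetick messages per second across all collections."""
    return num_collections * (1000 / interval_ms)

print(timetick_rate(100))     # 1000.0 msgs/s for a few hundred collections
print(timetick_rate(10_000))  # 100000.0 msgs/s for a 10k-tenant setup
```

That constant background rate would explain CPU staying high even when no collection is being created or written to.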
If you are building a multi-tenant solution with 10k+ tenants, I think logical partitions are what you are looking for, and we are actually working on them: https://github.com/milvus-io/milvus/issues/23553
Following your issue: right now, if we create a single collection with many small segments, does a search with tenant filtering only load that tenant's segments? We just can't search before user data is flushed.
In other words, is a single collection for multi-tenancy cost efficient? We do not want to load everything just to search a single tenant's data.
I think you will still load everything into memory with one collection. But with the new logical partition design you won't need to search the whole collection; we will pick the right tenant to search.
If your tenant count is not huge, you might try partitions. But so far partitions cannot be dynamically loaded/released; you would have to release the whole collection and load it again.
Let me know your scenario, because I definitely want to improve Milvus for this multi-tenant use case.
We are trying to build a product like ChatPDF.
Every PDF file is split into chunks (usually fewer than 50 entities), and vector search is scoped to a single PDF file.
It seems Milvus is not the right solution for such a scenario.
As with traditional databases, you are not going to create a collection for 50 rows, right? You should have one collection with scalar fields pdfID and chunkID. When searching, you can do scalar filtering and retrieve the top-50 most similar PDFs or chunks. That's what you are looking for.
If you have 100M PDFs, are you going to create 100M collections? And how are you going to search? 100M concurrent searches across all collections?
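The suggested pattern (one collection, scalar fields pdfID/chunkID, filter then vector search) can be sketched in plain Python; the brute-force L2 loop below stands in for what Milvus does with a scalar-filter expression, and the data is made up for illustration:

```python
import math

# One "collection" holding chunks from every PDF, each row carrying
# scalar fields (pdf_id, chunk_id) alongside its embedding.
rows = [
    {"pdf_id": "a", "chunk_id": 0, "vec": [0.0, 0.0]},
    {"pdf_id": "a", "chunk_id": 1, "vec": [1.0, 1.0]},
    {"pdf_id": "b", "chunk_id": 0, "vec": [5.0, 5.0]},
]

def l2(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def search(query, pdf_id=None, top_k=2):
    """Scalar filtering first (like expr='pdf_id == ...'), then brute-force L2."""
    candidates = [r for r in rows if pdf_id is None or r["pdf_id"] == pdf_id]
    return sorted(candidates, key=lambda r: l2(query, r["vec"]))[:top_k]

hits = search([0.9, 0.9], pdf_id="a")
print([h["chunk_id"] for h in hits])  # [1, 0]
```

With this layout one search serves any tenant; no per-PDF collection (or per-PDF load) is needed.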
I didn't notice there was a limit on the number of collections; we are in the evaluation stage 🤣.
100m concurrent search on all collections?
I don't understand what you mean. Yes, filtering before a search is better for us. Maybe we should try products like pgvector, because brute-force search is fine for small vector datasets.
Even if you go with pgvector, you are not going to create one table per PDF, right? The right pattern is to create one collection for all documents, search across the whole collection, and find the most related PDFs. Am I correct?
If you take a look at how llama-index or langchain do it, you'll probably have a better idea of what I'm saying. If you put each PDF into a different collection, which collection are you going to pick for the search?
If you already know which PDF you want to search, then I guess pgvector, or whatever vector storage, works for you. You could even retrieve all the embeddings from a traditional database and do a brute-force search in memory. That shouldn't be a big deal.
The reason people want a vector DB is usually that they don't know which PDF is most similar; if you split the data into multiple collections, you don't know which collection to search.
Ok, I know what you mean now; we do in fact know how to find the collection for a PDF (by md5sum).
The conversation with you is very helpful! Thank you very much.
Anyway, we should improve the CPU usage with many collections in a cluster.
/assign @wayblink maybe with the timetick RPC implementation we can combine all the timetick RPCs into one call
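For the md5-based lookup mentioned above, a minimal sketch (the naming scheme is our assumption, stdlib only):

```python
import hashlib

def collection_name_for_pdf(pdf_bytes: bytes) -> str:
    """Derive a deterministic collection name from the PDF's md5sum,
    so the same file always maps to the same collection (hypothetical scheme)."""
    return "pdf_" + hashlib.md5(pdf_bytes).hexdigest()

name = collection_name_for_pdf(b"%PDF-1.4 example")
print(name)
```

The same digest would also work as a scalar pdf_id field in the single-collection layout, which avoids the per-collection overhead discussed in this thread.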
Sure, if you know which PDF to look up, things become much easier! Simply put all the data into MySQL or Postgres and retrieve the vectors from the traditional database. In Milvus we can also query with an expression, but if that's the only operation you are going to use, I would say a traditional database works better. A vector database is designed for ANN search: finding similarity among all embeddings.
/assign @wayblink /unassign
@xiaofan-luan Yes, I already implemented it as you said. Please review https://github.com/milvus-io/milvus/pull/23156
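The batching idea discussed above can be sketched in a few lines (illustration only, in Python; Milvus's actual implementation is in Go, see the PR linked above):

```python
# Instead of one timetick RPC per collection every 100 ms, buffer each
# collection's latest timetick and send a single combined payload per interval.
pending = {}  # collection_id -> latest timetick seen this interval

def report_timetick(collection_id: int, ts: int) -> None:
    """Keep only the newest timetick per collection instead of sending at once."""
    pending[collection_id] = max(ts, pending.get(collection_id, 0))

def flush_batch() -> dict:
    """One combined 'RPC' carrying every collection's latest tick."""
    batch = dict(pending)
    pending.clear()
    return batch

report_timetick(1, 100)
report_timetick(2, 105)
report_timetick(1, 110)
print(flush_batch())  # {1: 110, 2: 105}
```

One call per interval then scales with the tick frequency rather than with the collection count.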
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.
Is there an existing issue for this?
Environment
Current Behavior
I changed the milvus config and reset the data, still not working.
Expected Behavior
No response
Steps To Reproduce
Milvus Log
https://wormhole.app/onvd4#OJRkw87z5RA7pAWlu2VmbQ
Anything else?
No response