milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0
29.53k stars · 2.83k forks

[Bug]: search_iterator taking too much time to give response after some iteration #34816

Open Basir-mahmood opened 2 months ago

Basir-mahmood commented 2 months ago

Is there an existing issue for this?

Environment

- Milvus version: 2.3.13
- Deployment mode(standalone or cluster): cluster
- MQ type(rocksmq, pulsar or kafka): pulsar 
- SDK version(e.g. pymilvus v2.0.0rc2): 2.4.4
- OS(Ubuntu or CentOS): Ubuntu
- Others: 
  -  Index-type : DiskANN

Current Behavior

I am using search_iterator to fetch records in pages, with the DiskANN index. However, after a few iterations (sometimes 2, sometimes 5, depending on the collection), the search_iterator takes far too long to respond; even after several minutes no result is returned and it stays stuck. I am searching within partitions. The batch size is 500 and the limit is 10000, and the search params are "metric_type": "COSINE", "search_list": 1000.

For a collection with 2.6 million records, the first iteration returns in 0.03 milliseconds, but the second takes 176 seconds (almost 3 minutes). For another collection, with almost 85 million records, the first 5 iterations return within milliseconds, but then it gets stuck and I got no response even after waiting a couple of minutes.

Expected Behavior

No response

Steps To Reproduce


from pymilvus import Collection

test_collection = Collection(collection_name)

search_params = {
    "metric_type": "COSINE",
    "search_list": 1000,
}

iterator = test_collection.search_iterator(
    batch_size=500,
    data=[embeddings],
    anns_field="embeddings",
    partition_names=["partition_name"],
    param=search_params,
    limit=10000,
    output_fields=["id"],
)

results = []
while True:
    result = iterator.next()
    if not result:
        iterator.close()
        break
    results.extend(result)

Milvus Log

No response

Anything else?

No response

xiaofan-luan commented 2 months ago

please help reproduce this issue on the latest 2.3.18

I would expect to see longer response times after some iterations, but it shouldn't be longer than 1 second, unless you have a very small range

yanliang567 commented 2 months ago

/assign @Basir-mahmood please keep us posted if any updates on 2.3.18 or 2.4.6 /unassign

Basir-mahmood commented 2 months ago

Thanks for the quick reply. Since the cluster is in production, I want to ask how I can upgrade it without affecting the user experience. There is no helm release for version 2.3.18, so how can I specify that? I am trying to follow this link: https://milvus.io/docs/upgrade_milvus_cluster-helm.md

yanliang567 commented 2 months ago

you can just update the milvus tag to the target milvus image tag in values.yaml

xiaofan-luan commented 2 months ago

@yanliang567 did we have a similar test in house?

I'm also interested in seeing the perf behaviour when we iterate to a later stage, on all the index types

yanliang567 commented 2 months ago

we tested batch size=1000/5000 to iterate over all the entities in a 10-million dataset; the latency increased 2-4x compared to the first iteration, but was still less than 800ms. I attached a screenshot of batch_size=1000 for reference.
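For anyone who wants to reproduce this kind of measurement, a minimal sketch that times each `next()` call of an iterator-like object (the `search_iterator` result is assumed to expose `next()` returning an empty batch when exhausted, and `close()`, as in the reproduce snippet above):

```python
import time

def time_batches(it):
    # Time each next() call of an iterator-like object exposing
    # next() -> batch (a falsy batch means exhausted) and close().
    # Returns per-batch latencies in seconds, in iteration order.
    timings = []
    while True:
        t0 = time.perf_counter()
        batch = it.next()
        dt = time.perf_counter() - t0
        if not batch:
            it.close()
            break
        timings.append(dt)
    return timings
```

Comparing the first entry against the later ones makes the reported slowdown (milliseconds vs. minutes) directly visible.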

xiaofan-luan commented 2 months ago

> we tested batch size=1000/5000 to iterate over all the entities in a 10-million dataset; the latency increased 2-4x compared to the first iteration, but was still less than 800ms. I attached a screenshot of batch_size=1000 for reference.

is this diskann?

yanliang567 commented 2 months ago

the test result above was on an HNSW index. @Basir-mahmood are your querynodes running on NVMe disks, since you are running with the diskann index?

Basir-mahmood commented 2 months ago

@yanliang567 yeah, the query nodes are running on NVMe disks.

Basir-mahmood commented 2 months ago

@yanliang567 For changing the milvus-image-version, should I just change the following configs

image:
  all:
    repository: milvusdb/milvus
    tag: v2.3.18
    pullPolicy: IfNotPresent

helm upgrade milvus-cluster-name zilliztech/milvus -n milvus --reuse-values -f config.yml

And can you please confirm that I would not have to change the milvus-helm version ?

yanliang567 commented 2 months ago

> @yanliang567 For changing the milvus-image-version, should I just change the following configs
>
> image:
>   all:
>     repository: milvusdb/milvus
>     tag: v2.3.18
>     pullPolicy: IfNotPresent
>
> helm upgrade milvus-cluster-name zilliztech/milvus -n milvus --reuse-values -f config.yml
>
> And can you please confirm that I would not have to change the milvus-helm version ?

yes, it is okay to update the tag

stale[bot] commented 3 weeks ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.

Basir-mahmood commented 2 weeks ago

/reopen

sre-ci-robot commented 2 weeks ago

@Basir-mahmood: Reopened this issue.

In response to [this](https://github.com/milvus-io/milvus/issues/34816#issuecomment-2332288781):

> /reopen

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

Basir-mahmood commented 2 weeks ago

I have updated the milvus cluster to milvus-2.3.20, and I am using pymilvus version 2.4.6. But the issue still persists: the first iteration completed in milliseconds, and then the next one took around 19 minutes. There are approximately 105 million records in the collection.

xiaofan-luan commented 2 weeks ago

I don't think older versions of diskann are good at iteration.

We have a special implementation of the iterator for HNSW, but I'm not sure whether we have a similar optimization for diskann.

@yanliang567 can we track the iteration speed of diskann for 2.4.10? @liliu-z could you confirm?

xiaofan-luan commented 2 weeks ago

> we tested batch size=1000/5000 to iterate over all the entities in a 10-million dataset; the latency increased 2-4x compared to the first iteration, but was still less than 800ms. I attached a screenshot of batch_size=1000 for reference.

I think this is more of a result from HNSW

xiaofan-luan commented 2 weeks ago

by the way, what is the use case for your iterator? diskann is usually much slower at large topk. IVFSQ or IVFPQ could be what you are looking for

liliu-z commented 2 weeks ago

The old version of DiskANN is not good at iteration/range search. The new version is in progress and will be released very soon. /assign @alwayslove2013

Basir-mahmood commented 2 weeks ago

@xiaofan-luan Our vector database currently holds around half a billion vectors; the collection I mentioned is one of multiple collections in the database. We are also concerned about accuracy, which is why we don't want to go with quantization-based approaches. And due to the large data size, we also want to avoid memory-based indexes.

Thus, we want to stay with the DiskANN index, and we also want to extract all the records for a search vector, which can be more than 16,000 (the max limit). For that purpose, we were using search_iterator. Is there any other way to do this? Or is there any way to define a time-based search space, so that we search only the newly inserted data (e.g., the last 48 hours)?
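One possible workaround for the "last 48 hours" part, assuming you add a scalar timestamp field at insert time (the field name `insert_ts` and epoch-seconds encoding below are assumptions, not part of the original schema): build a boolean filter expression and pass it via the iterator's `expr` parameter so each search only scans recent rows. A minimal sketch:

```python
import time

def recent_filter(hours: int, ts_field: str = "insert_ts") -> str:
    # Build a Milvus boolean expression that keeps only rows whose
    # (hypothetical) epoch-seconds timestamp field is within `hours`.
    cutoff = int(time.time()) - hours * 3600
    return f"{ts_field} >= {cutoff}"

expr = recent_filter(48)
# e.g. pass expr=expr to search_iterator(...) alongside the other params
```

This does not change the index behavior, but it shrinks the candidate set the iterator has to page through.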

xiaofan-luan commented 2 weeks ago

  1. you can probably use partitions, with each partition holding one day's data
  2. we are working on performance improvements for diskann search, which could potentially solve your problem

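The per-day partition idea can be sketched as follows; the partition naming scheme is purely illustrative (nothing in the thread fixes a format):

```python
from datetime import date, timedelta

def daily_partition(d: date) -> str:
    # One partition per day, e.g. "p_2024_09_05" (name format is an assumption)
    return f"p_{d:%Y_%m_%d}"

def recent_partitions(days: int, today: date) -> list[str]:
    # Partition names covering the last `days` days, newest first;
    # these could be passed as partition_names to search/search_iterator
    return [daily_partition(today - timedelta(days=i)) for i in range(days)]
```

A side benefit of this layout is that old partitions can be dropped wholesale once they age out, which also bounds how much data any single search has to scan.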
Basir-mahmood commented 1 week ago

@xiaofan-luan The maximum number of partitions is limited to 1024. If we stored data by date, this would limit us, so I guess we would have to go with storing data weekly. Is there any other method that would be more appropriate in your opinion?

xiaofan-luan commented 1 week ago

you can actually change the limit to 4096; I believe it's a config that can be changed
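For reference, the partition cap is governed by a server-side setting, likely `rootCoord.maxPartitionNum` in milvus.yaml (key name taken from the Milvus config reference; please verify against your version). With the helm chart, a sketch using the chart's user config override might look like:

```yaml
# values.yaml fragment (assumes the milvus-helm extraConfigFiles mechanism)
extraConfigFiles:
  user.yaml: |+
    rootCoord:
      maxPartitionNum: 4096   # raise from the 1024 default
```

Apply it with the same `helm upgrade ... --reuse-values -f values.yaml` flow used for the image tag change above.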