milvus-io / pymilvus

Python SDK for Milvus.

[QUESTION]: How to filter a sparse vector field (metric_type=IP)? How to run a range search on a sparse field? #2057

Open weiminw opened 6 months ago

weiminw commented 6 months ago

Is there an existing issue for this?

What is your question?

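I have a question: when using search on a sparse field, how can I filter out results that fall below a certain threshold? The sparse field uses inner product (IP) as its metric, but in Milvus the search parameters {"radius": 0.3, "range_filter": 1.0} do not seem to take effect.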

Anything else?

No response

XuanYang-cn commented 4 months ago

/assign @zhengbuqian

weiminw commented 3 months ago

/assign @zhengbuqian

Hello, is there a solution to this issue?

weiminw commented 2 weeks ago

Hello, has this issue been resolved?

zhengbuqian commented 2 weeks ago

@weiminw radius and range_filter should work for sparse vectors now.

Using https://github.com/milvus-io/pymilvus/blob/master/examples/hello_sparse.py, I tried:

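log(fmt.format("Start searching based on vector similarity"))
# Query with the last vector from the last column of entities.
vectors_to_search = entities[-1][-1:]
# Baseline: plain top-k IP search with no range constraints.
search_params = {
    "metric_type": "IP",
    "params": {},
}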

start_time = time.time()
result = hello_sparse.search(vectors_to_search, "embeddings", search_params, limit=30, output_fields=["pk"])
end_time = time.time()

for hits in result:
    for hit in hits:
        print(f"hit: {hit}")
log(search_latency_fmt.format(end_time - start_time))

# Range search: for IP, keep only hits with radius < distance <= range_filter.
search_params = {
    "metric_type": "IP",
    "params": {
        "radius": 0.9,
        "range_filter": 0.94,
    },
}

start_time = time.time()
result = hello_sparse.search(vectors_to_search, "embeddings", search_params, limit=30, output_fields=["pk"])
end_time = time.time()

for hits in result:
    for hit in hits:
        print(f"hit: {hit}")
log(search_latency_fmt.format(end_time - start_time))

and got:

2024-10-14 17:33:15 === Start searching based on vector similarity ===
hit: id: 453218920656710096, distance: 1.4065842628479004, entity: {'pk': '453218920656710096'}
hit: id: 453218920656710170, distance: 1.2275768518447876, entity: {'pk': '453218920656710170'}
hit: id: 453218920656709750, distance: 1.1699843406677246, entity: {'pk': '453218920656709750'}
hit: id: 453218920656710122, distance: 1.0741833448410034, entity: {'pk': '453218920656710122'}
hit: id: 453218920656709754, distance: 0.967951774597168, entity: {'pk': '453218920656709754'}
hit: id: 453218920656710099, distance: 0.9418796896934509, entity: {'pk': '453218920656710099'}
hit: id: 453218920656709841, distance: 0.9369253516197205, entity: {'pk': '453218920656709841'}
hit: id: 453218920656710247, distance: 0.9361521005630493, entity: {'pk': '453218920656710247'}
hit: id: 453218920656709512, distance: 0.9086723923683167, entity: {'pk': '453218920656709512'}
hit: id: 453218920656709854, distance: 0.8783086538314819, entity: {'pk': '453218920656709854'}
hit: id: 453218920656709774, distance: 0.8762619495391846, entity: {'pk': '453218920656709774'}
# ... manually omitting more results
2024-10-14 17:33:15 search latency = 0.4008s
hit: id: 453218920656709841, distance: 0.9369253516197205, entity: {'pk': '453218920656709841'}
hit: id: 453218920656710247, distance: 0.9361521005630493, entity: {'pk': '453218920656710247'}
hit: id: 453218920656709512, distance: 0.9086723923683167, entity: {'pk': '453218920656709512'}
2024-10-14 17:33:16 search latency = 0.1963s

With radius and range_filter set, only 3 results are returned: the hits whose IP distance lies between 0.9 and 0.94.
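
To make the bounds concrete, here is a minimal sketch that replays the range check on the client side, using the distances printed in the first search above. It assumes that for the IP metric a hit is kept when radius < distance <= range_filter, and that radius must be smaller than range_filter. The helper names build_ip_range_params and keep_in_ip_range are hypothetical and are not part of pymilvus; the real filtering happens inside Milvus.

```python
from typing import Dict, List


def build_ip_range_params(radius: float, range_filter: float) -> Dict:
    """Hypothetical helper: build search params for an IP range search."""
    # Assumption: for IP, the lower bound (radius) must be below the
    # upper bound (range_filter); otherwise the range is empty or invalid.
    if not radius < range_filter:
        raise ValueError("for IP, radius must be smaller than range_filter")
    return {
        "metric_type": "IP",
        "params": {"radius": radius, "range_filter": range_filter},
    }


def keep_in_ip_range(
    distances: List[float], radius: float, range_filter: float
) -> List[float]:
    """Hypothetical helper: mimic the range check on a list of IP distances."""
    # Assumed semantics for IP: radius < distance <= range_filter.
    return [d for d in distances if radius < d <= range_filter]


# Distances copied from the unfiltered search output above.
top_k_distances = [
    1.4065842628479004,
    1.2275768518447876,
    1.1699843406677246,
    1.0741833448410034,
    0.967951774597168,
    0.9418796896934509,
    0.9369253516197205,
    0.9361521005630493,
    0.9086723923683167,
    0.8783086538314819,
    0.8762619495391846,
]

search_params = build_ip_range_params(radius=0.9, range_filter=0.94)
kept = keep_in_ip_range(top_k_distances, **search_params["params"])
for d in kept:
    print(f"kept distance: {d}")
```

Under these assumptions the three distances that survive are the same three hits returned by the second search. The search_params dictionary built here has the same shape as the one passed to hello_sparse.search in the snippet above.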