Open yanliang567 opened 6 months ago
/assign @liliu-z /unassign
I changed my test a bit: it now uses a FLAT index for the first search and then rebuilds the collection with the target index type under test. The results are a bit better (about a 50% hit rate), but still not perfect.
Try enlarging the range search param.
/assign @yanliang567
/unassign
range search results are already better than search results.
I understand the approaches are different, but I also think we should align the results between search and range search, or it is very confusing to users. @xiaofan-luan @liliu-z
/assign @xiaofan-luan
besides, we need to align the results in search, range_search, search with pagination, grouping search, etc.
@yanliang567 I don't agree that the results need to be aligned.
Also, we have iterator, search, and range search; I haven't seen a strong signal that we need to align all of their results.
@liliu-z From a user's point of view, they do not care about the algorithms. They are just confused that the results do not match each other, and they do not know which results are accurate.
I don't think it is necessary to make all the search results identical, but it does make sense for them to be at least 90% similar. Otherwise, this may indicate a huge accuracy loss? @yanliang567 @liliu-z
Yes, we have a param that can be tuned to get better recall
let's set the goal to be:
Then we need some improvement here in range search. /assign @liliu-z
/assign
The growing index still does not support fp16/bf16/binary vectors.
Please try again with this code:
@pytest.mark.tags(CaseLabel.L0)
@pytest.mark.parametrize("vector_data_type", ct.all_float_vector_types)
@pytest.mark.parametrize("with_growing", [True, False])
def test_range_search_default(self, index_type, metric, vector_data_type, with_growing):
    """
    target: verify the range search returns correct results
    method: 1. create collection, insert 8000 vectors,
            2. search with topk=1000
            3. range search from the 30th-330th distance as filter
            4. verified the range search results is same as the search results in the range
    """
    counter = 0
    collection_w = self.init_collection_general(prefix, auto_id=True, insert_data=False, is_index=False,
                                                vector_data_type=vector_data_type, with_json=False)[0]
    nb = 2000
    for i in range(3):
        data = cf.gen_general_default_list_data(nb=nb, auto_id=True, start=counter,
                                                vector_data_type=vector_data_type, with_json=False)
        collection_w.insert(data)
        counter = counter + nb
    collection_w.flush()
    _index_params = {"index_type": "FLAT", "metric_type": metric, "params": {}}
    collection_w.create_index(ct.default_float_vec_field_name, index_params=_index_params)
    collection_w.load()
    if with_growing is True:
        # add some growing segments
        for _ in range(2):
            data = cf.gen_general_default_list_data(nb=nb, auto_id=True, start=counter,
                                                    vector_data_type=vector_data_type, with_json=False)
            collection_w.insert(data)
            counter = counter + nb
    search_params = {"params": {}}
    nq = 1
    search_vectors = cf.gen_vectors(nq, ct.default_dim, vector_data_type=vector_data_type)
    search_res = collection_w.search(search_vectors, default_search_field,
                                     search_params, limit=1000)[0]
    assert len(search_res[0].ids) == 1000
    log.debug(f"search topk=1000 returns {len(search_res[0].ids)}")
    check_topk = 300
    check_from = 30
    ids = search_res[0].ids[check_from:check_from + check_topk]
    radius = search_res[0].distances[check_from + check_topk]
    range_filter = search_res[0].distances[check_from]
    # rebuild the collection with test target index
    collection_w.release()
    collection_w.indexes[0].drop()
    _index_params = {"index_type": index_type, "metric_type": metric,
                     "params": cf.get_index_params_params(index_type)}
    collection_w.create_index(ct.default_float_vec_field_name, index_params=_index_params)
    collection_w.load()
    params = cf.get_search_params_params(index_type)
    params.update({"radius": radius, "range_filter": range_filter})
    if index_type == "HNSW":
        params.update({"ef": check_topk+100})
    if index_type == "IVF_PQ":
        params.update({"max_empty_result_buckets": 100})
    range_search_params = {"params": params}
    range_res = collection_w.search(search_vectors, default_search_field,
                                    range_search_params, limit=check_topk)[0]
    range_ids = range_res[0].ids
    # assert len(range_ids) == check_topk
    log.debug(f"cqy: index params: _index_params{_index_params}")
    log.debug(f"cqy: knn ids={ids}")
    log.debug(f"cqy: knn dis={search_res[0].distances[check_from:check_from + check_topk]}")
    log.debug(f"cqy: range_search ids={range_res[0].ids}")
    log.debug(f"cqy: range_search dis={range_res[0].distances}")
    log.debug(f"cqy: range search radius={radius}, range_filter={range_filter}, range results num: {len(range_ids)}")
    hit_rate = round(len(set(ids).intersection(set(range_ids))) / len(set(ids)), 2)
    log.debug(f"cqy: range search results with growing {with_growing} hit rate: {hit_rate}")
    assert hit_rate >= 0.2  # issue #32630 to improve the accuracy
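For reference, here is a minimal standalone sketch of what the test expects, using brute-force search instead of Milvus. It assumes the L2 metric and that range search keeps hits with `range_filter <= distance < radius` (please double check this against the range search docs; for IP/COSINE the bounds are reversed). The helpers `l2`, `exact_knn`, and `exact_range` are hypothetical names for illustration and are not part of the test framework. With an exact search on both sides, the range results should be exactly the kNN slice `[check_from:check_from + check_topk]`, so any gap seen with a real index comes from the approximate index rather than from how `radius` and `range_filter` are derived.

```python
import math
import random


def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def exact_knn(query, vectors, limit):
    """Brute-force top-k: (id, distance) pairs sorted by ascending L2 distance."""
    scored = sorted((l2(query, v), i) for i, v in enumerate(vectors))
    return [(i, d) for d, i in scored[:limit]]


def exact_range(query, vectors, radius, range_filter):
    """Brute-force range search, assuming range_filter <= distance < radius (L2)."""
    scored = sorted((l2(query, v), i) for i, v in enumerate(vectors))
    return [(i, d) for d, i in scored if range_filter <= d < radius]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
vectors = [[rng.random() for _ in range(8)] for _ in range(2000)]
query = [rng.random() for _ in range(8)]

check_from, check_topk = 30, 300
knn = exact_knn(query, vectors, limit=1000)
expected_ids = [i for i, _ in knn[check_from:check_from + check_topk]]

# Same derivation as in the test: outer bound from the first excluded hit,
# inner bound from the first included hit.
radius = knn[check_from + check_topk][1]
range_filter = knn[check_from][1]

range_ids = [i for i, _ in exact_range(query, vectors, radius, range_filter)]
print(len(range_ids), range_ids == expected_ids)
```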
The test case in Milvus may have a problem: when i = 1, it uses the current index to generate the ground truth, instead of FLAT.
some errors in the test script:
row_cnt = 0
nb = 2000
for i in range(3):
    data = cf.gen_general_default_list_data(nb=nb, auto_id=True, start=row_cnt,
                                            vector_data_type=vector_data_type, with_json=False)
    collection_w.insert(data)
    row_cnt += nb
This test script needs an update @yanliang567
/assign @yanliang567
updating and rerunning the tests...
Is there an existing issue for this?
Environment
Current Behavior
My case is
expect: the 300 results are the same in search_res and range_search_results
actual: only 51 of the range_search_results hit search_res[50:350], less than 20%
Expected Behavior
If it is hard to get 100% identical results with search, the hit rate should be >90%.
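To make the metric concrete, here is a small sketch of how the hit rate can be computed. The `hit_rate` helper and the id lists are hypothetical and only mirror the numbers reported above (51 of 300 expected ids returned); they are not real output from Milvus.

```python
def hit_rate(expected_ids, actual_ids):
    """Fraction of expected ids that also appear in actual_ids (order ignored)."""
    expected = set(expected_ids)
    if not expected:
        raise ValueError("expected_ids must not be empty")
    return len(expected & set(actual_ids)) / len(expected)


# Hypothetical ids: 300 expected from the kNN slice, of which only 51
# come back from range search; the rest are unrelated ids.
knn_ids = list(range(50, 350))
range_ids = list(range(50, 101)) + list(range(1000, 1249))

rate = hit_rate(knn_ids, range_ids)
print(f"hit rate: {rate:.2f}")
```

Using sets ignores ordering on purpose: the check only asks whether range search returns the same entities as the kNN slice, not whether they come back in the same order.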
Steps To Reproduce
No response
Milvus Log
test case:
Anything else?
The same situation occurs with the IVF_* and SCANN indexes. HNSW has a better result, with an 80% hit rate. If you check the first 10 results, you will see that range_res has better distances.