milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0

[Bug]: Search returns empty for superstructure and substructure metrics #18283

Closed binbinlv closed 2 years ago

binbinlv commented 2 years ago

Is there an existing issue for this?

Environment

- Milvus version: latest
- Deployment mode(standalone or cluster):both
- SDK version(e.g. pymilvus v2.0.0rc2): pymilvus 2.1.0.dev97
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

Search returns empty for superstructure and substructure metrics:

[2022-07-15 04:14:01 - DEBUG - ci_test]: (api_request)  : [Collection.search] args: [[b'x> \xeeU\xc8}\xa3$\x1f,~\xfc\xb2\x97.', b'k\xaf\xf3\x82\xb1p\x9el\xb6\xb1\x8b\xb3\xfb\xe5Pl'], 'binary_vector', {'metric_type': 'SUPERSTRUCTURE', 'params': {'nprobe': 10}}, 10, 'int64 >= 0', None, None, 20, -1], kwargs: {'_async': False, 'travel_timestamp': 434597641679536134} (api_request.py:55)
[2022-07-15 04:14:01 - DEBUG - ci_test]: (api_response) : ['[]', '[]']  (api_request.py:27)

Expected Behavior

The search returns the topK results successfully.

Steps To Reproduce

    @pytest.mark.tags(CaseLabel.L2)
    @pytest.mark.parametrize("index", ["BIN_FLAT"])
    def test_search_binary_substructure_flat_index(self, nq, dim, auto_id, _async, index, is_flush):
        """
        target: search binary_collection, and check the result: distance
        method: compare the return distance value with value computed with SUBSTRUCTURE
        expected: the return distance equals to the computed value
        """
        # 1. initialize with binary data
        collection_w, _, binary_raw_vector, insert_ids, time_stamp = self.init_collection_general(prefix, True, 2,
                                                                                                  is_binary=True,
                                                                                                  auto_id=auto_id,
                                                                                                  dim=dim,
                                                                                                  is_index=True,
                                                                                                  is_flush=is_flush)[0:5]
        # 2. create index
        default_index = {"index_type": index, "params": {"nlist": 128}, "metric_type": "SUBSTRUCTURE"}
        collection_w.create_index("binary_vector", default_index)
        collection_w.load()
        # 3. compute the distance
        query_raw_vector, binary_vectors = cf.gen_binary_vectors(3000, dim)
        distance_0 = cf.substructure(query_raw_vector[0], binary_raw_vector[0])
        distance_1 = cf.substructure(query_raw_vector[0], binary_raw_vector[1])
        # 4. search and compare the distance
        search_params = {"metric_type": "SUBSTRUCTURE", "params": {"nprobe": 10}}
        res = collection_w.search(binary_vectors[:nq], "binary_vector",
                                  search_params, default_limit, "int64 >= 0",
                                  _async=_async,
                                  travel_timestamp=time_stamp,
                                  check_task=CheckTasks.check_search_results,
                                  check_items={"nq": nq,
                                               "ids": insert_ids,
                                               "limit": 2,
                                               "_async": _async})[0]
        if _async:
            res.done()
            res = res.result()
        assert abs(res[0].distances[0] - min(distance_0, distance_1)) <= epsilon

Milvus Log

No response

Anything else?

No response

binbinlv commented 2 years ago

Discussed with @yhmo: the superstructure and substructure metrics have the following characteristics: (1) the number of results returned (topK) depends on the dimension (dim) of the data; (2) the search returns fewer results than the configured limit when dim is large; (3) they do not support the "BIN_IVF_FLAT" index.

With dim=8, the search works as expected.
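
One plausible way to see why characteristics (1) and (2) appear is a rough probability estimate. The sketch below is not taken from the Milvus code; it assumes the test vectors are independent, uniformly random bits and that a "match" means one vector's set bits are all contained in the other's. The function names are hypothetical.

```python
def containment_probability(dim: int) -> float:
    """Chance that a random dim-bit vector a is a bitwise subset of a random vector b.

    Assumes every bit is independent and uniformly random. Per bit, only the
    combination (a=1, b=0) breaks containment, so 3 of the 4 combinations pass.
    """
    return 0.75 ** dim


def expected_matches(num_entities: int, dim: int) -> float:
    """Expected number of stored vectors that match a single random query."""
    return num_entities * containment_probability(dim)


if __name__ == "__main__":
    # num_entities=3000 is an illustrative value, not a figure from the test.
    for dim in (8, 32, 128):
        print(dim, containment_probability(dim), expected_matches(3000, dim))
```

Under these assumptions roughly 10% of random vectors match at dim=8, while at dim=128 (the 16-byte vectors in the first log) a match is practically impossible, which would be consistent with the empty results.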

binbinlv commented 2 years ago

@yhmo I also want to confirm one more thing: in the log below, the distances of all the returned vectors are 0. Is this expected for these metrics? Thanks.

[2022-07-15 10:31:23,148 - DEBUG - ci_test]: (api_request)  : [Collection.search] args: [[b'\x1e'], 'binary_vector', {'metric_type': 'SUPERSTRUCTURE', 'params': {'nprobe': 10}}, 10, 'int64 >= 0', None, None, 20, -1], kwargs: {'_async': False} (api_request.py:56)
[2022-07-15 10:31:23,268 - DEBUG - ci_test]: (api_response) : ["['(distance: 0.0, id: 434597479163442700)', '(distance: 0.0, id: 434597479163442745)', '(distance: 0.0, id: 434597479163442795)', '(distance: 0.0, id: 434597479163442859)', '(distance: 0.0, id: 434597479163442861)', '(distance: 0.0, id: 434597479163442971)', '(distance: 0.0, id: 434597479163442992......  (api_request.py:31)

longjiquan commented 2 years ago

/unassign
/assign @yhmo

yhmo commented 2 years ago
query_raw_vector

I think this behavior is expected. The superstructure/substructure metrics have only two states: true (matched) or false (unmatched). That means the result distance can take only two values, 0 or 1: we use 0 to represent matched and 1 to represent unmatched.
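
The two-valued distance described above can be sketched in plain Python. This is an illustration only, not the Milvus implementation: the helper names are hypothetical, and the direction of each containment check (which vector must contain which) is an assumption that should be verified against the Milvus metric documentation.

```python
def is_bit_subset(a: bytes, b: bytes) -> bool:
    """Return True if every bit set in a is also set in b."""
    if len(a) != len(b):
        raise ValueError("binary vectors must have the same length")
    return all((x & y) == x for x, y in zip(a, b))


def substructure_distance(query: bytes, stored: bytes) -> float:
    # Assumption: a match means the stored vector is contained in the query.
    return 0.0 if is_bit_subset(stored, query) else 1.0


def superstructure_distance(query: bytes, stored: bytes) -> float:
    # Assumption: a match means the stored vector contains the query.
    return 0.0 if is_bit_subset(query, stored) else 1.0


if __name__ == "__main__":
    query = b"\x1e"  # 0b00011110, the query vector from the second log
    print(superstructure_distance(query, b"\xff"))  # stored vector has all bits set
    print(superstructure_distance(query, b"\x0e"))  # stored vector lacks one query bit
```

Because unmatched vectors would all share the same distance of 1, a result list that contains only matches shows 0.0 for every hit, as in the log above.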

binbinlv commented 2 years ago

OK, this works as designed. Closing.