milvus-io / milvus


[Bug]: [new_indexes] The searched results become less than limit * group_size after creating the new HNSW indexes after groupby search with group size #37601

Closed binbinlv closed 2 days ago

binbinlv commented 4 days ago

Is there an existing issue for this?

Environment

- Milvus version: master latest
- Deployment mode(standalone or cluster): both
- MQ type(rocksmq, pulsar or kafka): all
- SDK version(e.g. pymilvus v2.0.0rc2): pymilvus latest
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

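With group_strict_size=True, a group-by search with a group size returns fewer than limit * group_size entities after the new HNSW indexes are created. In the run below, a query returned 63 entities instead of the expected 250 (limit = 50, group_size = 5).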


    [pytest : test] self = <test_mix_scenes.TestGroupSearchNewHNSWIndex object at 0x7fd6a7bfc760>
    [pytest : test] group_by_field = 'VARCHAR'
    [pytest : test]
    [pytest : test]     @pytest.mark.tags(CaseLabel.L0)
    [pytest : test]     @pytest.mark.parametrize("group_by_field", [DataType.VARCHAR.name, "varchar_inverted"])
    [pytest : test]     def test_search_group_size_new_hnsw_index(self, group_by_field):
    [pytest : test]         """
    [pytest : test]         target:
    [pytest : test]             1. search on 4 different float vector fields with group by varchar field with group size
    [pytest : test]         verify results entity = limit * group_size  and group size is full if group_strict_size is True
    [pytest : test]         verify results group counts = limit if group_strict_size is False
    [pytest : test]         """
    [pytest : test]         nq = 2
    [pytest : test]         limit = 50
    [pytest : test]         group_size = 5
    [pytest : test]         for j in range(len(self.vector_fields)):
    [pytest : test]             search_vectors = cf.gen_vectors(nq, dim=self.dims[j], vector_data_type=self.vector_fields[j])
    [pytest : test]             search_params = {"params": cf.get_search_params_params(self.index_types[j])}
    [pytest : test]             # when group_strict_size=true, it shall return results with entities = limit * group_size
    [pytest : test]             res1 = self.collection_wrap.search(data=search_vectors, anns_field=self.vector_fields[j],
    [pytest : test]                                                param=search_params, limit=limit,
    [pytest : test]                                                group_by_field=group_by_field,
    [pytest : test]                                                group_size=group_size, group_strict_size=True,
    [pytest : test]                                                output_fields=[group_by_field])[0]
    [pytest : test]             for i in range(nq):
    [pytest : test] >               assert len(res1[i]) == limit * group_size
    [pytest : test] E               assert 63 == (50 * 5)

Expected Behavior

With group_strict_size=True, each query should return exactly limit * group_size entities.

Steps To Reproduce

    @pytest.mark.tags(CaseLabel.L0)
    @pytest.mark.parametrize("group_by_field", [DataType.VARCHAR.name, "varchar_inverted"])
    def test_search_group_size_new_hnsw_index(self, group_by_field):
        """
        target:
            1. search on 4 different float vector fields with group by varchar field with group size
        verify results entity = limit * group_size  and group size is full if group_strict_size is True
        verify results group counts = limit if group_strict_size is False
        """
        nq = 2
        limit = 50
        group_size = 5
        for j in range(len(self.vector_fields)):
            search_vectors = cf.gen_vectors(nq, dim=self.dims[j], vector_data_type=self.vector_fields[j])
            search_params = {"params": cf.get_search_params_params(self.index_types[j])}
            # when group_strict_size=true, it shall return results with entities = limit * group_size
            res1 = self.collection_wrap.search(data=search_vectors, anns_field=self.vector_fields[j],
                                               param=search_params, limit=limit,
                                               group_by_field=group_by_field,
                                               group_size=group_size, group_strict_size=True,
                                               output_fields=[group_by_field])[0]
            for i in range(nq):
                assert len(res1[i]) == limit * group_size
                for l in range(limit):
                    group_values = []
                    for k in range(group_size):
                        group_values.append(res1[i][l*group_size+k].fields.get(group_by_field))
                    assert len(set(group_values)) == 1

            # when group_strict_size=false, it shall return results with group counts = limit
            res1 = self.collection_wrap.search(data=search_vectors, anns_field=self.vector_fields[j],
                                               param=search_params, limit=limit,
                                               group_by_field=group_by_field,
                                               group_size=group_size, group_strict_size=False,
                                               output_fields=[group_by_field])[0]
            for i in range(nq):
                group_values = []
                for l in range(len(res1[i])):
                    group_values.append(res1[i][l].fields.get(group_by_field))
                assert len(set(group_values)) == limit
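
The strict-size assertions in the test above come down to two checks: the hit count equals limit * group_size, and every consecutive run of group_size hits shares one group value. A minimal sketch of that check as a standalone helper, working on a plain list of group values so it needs no Milvus connection. The name `is_strict_group_result` is hypothetical and not part of the test suite or pymilvus.

```python
from typing import Hashable, Sequence


def is_strict_group_result(group_values: Sequence[Hashable],
                           limit: int, group_size: int) -> bool:
    """Return True if the hits form `limit` full groups of `group_size`.

    `group_values` holds the group-by field value of each hit, in the
    order the search returned them (hypothetical input shape).
    """
    # Check 1: the total number of hits is limit * group_size.
    if len(group_values) != limit * group_size:
        return False
    # Check 2: each consecutive chunk of group_size hits has one group value.
    for g in range(limit):
        chunk = group_values[g * group_size:(g + 1) * group_size]
        if len(set(chunk)) != 1:
            return False
    return True
```

With the numbers from this report (limit = 50, group_size = 5), a list of 63 values fails the first check, which matches the failing assertion in the log.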

Milvus Log

test log: https://jenkins.milvus.io:18080/blue/organizations/jenkins/Milvus%20HA%20CI/detail/PR-37136/14/pipeline/

milvus log: artifacts-milvus-standalone-ms-37136-14-py-pr-37136-14-e2e-logs.tar.gz

Anything else?

collection name: TestGroupSearchNewHNSWIndex_GflxWGH7

index on the searched field: {'index_type': 'FAISS_HNSW_SQ', 'params': {'sq_type': 'SQ8'}, 'metric_type': 'IP'}


foxspy commented 4 days ago

/assign

yanliang567 commented 2 days ago

This could be caused by the search parameter being renamed from group_strict_size to strict_group_size.

binbinlv commented 2 days ago

Verified and fixed:

After changing "group_strict_size" to "strict_group_size" in the test, it passes.

milvus: master-20241114-1304b405-amd64

pymilvus: 2.5.0rc119
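
For anyone hitting the same failure, here is a minimal sketch of the grouping arguments with the renamed key. `group_search_kwargs` is a hypothetical helper, not part of pymilvus or the test suite. It only builds the keyword arguments, and the commented call mirrors the test in this issue and assumes a live collection.

```python
def group_search_kwargs(group_by_field: str, group_size: int,
                        strict: bool = True) -> dict:
    """Build the grouping kwargs using the renamed key `strict_group_size`."""
    return {
        "group_by_field": group_by_field,
        "group_size": group_size,
        # Renamed key; the test in this issue passed group_strict_size.
        "strict_group_size": strict,
    }


kwargs = group_search_kwargs("VARCHAR", group_size=5)

# Assumed usage inside the test (requires a running Milvus instance):
# res1 = self.collection_wrap.search(data=search_vectors,
#                                    anns_field=self.vector_fields[j],
#                                    param=search_params, limit=limit,
#                                    output_fields=[group_by_field],
#                                    **kwargs)[0]
```

Keeping the key name in one place means a later rename touches one line rather than every search call in the test file.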