[Bug]: Searching with GPU_IVF_FLAT and L2 returns different results from IVF_FLAT with the same parameters after two or more data insertions

qwevdb commented 6 hours ago

Is there an existing issue for this?

[X] I have searched the existing issues

Environment

- Milvus version: milvus v2.4.12-gpu
- Deployment mode(standalone or cluster): standalone
- MQ type(rocksmq, pulsar or kafka): rocksmq   
- SDK version(e.g. pymilvus v2.0.0rc2): pymilvus v2.4.5
- OS(Ubuntu or CentOS): Ubuntu 24.04 LTS
- CPU/Memory: Intel Core i7-11700 / 64G
- GPU: NVIDIA GeForce RTX 4090
- Others:

Current Behavior

With the L2 metric, an IVF_FLAT index and a GPU_IVF_FLAT index created with the same parameters return different search results after two or more data insertions. If the data is inserted only once, or the separate insertions are merged into a single insert, the results are the same.

Expected Behavior

IVF_FLAT and GPU_IVF_FLAT indexes with the L2 metric and the same parameters should produce the same results.

Steps To Reproduce

Create an IVF_FLAT index with the L2 metric on the collection.
Insert data into the collection twice, flushing after each insert.
Search with search_iterator.

import time
from pymilvus import Collection, connections, FieldSchema, CollectionSchema, DataType, utility
import numpy as np

FLOAT_MAX = 5000
DATA_INT_MAX = 100
categories = ["green", "blue", "yellow", "red", "black", "white", "purple", "pink", "orange", "brown", "grey"] 

numpy_random = np.random.default_rng(87051)
alias = "bench"
collection_name = "Benchmark"
connections.connect(
    alias=alias,
    host="localhost",
    port="19530"
)
if utility.has_collection(collection_name, using=alias):
    collection = Collection(name=collection_name, using=alias)
    collection.drop()
    time.sleep(2)  

dim = 565
# Field objects get a *_field suffix so they don't shadow the builtin `id`
# or collide with the `vector` arrays generated below.
id_field = FieldSchema(name='id', dtype=DataType.INT64, is_primary=True)
vector_field = FieldSchema(name='vector', dtype=DataType.FLOAT_VECTOR, dim=dim)
field_1 = FieldSchema(name='field_1', dtype=DataType.INT64)
field_2 = FieldSchema(name='field_2', dtype=DataType.VARCHAR, max_length=255)
field_3 = FieldSchema(name='field_3', dtype=DataType.INT64)
field_4 = FieldSchema(name='field_4', dtype=DataType.VARCHAR, max_length=255)
fields = [id_field, vector_field, field_1, field_2, field_3, field_4]
schema = CollectionSchema(fields=fields, description=alias)
collection = Collection(
    name=collection_name,
    schema=schema,
    using=alias,
)
index_params = {'index_type': 'IVF_FLAT', 'params': {'nlist': 44678, 'max_empty_result_buckets': 59598}, 'metric_type': 'L2'}
# index_params = {'index_type': 'GPU_IVF_FLAT', 'params': {'nlist': 44678, 'max_empty_result_buckets': 59598}, 'metric_type': 'L2'}
collection.create_index("vector", index_params, timeout=100)

dataset = []
number = 2144
for i in range(number):
    vector = numpy_random.random((1, dim))
    data = {
        'id': i,
        'vector': list(vector[0]),
        'field_1': numpy_random.integers(-DATA_INT_MAX, DATA_INT_MAX),
        'field_2': numpy_random.choice(categories),
        'field_3': numpy_random.integers(-DATA_INT_MAX, DATA_INT_MAX),
        'field_4': numpy_random.choice(categories)
    }
    dataset.append(data)
collection.insert(dataset)
collection.flush()
collection.load()

dataset = []
number = 24
for i in range(2144, 2144 + number):
    vector = numpy_random.random((1, dim))
    data = {
        'id': i,
        'vector': list(vector[0]),
        'field_1': numpy_random.integers(-DATA_INT_MAX, DATA_INT_MAX),
        'field_2': numpy_random.choice(categories),
        'field_3': numpy_random.integers(-DATA_INT_MAX, DATA_INT_MAX),
        'field_4': numpy_random.choice(categories)
    }
    dataset.append(data)
collection.insert(dataset)
collection.flush()
collection.load()

vector = numpy_random.random((1, dim))
query_vector = list(vector[0])
iterator = collection.search_iterator(
    data=[query_vector],
    anns_field="vector",
    param={"metric_type": "L2", "params": {'nprobe': 38041}},
    limit=3,
    expr='((field_4 not in ["pink", "blue"] || field_3 != 94) || field_2 <= "pink")',
    batch_size=299,
    timeout=100,
)
res1 = []
while True:
    result = iterator.next()
    if not result:
        iterator.close()
        break

    res1.extend(result)
collection.release()
collection.drop_index()
collection.flush()
print(res1)
collection.drop()

result:

[id: 1243, distance: 78.61210632324219, entity: {}, id: 1032, distance: 79.29000854492188, entity: {}, id: 1823, distance: 79.78858947753906, entity: {}]

Change 'index_type': 'IVF_FLAT' to 'index_type': 'GPU_IVF_FLAT' in index_params and run again.

result:

[id: 2150, distance: 85.07849884033203, entity: {}, id: 1781, distance: 85.67111206054688, entity: {}, id: 1986, distance: 86.02687072753906, entity: {}]

If the data is inserted only once, or the two insertions are merged into one, IVF_FLAT and GPU_IVF_FLAT return the same results.
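
One way to switch between the failing pattern and the merged pattern without touching the data is to drive the insert loops from a list of batch sizes. This is a minimal sketch; `batch_ranges`, `TWO_INSERTS` and `ONE_INSERT` are hypothetical names introduced here, and the sizes are taken from the script above.

```python
def batch_ranges(batch_sizes):
    """Yield (start_id, stop_id) for consecutive insert batches, stop exclusive."""
    start = 0
    for size in batch_sizes:
        yield start, start + size
        start += size


TWO_INSERTS = [2144, 24]  # the pattern from the script above (results differ)
ONE_INSERT = [2168]       # the same 2168 rows in a single insert

for start, stop in batch_ranges(TWO_INSERTS):
    # In the real script: build rows for range(start, stop), then insert and flush.
    print(f"insert ids {start}..{stop - 1}")
```

Because the script draws every row in ID order from the same seeded generator, both patterns should insert identical data as long as the per-row order of random draws is kept, so only the number of insert and flush calls differs.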

yanliang567 commented 3 hours ago

dup to #36588 36608

qwevdb commented 1 hour ago

Besides the different index and metric, there is one small difference between this issue and issue #36608: the random seed has to be fixed at a specific value (87051) to trigger different results in this issue and in issue #36610, whereas issues #36608 and #36607 do not require a fixed seed.

xiaofan-luan commented 28 minutes ago

> Besides the different index and metric, there is one small difference between this issue and issue #36608: the random seed has to be fixed at a specific value (87051) to trigger different results in this issue and in issue #36610, whereas issues #36608 and #36607 do not require a fixed seed.

We don't recommend using random vectors for testing.

One possible reason for the difference is the growing index and segment compaction.
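
Whatever the cause, an exhaustive search over the same vectors gives a reference answer that does not depend on index type or segment state. The sketch below is an assumption-laden illustration: `exact_l2_topk` is a hypothetical helper, it assumes the reported L2 distance is the squared Euclidean distance (the thread does not state this), it assumes IDs equal list positions as in the script, and it ignores the filter expression.

```python
import heapq


def exact_l2_topk(vectors, query, k):
    """Return the k nearest (id, squared_l2) pairs by exhaustive search."""
    def squared_l2(v):
        return sum((a - b) ** 2 for a, b in zip(v, query))

    # IDs are taken to be list positions, as in the reproduction script.
    scored = ((i, squared_l2(v)) for i, v in enumerate(vectors))
    return heapq.nsmallest(k, scored, key=lambda pair: pair[1])


# Tiny illustrative data, not the dataset from the issue.
vectors = [[0.0, 0.0], [1.0, 1.0], [3.0, 4.0], [0.5, 0.0]]
print(exact_l2_topk(vectors, query=[0.0, 0.0], k=2))
```

Running this over the vectors and query generated by the script (before they are sent to Milvus) would show which of the two result lists, if either, matches the exact nearest neighbors.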
