milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0

[Bug]: Using GPU_IVF_FLAT and L2 to search with the same parameters as IVF_FLAT after two or more data insertions returns different results #36609

Closed: qwevdb closed this issue 3 hours ago

qwevdb commented 6 hours ago

Is there an existing issue for this?

Environment

- Milvus version: milvus v2.4.12-gpu
- Deployment mode(standalone or cluster): standalone
- MQ type(rocksmq, pulsar or kafka): rocksmq   
- SDK version(e.g. pymilvus v2.0.0rc2): pymilvus v2.4.5
- OS(Ubuntu or CentOS): Ubuntu 24.04 LTS
- CPU/Memory: Intel Core i7-11700 / 64G
- GPU: NVIDIA GeForce RTX 4090
- Others:

Current Behavior

With the IVF_FLAT index and the L2 metric, the search results after two or more data insertions differ from those of the GPU_IVF_FLAT index with the L2 metric and the same parameters. If the data is inserted only once, or the two or more insertions are merged into one, the results are the same.

Expected Behavior

The IVF_FLAT and GPU_IVF_FLAT indexes, with the L2 metric and the same parameters, should produce the same results.
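To tell which of the two indexes deviates, an exact reference result helps. The following is a minimal sketch and not part of the original report: `exact_l2_topk` is a hypothetical helper that brute-forces the top-k in NumPy. It assumes the distance reported for the L2 metric is the squared Euclidean distance (worth checking against the Milvus metric documentation), and it ignores the scalar filter expression used in the search.

```python
import numpy as np

def exact_l2_topk(vectors, query, k):
    """Return (row_index, squared_l2_distance) pairs for the k nearest rows."""
    vectors = np.asarray(vectors, dtype=np.float32)
    query = np.asarray(query, dtype=np.float32)
    diffs = vectors - query                       # broadcast query over all rows
    dists = np.einsum("ij,ij->i", diffs, diffs)   # row-wise squared L2 distance
    order = np.argsort(dists, kind="stable")[:k]  # smallest distances first
    return [(int(i), float(dists[i])) for i in order]

# Tiny hand-checkable example: the origin is nearest to itself, then to (1, 0).
points = [[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]]
print(exact_l2_topk(points, [0.0, 0.0], k=2))  # [(0, 0.0), (1, 1.0)]
```

In the reproduction script the ids are assigned sequentially from 0 in insertion order, so the row index returned here would correspond to the primary key if the inserted vectors are stacked in the same order.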

Steps To Reproduce

  1. Create an IVF_FLAT index with L2 metric in the collection.
  2. Insert data into collection twice.
  3. Search
import time
from pymilvus import Collection, connections, FieldSchema, CollectionSchema, DataType, utility
import numpy as np

FLOAT_MAX = 5000
DATA_INT_MAX = 100
categories = ["green", "blue", "yellow", "red", "black", "white", "purple", "pink", "orange", "brown", "grey"] 

numpy_random = np.random.default_rng(87051)
alias = "bench"
collection_name = "Benchmark"
client = connections.connect(
    alias=alias,
    host="localhost",
    port="19530"
)
if utility.has_collection(collection_name, using=alias):
    collection = Collection(name=collection_name, using=alias)
    collection.drop()
    time.sleep(2)  

dim = 565
id = FieldSchema(name='id', dtype=DataType.INT64, is_primary=True)
vector = FieldSchema(name='vector', dtype=DataType.FLOAT_VECTOR, dim=dim)
field_1 = FieldSchema(name='field_1', dtype=DataType.INT64)
field_2 = FieldSchema(name='field_2', dtype=DataType.VARCHAR, max_length=255)
field_3 = FieldSchema(name='field_3', dtype=DataType.INT64)
field_4 = FieldSchema(name='field_4', dtype=DataType.VARCHAR, max_length=255)
fields = [id, vector, field_1, field_2, field_3, field_4]
schema = CollectionSchema(fields=fields, description=alias)
collection = Collection(
    name=collection_name,
    schema=schema,
    using=alias,
)
index_params = {'index_type': 'IVF_FLAT', 'params': {'nlist': 44678, 'max_empty_result_buckets': 59598}, 'metric_type': 'L2'}
# index_params = {'index_type': 'GPU_IVF_FLAT', 'params': {'nlist': 44678, 'max_empty_result_buckets': 59598}, 'metric_type': 'L2'}
collection.create_index("vector", index_params, timeout=100)

dataset = []
number = 2144
for i in range(0,number + 0):
    vector = numpy_random.random((1, dim))
    data = {
        'id': i,
        'vector': list(vector[0]),
        'field_1': numpy_random.integers(-DATA_INT_MAX, DATA_INT_MAX),
        'field_2': numpy_random.choice(categories),
        'field_3': numpy_random.integers(-DATA_INT_MAX, DATA_INT_MAX),
        'field_4': numpy_random.choice(categories)
    }
    dataset.append(data)
collection.insert(dataset)
collection.flush()
collection.load()

dataset = []
number = 24
for i in range(2144,number + 2144):
    vector = numpy_random.random((1, dim))
    data = {
        'id': i,
        'vector': list(vector[0]),
        'field_1': numpy_random.integers(-DATA_INT_MAX, DATA_INT_MAX),
        'field_2': numpy_random.choice(categories),
        'field_3': numpy_random.integers(-DATA_INT_MAX, DATA_INT_MAX),
        'field_4': numpy_random.choice(categories)
    }
    dataset.append(data)
collection.insert(dataset)
collection.flush()
collection.load()

vector = numpy_random.random((1, dim))
query_vector = list(vector[0])
iterator = collection.search_iterator(
       data=[query_vector],
       anns_field="vector",
       param={"metric_type": "L2", "params": {'nprobe': 38041}},
       limit=3,
       expr='((field_4 not in ["pink", "blue"] || field_3 != 94) || field_2 <= "pink")',
       batch_size=299,
       timeout=100
    )
res1 = []
while True:
    result = iterator.next()
    if not result:
        iterator.close()
        break

    res1.extend(result)
collection.release()
collection.drop_index()
collection.flush()
print(res1)
collection.drop()

result:

[id: 1243, distance: 78.61210632324219, entity: {}, id: 1032, distance: 79.29000854492188, entity: {}, id: 1823, distance: 79.78858947753906, entity: {}]
  4. Change 'index_type': 'IVF_FLAT' to 'index_type': 'GPU_IVF_FLAT' in index_params and run again.

result:

[id: 2150, distance: 85.07849884033203, entity: {}, id: 1781, distance: 85.67111206054688, entity: {}, id: 1986, distance: 86.02687072753906, entity: {}]
  5. If the data is inserted only once, or the two insertions are merged into one, the IVF_FLAT and GPU_IVF_FLAT indexes return the same results.

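The two result lists above can be compared mechanically. This is a small sketch, not code from the report: `compare_hits` is a hypothetical helper, and the hits are represented as plain `(id, distance)` tuples copied from the two printed results rather than as pymilvus hit objects.

```python
def compare_hits(hits_a, hits_b):
    """Summarize how two top-k lists of (id, distance) pairs differ."""
    ids_a = [i for i, _ in hits_a]
    ids_b = [i for i, _ in hits_b]
    return {
        "shared_ids": sorted(set(ids_a) & set(ids_b)),
        "only_in_a": [i for i in ids_a if i not in ids_b],
        "only_in_b": [i for i in ids_b if i not in ids_a],
        "best_a": min(d for _, d in hits_a),
        "worst_a": max(d for _, d in hits_a),
        "best_b": min(d for _, d in hits_b),
    }

# Values copied from the two results reported in this issue.
ivf_flat_hits = [(1243, 78.61210632324219), (1032, 79.29000854492188), (1823, 79.78858947753906)]
gpu_ivf_flat_hits = [(2150, 85.07849884033203), (1781, 85.67111206054688), (1986, 86.02687072753906)]

summary = compare_hits(ivf_flat_hits, gpu_ivf_flat_hits)
print(summary["shared_ids"])                    # ids present in both lists
print(summary["worst_a"] < summary["best_b"])   # is every IVF_FLAT hit closer?
```

With the reported values the two lists share no ids, and every IVF_FLAT distance is smaller than every GPU_IVF_FLAT distance. If both distance lists are accurate, the GPU_IVF_FLAT list therefore cannot be the true top 3. Note also that id 2150 belongs to the second insertion batch (ids 2144 to 2167).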
yanliang567 commented 3 hours ago

Duplicate of #36588 and #36608.

qwevdb commented 1 hour ago

Besides the difference in index and metric between this issue and issue #36608, there is one more small difference: the random seed has to be fixed at a specific value (87051) to trigger different results in this issue and issue #36610, whereas issues #36608 and #36607 do not require a specific seed.

xiaofan-luan commented 28 minutes ago

> Besides the difference in index and metric between this issue and issue #36608, there is one more small difference: the random seed has to be fixed at a specific value (87051) to trigger different results in this issue and issue #36610, whereas issues #36608 and #36607 do not require a specific seed.

We don't recommend using random vectors for testing.

One possible reason for the difference is the growing index and segment compaction.
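The comment does not say why random vectors are discouraged. One commonly cited reason, offered here as an assumption rather than as the maintainer's stated rationale, is that distances between uniformly random high-dimensional vectors concentrate around their mean, so top-k membership becomes sensitive to small differences in how an index is built or probed. The sketch below uses a hypothetical helper, `distance_spread`, with an arbitrary seed and the same vector count as the reproduction (2144 + 24) to compare the nearest squared L2 distance with the mean distance in 2 and in 565 dimensions.

```python
import numpy as np

def distance_spread(dim, count=2168, seed=0):
    """Return (min, mean, max) squared L2 distance from one random query."""
    rng = np.random.default_rng(seed)
    data = rng.random((count, dim))    # uniform in [0, 1), as in the report
    query = rng.random(dim)
    dists = ((data - query) ** 2).sum(axis=1)
    return float(dists.min()), float(dists.mean()), float(dists.max())

for dim in (2, 565):
    nearest, mean, farthest = distance_spread(dim)
    # A ratio close to 1 means the nearest neighbour is barely closer than average.
    print(dim, round(nearest / mean, 3))
```

In low dimensions the nearest neighbour is far closer than the average point, while at 565 dimensions the ratio moves much closer to 1, which is consistent with the narrow band of distances (roughly 78 to 86) seen in the reported results.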