milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0

[Bug]: Using GPU_IVF_PQ and L2 to search with the same parameters as IVF_PQ after two data insertions brings different results #36610

Closed: qwevdb closed this 3 hours ago

qwevdb commented 6 hours ago

Is there an existing issue for this?

Environment

- Milvus version: milvus v2.4.12-gpu
- Deployment mode(standalone or cluster): standalone
- MQ type(rocksmq, pulsar or kafka): rocksmq   
- SDK version(e.g. pymilvus v2.0.0rc2): pymilvus v2.4.5
- OS(Ubuntu or CentOS): Ubuntu 24.04 LTS
- CPU/Memory: Intel Core i7-11700 / 64G
- GPU: NVIDIA GeForce RTX 4090
- Others:

Current Behavior

Searching with an IVF_PQ index and the L2 metric after two data insertions returns different results than searching with a GPU_IVF_PQ index and the same parameters. If the data is inserted only once, or the two insertions are merged into one, the results are the same.
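
For reference, the two top-9 result lists shown in the steps below do not share a single id. A quick check (not part of the reproduction script; the ids are copied verbatim from the two outputs below):

ivf_pq_ids = [71, 136, 350, 370, 676, 825, 835, 926, 987]
gpu_ivf_pq_ids = [2131, 3474, 236, 2769, 541, 3222, 471, 3384, 3281]
print(set(ivf_pq_ids) & set(gpu_ivf_pq_ids))  # prints set(): no ids in common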

Expected Behavior

IVF_PQ and GPU_IVF_PQ indexes with the L2 metric and the same parameters should produce the same results.
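
Stated as a check (a sketch, not from the report; the arguments stand for the top-k id lists returned by the script below with each index type):

def same_topk(ids_ivf_pq, ids_gpu_ivf_pq):
    # with identical data and index parameters, both index types should return the same top-k ids
    return list(ids_ivf_pq) == list(ids_gpu_ivf_pq)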

Steps To Reproduce

  1. Create an IVF_PQ index with the L2 metric on the vector field of the collection.
  2. Insert data into the collection twice.
  3. Search with search_iterator using the script below:
import time
from pymilvus import Collection, connections, FieldSchema, CollectionSchema, DataType, utility
import numpy as np

FLOAT_MAX = 5000
DATA_INT_MAX = 100
categories = ["green", "blue", "yellow", "red", "black", "white", "purple", "pink", "orange", "brown", "grey"] 

numpy_random = np.random.default_rng(90878)
alias = "bench"
collection_name = "Benchmark"
# register the connection under the given alias
connections.connect(
    alias=alias,
    host="localhost",
    port="19530"
)
if utility.has_collection(collection_name, using=alias):
    collection = Collection(name=collection_name, using=alias)
    collection.drop()
    time.sleep(2)  

dim = 275
id = FieldSchema(name='id', dtype=DataType.INT64, is_primary=True)
vector = FieldSchema(name='vector', dtype=DataType.FLOAT_VECTOR, dim=dim)
field_1 = FieldSchema(name='field_1', dtype=DataType.VARCHAR, max_length=255)
field_2 = FieldSchema(name='field_2', dtype=DataType.INT64)
field_3 = FieldSchema(name='field_3', dtype=DataType.VARCHAR, max_length=255)
field_4 = FieldSchema(name='field_4', dtype=DataType.VARCHAR, max_length=255)
fields = [id, vector, field_1, field_2, field_3, field_4]
schema = CollectionSchema(fields=fields, description=alias)
collection = Collection(
    name=collection_name,
    schema=schema,
    using=alias,
)
index_params = {'index_type': 'IVF_PQ', 'params': {'nlist': 43759, 'm': 5, 'nbits': 64, 'max_empty_result_buckets': 3860}, 'metric_type': 'L2'}
# index_params = {'index_type': 'GPU_IVF_PQ', 'params': {'nlist': 43759, 'm': 5, 'nbits': 64, 'max_empty_result_buckets': 3860}, 'metric_type': 'L2'}
collection.create_index("vector", index_params, timeout=100)

# first insertion: 2382 rows with ids 0..2381
dataset = []
number = 2382
for i in range(number):
    vector = numpy_random.integers(-DATA_INT_MAX, DATA_INT_MAX, (1, dim))
    data = {
        'id': i,
        'vector': list(vector[0]),
        'field_1': numpy_random.choice(categories),
        'field_2': numpy_random.integers(-DATA_INT_MAX, DATA_INT_MAX),
        'field_3': numpy_random.choice(categories),
        'field_4': numpy_random.choice(categories)
    }
    dataset.append(data)
collection.insert(dataset)
collection.flush()
collection.load()

# second insertion: 1902 more rows with ids 2382..4283
dataset = []
number = 1902
for i in range(2382, 2382 + number):
    vector = numpy_random.integers(-DATA_INT_MAX, DATA_INT_MAX, (1, dim))
    data = {
        'id': i,
        'vector': list(vector[0]),
        'field_1': numpy_random.choice(categories),
        'field_2': numpy_random.integers(-DATA_INT_MAX, DATA_INT_MAX),
        'field_3': numpy_random.choice(categories),
        'field_4': numpy_random.choice(categories)
    }
    dataset.append(data)
collection.insert(dataset)
collection.flush()
collection.load()

# draw a query vector and search through the iterator API
vector = numpy_random.integers(-DATA_INT_MAX, DATA_INT_MAX, (1, dim))
query_vector = list(vector[0])
iterator = collection.search_iterator(
    data=[query_vector],
    anns_field="vector",
    param={"metric_type": "L2", "params": {'nprobe': 34569}},
    limit=9,
    expr='(field_4 not in ["grey", "grey", "brown", "purple", "purple"] || (field_1 == "pink" && field_3 == "brown"))',
    batch_size=478,
    timeout=100
)
res1 = []
# drain all result batches from the iterator
while True:
    result = iterator.next()
    if not result:
        iterator.close()
        break

    res1.extend(result)
collection.release()
collection.drop_index()
collection.flush()
print(res1)
collection.drop()

result:

[id: 71, distance: 910919.1875, entity: {}, id: 136, distance: 910919.1875, entity: {}, id: 350, distance: 910919.1875, entity: {}, id: 370, distance: 910919.1875, entity: {}, id: 676, distance: 910919.1875, entity: {}, id: 825, distance: 910919.1875, entity: {}, id: 835, distance: 910919.1875, entity: {}, id: 926, distance: 910919.1875, entity: {}, id: 987, distance: 910919.1875, entity: {}]
  4. Change 'index_type': 'IVF_PQ' to 'index_type': 'GPU_IVF_PQ' in index_params and run the script again.

result:

[id: 2131, distance: 1420920.0, entity: {}, id: 3474, distance: 1459279.0, entity: {}, id: 236, distance: 1465082.0, entity: {}, id: 2769, distance: 1466002.0, entity: {}, id: 541, distance: 1466613.0, entity: {}, id: 3222, distance: 1472281.0, entity: {}, id: 471, distance: 1473400.0, entity: {}, id: 3384, distance: 1479633.0, entity: {}, id: 3281, distance: 1498501.0, entity: {}]
  5. If the data is inserted only once, or the two insertions are merged into a single insert, IVF_PQ and GPU_IVF_PQ return the same results (a sketch of the merged variant follows below).
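
As a sketch of the merged-insertion variant from step 5, reusing the schema, RNG, and constants from the script above (total is just a local name introduced here), the two insert/flush/load blocks would collapse into one batch:

# merged single insertion replacing the two insert/flush/load blocks above
dataset = []
total = 2382 + 1902
for i in range(total):
    vector = numpy_random.integers(-DATA_INT_MAX, DATA_INT_MAX, (1, dim))
    data = {
        'id': i,
        'vector': list(vector[0]),
        'field_1': numpy_random.choice(categories),
        'field_2': numpy_random.integers(-DATA_INT_MAX, DATA_INT_MAX),
        'field_3': numpy_random.choice(categories),
        'field_4': numpy_random.choice(categories)
    }
    dataset.append(data)
collection.insert(dataset)
collection.flush()
collection.load()
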
yanliang567 commented 3 hours ago

let's track this in #36608