milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0

[Bug]: Using GPU_IVF_PQ and IP to search with the same parameters as IVF_PQ after two or more data insertions brings different results #36608

Open qwevdb opened 6 hours ago

qwevdb commented 6 hours ago

Is there an existing issue for this?

Environment

- Milvus version: milvus v2.4.12-gpu
- Deployment mode(standalone or cluster): standalone
- MQ type(rocksmq, pulsar or kafka): rocksmq   
- SDK version(e.g. pymilvus v2.0.0rc2): pymilvus v2.4.5
- OS(Ubuntu or CentOS): Ubuntu 24.04 LTS
- CPU/Memory: Intel Core i7-11700 / 64G
- GPU: NVIDIA GeForce RTX 4090
- Others:

Current Behavior

After two or more data insertions, searching with the IVF_PQ index and the IP metric returns different results than searching with the GPU_IVF_PQ index and the same parameters. If the data is inserted only once, or the two insertions are merged into a single one, the results are identical.

Expected Behavior

The IVF_PQ and GPU_IVF_PQ indexes with the IP metric and the same parameters should produce the same results.

Steps To Reproduce

  1. Create an IVF_PQ index with IP metric in the collection.
  2. Insert data into collection twice.
  3. Search with the following script:
import time
from pymilvus import Collection, connections, FieldSchema, CollectionSchema, DataType, utility
import numpy as np

FLOAT_MAX = 5000
DATA_INT_MAX = 100
categories = ["green", "blue", "yellow", "red", "black", "white", "purple", "pink", "orange", "brown", "grey"] 

numpy_random = np.random.default_rng(0)
alias = "bench"
collection_name = "Benchmark"
# connections.connect registers the connection under the alias; it returns None
connections.connect(
    alias=alias,
    host="localhost",
    port="19530"
)
if utility.has_collection(collection_name, using=alias):
    collection = Collection(name=collection_name, using=alias)
    collection.drop()
    time.sleep(2)  

dim = 660
id = FieldSchema(name='id', dtype=DataType.INT64, is_primary=True)
vector = FieldSchema(name='vector', dtype=DataType.FLOAT_VECTOR, dim=dim)
field_1 = FieldSchema(name='field_1', dtype=DataType.INT64)
field_2 = FieldSchema(name='field_2', dtype=DataType.VARCHAR, max_length=255)
field_3 = FieldSchema(name='field_3', dtype=DataType.VARCHAR, max_length=255)
fields = [id, vector, field_1, field_2, field_3]
schema = CollectionSchema(fields=fields, description=alias)
collection = Collection(
    name=collection_name,
    schema=schema,
    using=alias,
)
index_params = {'index_type': 'IVF_PQ', 'params': {'nlist': 6949, 'm': 11, 'nbits': 13, 'max_empty_result_buckets': 36547}, 'metric_type': 'IP'}
# index_params = {'index_type': 'GPU_IVF_PQ', 'params': {'nlist': 6949, 'm': 11, 'nbits': 13, 'max_empty_result_buckets': 36547}, 'metric_type': 'IP'}
collection.create_index("vector", index_params, timeout=100)

# first insertion: ids 0..2305
dataset = []
number = 2306
for i in range(number):
    vector = numpy_random.uniform(-DATA_INT_MAX, DATA_INT_MAX, (1, dim))
    data = {
        'id': i,
        'vector': list(vector[0]),
        'field_1': numpy_random.integers(-DATA_INT_MAX, DATA_INT_MAX),
        'field_2': numpy_random.choice(categories),
        'field_3': numpy_random.choice(categories)
    }
    dataset.append(data)
collection.insert(dataset)
collection.flush()
collection.load()

# second insertion: ids 2306..3408
dataset = []
number = 1103
for i in range(2306, 2306 + number):
    vector = numpy_random.uniform(-DATA_INT_MAX, DATA_INT_MAX, (1, dim))
    data = {
        'id': i,
        'vector': list(vector[0]),
        'field_1': numpy_random.integers(-DATA_INT_MAX, DATA_INT_MAX),
        'field_2': numpy_random.choice(categories),
        'field_3': numpy_random.choice(categories)
    }
    dataset.append(data)
collection.insert(dataset)
collection.flush()
collection.load()

# query vector drawn from the same seeded generator
vector = numpy_random.uniform(-DATA_INT_MAX, DATA_INT_MAX, (1, dim))
query_vector = list(vector[0])
iterator = collection.search_iterator(
    data=[query_vector],
    anns_field="vector",
    param={"metric_type": "IP", "params": {'nprobe': 4718}},
    limit=7,
    expr='((field_2 == "grey" and field_3 not in ["orange"]) && field_3 in ["white", "white", "purple"])',
    batch_size=7772,
    timeout=100
)
res1 = []
while True:
    result = iterator.next()
    if not result:
        iterator.close()
        break

    res1.extend(result)
collection.release()
collection.drop_index()
collection.flush()
print(res1)
collection.drop()

result:

[id: 2837, distance: 152213.109375, entity: {}, id: 2747, distance: 100040.125, entity: {}, id: 2731, distance: 86614.109375, entity: {}, id: 2686, distance: 82579.1484375, entity: {}, id: 147, distance: 77792.6953125, entity: {}, id: 138, distance: 67749.7265625, entity: {}, id: 2516, distance: 64471.109375, entity: {}]
  4. Change 'index_type': 'IVF_PQ' to 'index_type': 'GPU_IVF_PQ' in index_params and run again.

result:

[id: 2837, distance: 152213.109375, entity: {}, id: 694, distance: 120093.8984375, entity: {}, id: 550, distance: 111660.0703125, entity: {}, id: 2747, distance: 100040.125, entity: {}, id: 2731, distance: 86614.109375, entity: {}, id: 2686, distance: 82579.1484375, entity: {}, id: 459, distance: 72341.109375, entity: {}]
  5. If the data is inserted only once, or the two insertions are merged into one, the IVF_PQ and GPU_IVF_PQ indexes return the same results.
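For reference, "merging the two insertions" means a single insert of the same 3409 rows. Because one `default_rng(0)` generator is consumed sequentially, the merged loop draws exactly the same values as the two separate loops, so only the segment layout differs, not the data. A minimal sketch of that equivalence (pure NumPy, non-vector fields omitted for brevity, no Milvus needed):

```python
import numpy as np

dim = 4

# two loops sharing one generator, as in the two-insert script above
rng = np.random.default_rng(0)
split = [rng.uniform(-100, 100, (1, dim))[0] for _ in range(3)]
split += [rng.uniform(-100, 100, (1, dim))[0] for _ in range(2)]

# one loop with a fresh generator and the same seed (the "merged" insert)
rng = np.random.default_rng(0)
merged = [rng.uniform(-100, 100, (1, dim))[0] for _ in range(5)]

# identical draws: the generator stream only depends on the seed and
# the call sequence, not on how the loop is split
assert all(np.array_equal(a, b) for a, b in zip(split, merged))
```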
yanliang567 commented 3 hours ago

Sounds reasonable if the data ends up in 2 or more segments. @Presburger please help to double check.

/assign @Presburger /unassign
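To illustrate why segments can matter here (my reading, not Milvus internals): merging exact per-segment top-k lists is lossless, so a flat index would agree across runs; with IVF_PQ, each segment's index is trained and searched approximately, so CPU and GPU implementations can rank borderline candidates differently per segment before the merge. A toy NumPy sketch of the lossless exact case, for contrast:

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.uniform(-1, 1, (500, 8))  # toy corpus
q = rng.uniform(-1, 1, 8)            # toy query
k = 7

scores = data @ q                                  # IP metric: larger is better
global_topk = set(np.argsort(scores)[::-1][:k])

# the corpus split across two "segments", as two inserts would leave it
segments = [np.arange(0, 300), np.arange(300, 500)]
candidates = []
for ids in segments:
    top = ids[np.argsort(scores[ids])[::-1][:k]]   # exact per-segment top-k
    candidates.extend(top)
merged_topk = set(sorted(candidates, key=lambda i: -scores[i])[:k])

# exact search merged per segment equals a single global search;
# divergence must therefore come from the approximate per-segment indexes
assert merged_topk == global_topk
```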

yanliang567 commented 3 hours ago

Let's track the IVF_ and GPU_IVF_ index types in this issue only.

liliu-z commented 1 hour ago

Hi @qwevdb, a clarification question: it sounds like you did insert random data -> search for both IVF_PQ and GPU_IVF_PQ. Since the data is random, how can we ensure these two indexes hold the same data?

qwevdb commented 53 minutes ago

> Hi @qwevdb, there is a clarification question: Sounds like you did insertion random data->search for both IVF_PQ and GPU_IVF_PQ. Since data is random, how can we ensure these two indexes have the same data?

I use the same random seed, which ensures the same random data is generated in both runs.
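Concretely: the script seeds with `np.random.default_rng(0)`, so the IVF_PQ run and the GPU_IVF_PQ run replay an identical stream, even with the interleaved `uniform`/`integers`/`choice` calls. A quick check (toy sizes, mirroring the script's row generation):

```python
import numpy as np

categories = ["green", "blue", "yellow"]

def make_rows(seed, n, dim):
    # mimics the repro script: interleaved uniform/integers/choice draws
    rng = np.random.default_rng(seed)
    rows = []
    for _ in range(n):
        rows.append((rng.uniform(-100, 100, (1, dim))[0],
                     int(rng.integers(-100, 100)),
                     str(rng.choice(categories))))
    return rows

run1 = make_rows(0, 5, 4)  # the IVF_PQ run
run2 = make_rows(0, 5, 4)  # the GPU_IVF_PQ run

# same seed, same call sequence -> identical rows in both runs
assert all(np.array_equal(a[0], b[0]) and a[1:] == b[1:]
           for a, b in zip(run1, run2))
```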