milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0

[Bug]: Using GPU_IVF_PQ and IP to search with the same parameters as IVF_PQ after two or more data insertions brings different results #36608

Open qwevdb opened 6 hours ago

qwevdb commented 6 hours ago

Is there an existing issue for this?

Environment

- Milvus version: milvus v2.4.12-gpu
- Deployment mode(standalone or cluster): standalone
- MQ type(rocksmq, pulsar or kafka): rocksmq   
- SDK version(e.g. pymilvus v2.0.0rc2): pymilvus v2.4.5
- OS(Ubuntu or CentOS): Ubuntu 24.04 LTS
- CPU/Memory: Intel Core i7-11700 / 64G
- GPU: NVIDIA GeForce RTX 4090
- Others:

Current Behavior

After two or more data insertions, searching with the IVF_PQ index and the IP metric returns different results than searching with the GPU_IVF_PQ index and the same parameters. If the data is inserted only once, or the two insertions are merged into a single one, the results are identical.

Expected Behavior

The IVF_PQ and GPU_IVF_PQ indexes with the IP metric and the same parameters should produce the same results.

Steps To Reproduce

  1. Create an IVF_PQ index with IP metric in the collection.
  2. Insert data into collection twice.
  3. Search with the following script:
import time
from pymilvus import Collection, connections, FieldSchema, CollectionSchema, DataType, utility
import numpy as np

FLOAT_MAX = 5000
DATA_INT_MAX = 100
categories = ["green", "blue", "yellow", "red", "black", "white", "purple", "pink", "orange", "brown", "grey"] 

numpy_random = np.random.default_rng(0)
alias = "bench"
collection_name = "Benchmark"
# connections.connect registers the connection under the alias; it returns None
connections.connect(
    alias=alias,
    host="localhost",
    port="19530"
)
if utility.has_collection(collection_name, using=alias):
    collection = Collection(name=collection_name, using=alias)
    collection.drop()
    time.sleep(2)  

dim = 660
id = FieldSchema(name='id', dtype=DataType.INT64, is_primary=True)
vector = FieldSchema(name='vector', dtype=DataType.FLOAT_VECTOR, dim=dim)
field_1 = FieldSchema(name='field_1', dtype=DataType.INT64)
field_2 = FieldSchema(name='field_2', dtype=DataType.VARCHAR, max_length=255)
field_3 = FieldSchema(name='field_3', dtype=DataType.VARCHAR, max_length=255)
fields = [id, vector, field_1, field_2, field_3]
schema = CollectionSchema(fields=fields, description=alias)
collection = Collection(
    name=collection_name,
    schema=schema,
    using=alias,
)
index_params = {'index_type': 'IVF_PQ', 'params': {'nlist': 6949, 'm': 11, 'nbits': 13, 'max_empty_result_buckets': 36547}, 'metric_type': 'IP'}
# index_params = {'index_type': 'GPU_IVF_PQ', 'params': {'nlist': 6949, 'm': 11, 'nbits': 13, 'max_empty_result_buckets': 36547}, 'metric_type': 'IP'}
collection.create_index("vector", index_params, timeout=100)

# first insertion: ids 0..2305
dataset = []
number = 2306
for i in range(number):
    vector = numpy_random.uniform(-DATA_INT_MAX, DATA_INT_MAX, (1, dim))
    data = {
        'id': i,
        'vector': list(vector[0]),
        'field_1': numpy_random.integers(-DATA_INT_MAX, DATA_INT_MAX),
        'field_2': numpy_random.choice(categories),
        'field_3': numpy_random.choice(categories)
    }
    dataset.append(data)
collection.insert(dataset)
collection.flush()
collection.load()

# second insertion: ids 2306..3408
dataset = []
number = 1103
for i in range(2306, 2306 + number):
    vector = numpy_random.uniform(-DATA_INT_MAX, DATA_INT_MAX, (1, dim))
    data = {
        'id': i,
        'vector': list(vector[0]),
        'field_1': numpy_random.integers(-DATA_INT_MAX, DATA_INT_MAX),
        'field_2': numpy_random.choice(categories),
        'field_3': numpy_random.choice(categories)
    }
    dataset.append(data)
collection.insert(dataset)
collection.flush()
collection.load()

# query vector drawn from the same seeded generator
vector = numpy_random.uniform(-DATA_INT_MAX, DATA_INT_MAX, (1, dim))
query_vector = list(vector[0])
iterator = collection.search_iterator(
    data=[query_vector],
    anns_field="vector",
    param={"metric_type": "IP", "params": {'nprobe': 4718}},
    limit=7,
    expr='((field_2 == "grey" and field_3 not in ["orange"]) && field_3 in ["white", "white", "purple"])',
    batch_size=7772,
    timeout=100
)
res1 = []
while True:
    result = iterator.next()
    if not result:
        iterator.close()
        break

    res1.extend(result)
collection.release()
collection.drop_index()
collection.flush()
print(res1)
collection.drop()

result:

[id: 2837, distance: 152213.109375, entity: {}, id: 2747, distance: 100040.125, entity: {}, id: 2731, distance: 86614.109375, entity: {}, id: 2686, distance: 82579.1484375, entity: {}, id: 147, distance: 77792.6953125, entity: {}, id: 138, distance: 67749.7265625, entity: {}, id: 2516, distance: 64471.109375, entity: {}]
  4. Change 'index_type': 'IVF_PQ' to 'index_type': 'GPU_IVF_PQ' in index_params and run again.

result:

[id: 2837, distance: 152213.109375, entity: {}, id: 694, distance: 120093.8984375, entity: {}, id: 550, distance: 111660.0703125, entity: {}, id: 2747, distance: 100040.125, entity: {}, id: 2731, distance: 86614.109375, entity: {}, id: 2686, distance: 82579.1484375, entity: {}, id: 459, distance: 72341.109375, entity: {}]
  5. If the data is inserted only once, or the two insertions are merged into one, the IVF_PQ and GPU_IVF_PQ indexes return the same results.
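For reference, "merging the two insertions" means a single insert of the same 3409 rows. Because one `default_rng(0)` generator is consumed sequentially, the merged loop draws exactly the same values as the two separate loops, so only the segment layout differs, not the data. A minimal sketch of that equivalence (pure NumPy, non-vector fields omitted for brevity, no Milvus needed):

```python
import numpy as np

dim = 4

# two loops sharing one generator, as in the two-insert script above
rng = np.random.default_rng(0)
split = [rng.uniform(-100, 100, (1, dim))[0] for _ in range(3)]
split += [rng.uniform(-100, 100, (1, dim))[0] for _ in range(2)]

# one loop with a fresh generator and the same seed (the "merged" insert)
rng = np.random.default_rng(0)
merged = [rng.uniform(-100, 100, (1, dim))[0] for _ in range(5)]

# identical draws: the generator stream only depends on the seed and
# the call sequence, not on how the loop is split
assert all(np.array_equal(a, b) for a, b in zip(split, merged))
```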
yanliang567 commented 3 hours ago

Sounds reasonable if the data ends up in 2 or more segments. @Presburger please help to double check.

/assign @Presburger /unassign
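To illustrate why segments can matter here (my reading, not Milvus internals): merging exact per-segment top-k lists is lossless, so a flat index would agree across runs; with IVF_PQ, each segment's index is trained and searched approximately, so CPU and GPU implementations can rank borderline candidates differently per segment before the merge. A toy NumPy sketch of the lossless exact case, for contrast:

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.uniform(-1, 1, (500, 8))  # toy corpus
q = rng.uniform(-1, 1, 8)            # toy query
k = 7

scores = data @ q                                  # IP metric: larger is better
global_topk = set(np.argsort(scores)[::-1][:k])

# the corpus split across two "segments", as two inserts would leave it
segments = [np.arange(0, 300), np.arange(300, 500)]
candidates = []
for ids in segments:
    top = ids[np.argsort(scores[ids])[::-1][:k]]   # exact per-segment top-k
    candidates.extend(top)
merged_topk = set(sorted(candidates, key=lambda i: -scores[i])[:k])

# exact search merged per segment equals a single global search;
# divergence must therefore come from the approximate per-segment indexes
assert merged_topk == global_topk
```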

yanliang567 commented 3 hours ago

Let's track the IVF_ and GPU_IVF_ index types in this issue only.

liliu-z commented 1 hour ago

Hi @qwevdb, a clarification question: it sounds like you did insert random data -> search for both IVF_PQ and GPU_IVF_PQ. Since the data is random, how can we ensure these two indexes hold the same data?

qwevdb commented 53 minutes ago

> Hi @qwevdb, there is a clarification question: Sounds like you did insertion random data->search for both IVF_PQ and GPU_IVF_PQ. Since data is random, how can we ensure these two indexes have the same data?

I use the same random seed, which ensures the same random data is generated in both runs.
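Concretely: the script seeds with `np.random.default_rng(0)`, so the IVF_PQ run and the GPU_IVF_PQ run replay an identical stream, even with the interleaved `uniform`/`integers`/`choice` calls. A quick check (toy sizes, mirroring the script's row generation):

```python
import numpy as np

categories = ["green", "blue", "yellow"]

def make_rows(seed, n, dim):
    # mimics the repro script: interleaved uniform/integers/choice draws
    rng = np.random.default_rng(seed)
    rows = []
    for _ in range(n):
        rows.append((rng.uniform(-100, 100, (1, dim))[0],
                     int(rng.integers(-100, 100)),
                     str(rng.choice(categories))))
    return rows

run1 = make_rows(0, 5, 4)  # the IVF_PQ run
run2 = make_rows(0, 5, 4)  # the GPU_IVF_PQ run

# same seed, same call sequence -> identical rows in both runs
assert all(np.array_equal(a[0], b[0]) and a[1:] == b[1:]
           for a, b in zip(run1, run2))
```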