milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0

[Bug]: Using GPU_IVF_FLAT and IP to search with the same parameters as IVF_FLAT after two data insertions brings different results #36607

Closed qwevdb closed 1 month ago

qwevdb commented 1 month ago

Is there an existing issue for this?

Environment

- Milvus version: milvus v2.4.12-gpu
- Deployment mode(standalone or cluster): standalone
- MQ type(rocksmq, pulsar or kafka): rocksmq   
- SDK version(e.g. pymilvus v2.0.0rc2): pymilvus v2.4.5
- OS(Ubuntu or CentOS): Ubuntu 24.04 LTS
- CPU/Memory: Intel Core i7-11700 / 64G
- GPU: NVIDIA GeForce RTX 4090
- Others:

Current Behavior

The result of searching with an IVF_FLAT index and the IP metric after two data insertions differs from searching with a GPU_IVF_FLAT index, the IP metric, and the same parameters. If the data is inserted only once, or the two insertions are merged into one, the results are identical.

Expected Behavior

Both the IVF_FLAT and GPU_IVF_FLAT indexes with the IP metric and the same parameters should produce the same results.

Steps To Reproduce

  1. Create an IVF_FLAT index with IP metric in the collection.
  2. Insert data into the collection twice.
  3. Search with the following script:
import time
from pymilvus import Collection, connections, FieldSchema, CollectionSchema, DataType, utility
import numpy as np

FLOAT_MAX = 5000
DATA_INT_MAX = 100
categories = ["green", "blue", "yellow", "red", "black", "white", "purple", "pink", "orange", "brown", "grey"] 

numpy_random = np.random.default_rng(0)
alias = "bench"
collection_name = "Benchmark"
# connections.connect() registers the connection under the alias; it returns None
connections.connect(
    alias=alias,
    host="localhost",
    port="19530"
)
if utility.has_collection(collection_name, using=alias):
    collection = Collection(name=collection_name, using=alias)
    collection.drop()
    time.sleep(2)  

dim = 824
id = FieldSchema(name='id', dtype=DataType.INT64, is_primary=True)
vector = FieldSchema(name='vector', dtype=DataType.FLOAT_VECTOR, dim=dim)
field_1 = FieldSchema(name='field_1', dtype=DataType.VARCHAR, max_length=255)
field_2 = FieldSchema(name='field_2', dtype=DataType.INT64)
field_3 = FieldSchema(name='field_3', dtype=DataType.VARCHAR, max_length=255)
field_4 = FieldSchema(name='field_4', dtype=DataType.VARCHAR, max_length=255)
fields = [id, vector, field_1, field_2, field_3, field_4]
schema = CollectionSchema(fields=fields, description=alias)
collection = Collection(
    name=collection_name,
    schema=schema,
    using=alias,
)
index_params = {'index_type': 'IVF_FLAT', 'params': {'nlist': 193, 'max_empty_result_buckets': 3565}, 'metric_type': 'IP'}
# index_params = {'index_type': 'GPU_IVF_FLAT', 'params': {'nlist': 193, 'max_empty_result_buckets': 3565}, 'metric_type': 'IP'}
collection.create_index("vector", index_params, timeout=100)

# first data insert
dataset = []
number = 1527
for i in range(number):
    vector = numpy_random.integers(-DATA_INT_MAX, DATA_INT_MAX, (1, dim))
    data = {
        'id': i,
        'vector': list(vector[0]),
        'field_1': numpy_random.choice(categories),
        'field_2': numpy_random.integers(-DATA_INT_MAX, DATA_INT_MAX),
        'field_3': numpy_random.choice(categories),
        'field_4': numpy_random.choice(categories)
    }
    dataset.append(data)
collection.insert(dataset)
collection.flush()
collection.load()

# second data insert
dataset = []
number = 486
for i in range(1527,number + 1527):
    vector = numpy_random.integers(-DATA_INT_MAX, DATA_INT_MAX, (1, dim))
    data = {
        'id': i,
        'vector': list(vector[0]),
        'field_1': numpy_random.choice(categories),
        'field_2': numpy_random.integers(-DATA_INT_MAX, DATA_INT_MAX),
        'field_3': numpy_random.choice(categories),
        'field_4': numpy_random.choice(categories)
    }
    dataset.append(data)
collection.insert(dataset)
collection.flush()
collection.load()

vector = numpy_random.integers(-DATA_INT_MAX, DATA_INT_MAX, (1, dim))
query_vector = list(vector[0])
res1 = collection.search(
    data=[query_vector],
    anns_field="vector",
    param={"metric_type": "IP",
            "params": {'nprobe': 49}},
    limit=3,
    expr='(field_4 == "yellow" || (field_3 < "red" and field_2 not in [68, 100, 69, 28, 24]))',
    timeout=100
    )
collection.release()
collection.drop_index()
collection.flush()
print(res1)
collection.drop()

result:

data: ["['id: 290, distance: 347435.0, entity: {}', 'id: 449, distance: 295739.0, entity: {}', 'id: 757, distance: 269902.0, entity: {}']"]
  4. Change 'index_type': 'IVF_FLAT' to 'index_type': 'GPU_IVF_FLAT' in index_params and run again.

result:

data: ["['id: 290, distance: 347435.0, entity: {}', 'id: 757, distance: 269902.0, entity: {}', 'id: 672, distance: 260236.0, entity: {}']"]
  5. If the data is inserted only once, or the two insertions are merged into one, the IVF_FLAT and GPU_IVF_FLAT results are the same.
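The merged-insert workaround in step 5 can be sketched as follows. This is a minimal illustration only: it reuses `dim`, `DATA_INT_MAX`, and the seeded generator from the script above, omits the schema's scalar fields for brevity, and leaves the actual `collection` calls commented out since they need a running Milvus instance.

```python
import numpy as np

numpy_random = np.random.default_rng(0)
DATA_INT_MAX = 100
dim = 824

def make_rows(start, count):
    # Build `count` rows with sequential ids; scalar fields from the full
    # schema (field_1..field_4) are omitted in this sketch.
    rows = []
    for i in range(start, start + count):
        vec = numpy_random.integers(-DATA_INT_MAX, DATA_INT_MAX, dim)
        rows.append({"id": i, "vector": [float(x) for x in vec]})
    return rows

# Build both batches first, then issue a single insert, so all vectors land
# in one growing segment before the flush.
merged = make_rows(0, 1527) + make_rows(1527, 486)
# collection.insert(merged)   # one insert instead of two
# collection.flush()
```

With a single insert the data ends up in one segment, which is the configuration the reporter says produces identical IVF_FLAT and GPU_IVF_FLAT results.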
yanliang567 commented 1 month ago

dup to #36610

liliu-z commented 1 month ago

/assign @Presburger

Presburger commented 1 month ago

@qwevdb Welcome to the Milvus GPU version. You can try increasing the nprobe value if you need more accurate results; a smaller nprobe sacrifices recall for better performance.
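As an illustration of that suggestion (a sketch only; `collection` and `query_vector` come from the reporter's script, and the search call is commented out because it needs a running Milvus), setting `nprobe` equal to `nlist` (193 in this report) probes every IVF bucket, so any remaining difference cannot come from the coarse quantizer skipping buckets:

```python
# With nprobe == nlist, the IVF search scans all 193 buckets, i.e. it is
# effectively exhaustive at the coarse-quantizer level.
NLIST = 193
search_params = {"metric_type": "IP", "params": {"nprobe": NLIST}}
# res = collection.search(
#     data=[query_vector],
#     anns_field="vector",
#     param=search_params,
#     limit=3,
# )
```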

qwevdb commented 1 month ago

> @qwevdb Welcome to using the Milvus GPU version. You can try increasing the nprobe value if you need more accurate results. A smaller nprobe sacrifices recall for better performance.

It doesn't seem to be a problem with nprobe.