milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0

[Bug]: Using GPU_IVF_FLAT and IP to search with the same parameters as IVF_FLAT after two data insertions brings different results #36607

Closed qwevdb closed 1 month ago

qwevdb commented 1 month ago

Is there an existing issue for this?

Environment

- Milvus version: milvus v2.4.12-gpu
- Deployment mode(standalone or cluster): standalone
- MQ type(rocksmq, pulsar or kafka): rocksmq   
- SDK version(e.g. pymilvus v2.0.0rc2): pymilvus v2.4.5
- OS(Ubuntu or CentOS): Ubuntu 24.04 LTS
- CPU/Memory: Intel Core i7-11700 / 64G
- GPU: NVIDIA GeForce RTX 4090
- Others:

Current Behavior

The result of searching with an IVF_FLAT index and the IP metric after two data insertions differs from searching with a GPU_IVF_FLAT index, the IP metric, and the same parameters. If the data is inserted only once, or the two insertions are merged into one, the results are identical.

Expected Behavior

Both the IVF_FLAT and GPU_IVF_FLAT indexes with the IP metric and the same parameters should produce the same results.

Steps To Reproduce

  1. Create an IVF_FLAT index with IP metric in the collection.
  2. Insert data into the collection twice.
  3. Search with the following script:
import time
from pymilvus import Collection, connections, FieldSchema, CollectionSchema, DataType, utility
import numpy as np

FLOAT_MAX = 5000
DATA_INT_MAX = 100
categories = ["green", "blue", "yellow", "red", "black", "white", "purple", "pink", "orange", "brown", "grey"] 

numpy_random = np.random.default_rng(0)
alias = "bench"
collection_name = "Benchmark"
# connections.connect() registers the connection under the alias; it returns None
connections.connect(
    alias=alias,
    host="localhost",
    port="19530"
)
if utility.has_collection(collection_name, using=alias):
    collection = Collection(name=collection_name, using=alias)
    collection.drop()
    time.sleep(2)  

dim = 824
id = FieldSchema(name='id', dtype=DataType.INT64, is_primary=True)
vector = FieldSchema(name='vector', dtype=DataType.FLOAT_VECTOR, dim=dim)
field_1 = FieldSchema(name='field_1', dtype=DataType.VARCHAR, max_length=255)
field_2 = FieldSchema(name='field_2', dtype=DataType.INT64)
field_3 = FieldSchema(name='field_3', dtype=DataType.VARCHAR, max_length=255)
field_4 = FieldSchema(name='field_4', dtype=DataType.VARCHAR, max_length=255)
fields = [id, vector, field_1, field_2, field_3, field_4]
schema = CollectionSchema(fields=fields, description=alias)
collection = Collection(
    name=collection_name,
    schema=schema,
    using=alias,
)
index_params = {'index_type': 'IVF_FLAT', 'params': {'nlist': 193, 'max_empty_result_buckets': 3565}, 'metric_type': 'IP'}
# index_params = {'index_type': 'GPU_IVF_FLAT', 'params': {'nlist': 193, 'max_empty_result_buckets': 3565}, 'metric_type': 'IP'}
collection.create_index("vector", index_params, timeout=100)

# first data insert
dataset = []
number = 1527
for i in range(number):
    vector = numpy_random.integers(-DATA_INT_MAX, DATA_INT_MAX, (1, dim))
    data = {
        'id': i,
        'vector': list(vector[0]),
        'field_1': numpy_random.choice(categories),
        'field_2': numpy_random.integers(-DATA_INT_MAX, DATA_INT_MAX),
        'field_3': numpy_random.choice(categories),
        'field_4': numpy_random.choice(categories)
    }
    dataset.append(data)
collection.insert(dataset)
collection.flush()
collection.load()

# second data insert
dataset = []
number = 486
for i in range(1527,number + 1527):
    vector = numpy_random.integers(-DATA_INT_MAX, DATA_INT_MAX, (1, dim))
    data = {
        'id': i,
        'vector': list(vector[0]),
        'field_1': numpy_random.choice(categories),
        'field_2': numpy_random.integers(-DATA_INT_MAX, DATA_INT_MAX),
        'field_3': numpy_random.choice(categories),
        'field_4': numpy_random.choice(categories)
    }
    dataset.append(data)
collection.insert(dataset)
collection.flush()
collection.load()

vector = numpy_random.integers(-DATA_INT_MAX, DATA_INT_MAX, (1, dim))
query_vector = list(vector[0])
res1 = collection.search(
    data=[query_vector],
    anns_field="vector",
    param={"metric_type": "IP",
            "params": {'nprobe': 49}},
    limit=3,
    expr='(field_4 == "yellow" || (field_3 < "red" and field_2 not in [68, 100, 69, 28, 24]))',
    timeout=100
    )
collection.release()
collection.drop_index()
collection.flush()
print(res1)
collection.drop()

result:

data: ["['id: 290, distance: 347435.0, entity: {}', 'id: 449, distance: 295739.0, entity: {}', 'id: 757, distance: 269902.0, entity: {}']"]
  4. Change 'index_type': 'IVF_FLAT' to 'index_type': 'GPU_IVF_FLAT' in index_params and run again.

result:

data: ["['id: 290, distance: 347435.0, entity: {}', 'id: 757, distance: 269902.0, entity: {}', 'id: 672, distance: 260236.0, entity: {}']"]
  5. If the data is inserted only once, or the two insertions are merged into one, the IVF_FLAT and GPU_IVF_FLAT results are the same.
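The merged-insert workaround in step 5 can be sketched as follows. This is a minimal illustration only: it reuses `dim`, `DATA_INT_MAX`, and the seeded generator from the script above, omits the schema's scalar fields for brevity, and leaves the actual `collection` calls commented out since they need a running Milvus instance.

```python
import numpy as np

numpy_random = np.random.default_rng(0)
DATA_INT_MAX = 100
dim = 824

def make_rows(start, count):
    # Build `count` rows with sequential ids; scalar fields from the full
    # schema (field_1..field_4) are omitted in this sketch.
    rows = []
    for i in range(start, start + count):
        vec = numpy_random.integers(-DATA_INT_MAX, DATA_INT_MAX, dim)
        rows.append({"id": i, "vector": [float(x) for x in vec]})
    return rows

# Build both batches first, then issue a single insert, so all vectors land
# in one growing segment before the flush.
merged = make_rows(0, 1527) + make_rows(1527, 486)
# collection.insert(merged)   # one insert instead of two
# collection.flush()
```

With a single insert the data ends up in one segment, which is the configuration the reporter says produces identical IVF_FLAT and GPU_IVF_FLAT results.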
yanliang567 commented 1 month ago

dup to #36610

liliu-z commented 1 month ago

/assign @Presburger

Presburger commented 1 month ago

@qwevdb Welcome to the Milvus GPU version. You can try increasing the nprobe value if you need more accurate results; a smaller nprobe sacrifices recall for better performance.
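As an illustration of that suggestion (a sketch only; `collection` and `query_vector` come from the reporter's script, and the search call is commented out because it needs a running Milvus), setting `nprobe` equal to `nlist` (193 in this report) probes every IVF bucket, so any remaining difference cannot come from the coarse quantizer skipping buckets:

```python
# With nprobe == nlist, the IVF search scans all 193 buckets, i.e. it is
# effectively exhaustive at the coarse-quantizer level.
NLIST = 193
search_params = {"metric_type": "IP", "params": {"nprobe": NLIST}}
# res = collection.search(
#     data=[query_vector],
#     anns_field="vector",
#     param=search_params,
#     limit=3,
# )
```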

qwevdb commented 1 month ago

> @qwevdb Welcome to using the Milvus GPU version. You can try increasing the nprobe value if you need more accurate results. A smaller nprobe sacrifices recall for better performance.

It doesn't seem to be a problem with nprobe.