milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0

[Bug]: INVERTED scalar filter has low precision in query/search #32717

Closed ghallsimpsons closed 1 week ago

ghallsimpsons commented 3 weeks ago

Environment

- Milvus version: 2.4
- Deployment mode(standalone or cluster): cluster
- MQ type(rocksmq, pulsar or kafka): pulsar   
- SDK version(e.g. pymilvus v2.0.0rc2): pymilvus v2.4.0
- OS(Ubuntu or CentOS): Ubuntu
- CPU/Memory: indexnode: 4x(2 CPU, 2GB); querynode: 2x(8CPU, 32GB)
- GPU: No
- Others:

Current Behavior

When running client.query(..., expr="my_ind == 1") where my_ind is of int type (tested w/ int16 and int32) and the index is INVERTED, only a small (though statistically significant) fraction of the results satisfy the condition. Typical precision is 20-40% (with a 10% underlying density). STL_SORT and no index both have 100% precision.
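
The precision figure above can be measured client-side. The following is a minimal sketch, assuming the query result is a list of dict-like rows (which is how the reproduction code in this report treats it). The helper name `filter_precision` and the sample rows are hypothetical and not part of pymilvus.

```python
from typing import Any, Callable, Iterable, Mapping, Optional


def filter_precision(
    rows: Iterable[Mapping[str, Any]],
    predicate: Callable[[Mapping[str, Any]], bool],
) -> Optional[float]:
    """Return the fraction of returned rows that satisfy the filter.

    Returns None when no rows were returned, since precision is undefined.
    """
    rows = list(rows)
    if not rows:
        return None
    matching = sum(1 for row in rows if predicate(row))
    return matching / len(rows)


# Hypothetical rows standing in for a query result.
sample_rows = [{"my_ind": 1}, {"my_ind": 1}, {"my_ind": 7}, {"my_ind": 1}]
precision = filter_precision(sample_rows, lambda row: row["my_ind"] == 1)
print(f"precision = {precision:.2f}")
```

A correct exact-match filter should always give a precision of 1.0 under this definition, whichever index backs the field.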

Expected Behavior

Either query(..., expr="my_ind == 1") should have 100% precision, or the documentation should be updated to describe the expected behavior.

Steps To Reproduce

from pymilvus import FieldSchema, CollectionSchema, DataType, MilvusClient
import numpy as np

idx = FieldSchema(name="id", dtype=DataType.INT64, is_primary=True)
vector = FieldSchema(name="vector", dtype=DataType.FLOAT_VECTOR, dim=128)
no_index = FieldSchema(name="no_index", dtype=DataType.INT16)
default_index = FieldSchema(name="default_index", dtype=DataType.INT16)
inv_index = FieldSchema(name="inv_index", dtype=DataType.INT16)
stl_index = FieldSchema(name="stl_index", dtype=DataType.INT16)
schema = CollectionSchema(fields=[idx, vector, no_index, default_index, inv_index, stl_index], auto_id=True)
client = MilvusClient()
client.drop_collection("index_test")
client.create_collection("index_test", schema=schema)

# Create (or remove) indices
index_params = client.prepare_index_params()
index_params.add_index(
    field_name="default_index",
    index_name="default_index"
)
index_params.add_index(
    field_name="inv_index",
    index_type="INVERTED",
    index_name="inv_index"
)
index_params.add_index(
    field_name="stl_index",
    index_type="STL_SORT",
    index_name="stl_index"
)
index_params.add_index(
    field_name="vector",
    index_type="IVF_SQ8",
    metric_type="L2",
    params={"nlist": 128},
)
client.create_index(
  collection_name="index_test",
  index_params=index_params
)
client.drop_index("index_test", "no_index")

# Make the collection large enough that the indexes are used
for _ in range(10000):
    data = []
    for _ in range(100):
        data.append(
            {
                "vector": np.random.rand(128),
                "no_index": np.random.randint(1000),
                "default_index": np.random.randint(1000),
                "inv_index": np.random.randint(1000),
                "stl_index": np.random.randint(1000),
            }
        )
    client.insert(
        "index_test",
        data=data,
    )

for key in ["no_index", "default_index", "stl_index", "inv_index"]:
    filt = f"{key} in {[i for i in range(1, 1000, 10)]}"
    client.load_collection("index_test")
    all_rows = client.query(
        "index_test",
        limit=128,
        output_fields=["no_index", "default_index", "inv_index", "stl_index"],
        filter=filt,
    )
    # These checks must run once per index, so they belong inside the loop.
    correct_rows = [row[key] for row in all_rows if row[key] % 10 == 1]
    print(f"Index {key}: Total of {len(all_rows)} rows")
    print(f"Index {key}: Total of {len(correct_rows)} correct rows")
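
For readers following the script, this small standalone sketch shows what filter string the f-string builds and why the `% 10 == 1` check is an equivalent correctness test. It uses only the standard library. The variable names `values` and `density` are introduced here for illustration.

```python
# Rebuild the filter expression exactly as the reproduction script does.
key = "inv_index"
values = [i for i in range(1, 1000, 10)]  # 1, 11, 21, ..., 991
filt = f"{key} in {values}"
print(filt[:48] + " ...")

# Field values are drawn from np.random.randint(1000), i.e. integers 0..999,
# so the allowed set covers 100 of 1000 possible values.
density = len(values) / 1000
print(f"{len(values)} allowed values, underlying density {density:.0%}")

# Membership in `values` is the same condition as `value % 10 == 1` on 0..999.
same_condition = set(values) == {v for v in range(1000) if v % 10 == 1}
print(f"membership equals modulo check: {same_condition}")
```

This accounts for the 10% underlying density mentioned under Current Behavior: a filter that had no effect at all would be expected to return about 10% matching rows.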


Milvus Log

No response

Anything else?

Based on these results, I believe this documentation is also wrong, and that the default scalar index for v2.4 is `INVERTED`: https://milvus.io/docs/scalar_index.md#Default-indexing

yanliang567 commented 3 weeks ago

/assign @longjiquan please help take a look; meanwhile, I will try to reproduce it in house

xiaofan-luan commented 2 weeks ago

@ghallsimpsons should you use the same random number for the different fields? Otherwise, how did you specify your ground truth? Both indexes should have 100% recall.

ghallsimpsons commented 2 weeks ago

> @ghallsimpsons should you use the same random number for the different fields? Otherwise, how did you specify your ground truth? Both indexes should have 100% recall.

Hi @xiaofan-luan, thanks for helping look into this. There is no ground truth here per se, except for what I am requesting via the query. That is, if I perform a search and add the filter inv_index == 1, I would expect every returned row to have inv_index == 1. This is true of the STL index and the no-index case, but not for the inverted index.
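
The expectation described here needs no separate ground truth, because each returned row can be checked against the filter itself. This is a minimal sketch of such a check, assuming dict-like result rows. `rows_violating_filter` and the sample data are hypothetical names for illustration, not a pymilvus API.

```python
from typing import Any, Iterable, List, Mapping


def rows_violating_filter(
    rows: Iterable[Mapping[str, Any]],
    key: str,
    allowed: Iterable[Any],
) -> List[Mapping[str, Any]]:
    """Return the rows whose `key` value is not in the allowed set."""
    allowed_set = set(allowed)
    return [row for row in rows if row[key] not in allowed_set]


# Hypothetical result of a query filtered with "inv_index == 1".
sample = [
    {"id": 1, "inv_index": 1},
    {"id": 2, "inv_index": 5},
    {"id": 3, "inv_index": 1},
]
violations = rows_violating_filter(sample, "inv_index", [1])
print(f"{len(violations)} row(s) violate the filter: {violations}")
```

An empty list is the expected outcome for any index type. Returning the offending rows themselves, not just a count, makes it easier to see which values leaked through.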

xiaofan-luan commented 2 weeks ago

Could you share your code and the result you get?

yanliang567 commented 2 weeks ago

I have reproduced the issue in house with the code above.

Index no_index: Total of 128 rows
Index no_index: Total of 128 correct rows
Index default_index: Total of 128 rows
Index default_index: Total of 49 correct rows
Index stl_index: Total of 128 rows
Index stl_index: Total of 128 correct rows
Index inv_index: Total of 128 rows
Index inv_index: Total of 49 correct rows

We can see that when filtering on the field with the inverted index, the query returns some results that are not in the filter list.
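
As a quick worked check, the counts reported above can be turned into per-index precision. The numbers are copied from the output in this comment. The dictionary name `reported_counts` is introduced here for illustration.

```python
# (total rows returned, rows satisfying the filter), copied from the output above.
reported_counts = {
    "no_index": (128, 128),
    "default_index": (128, 49),
    "stl_index": (128, 128),
    "inv_index": (128, 49),
}

precision_by_index = {
    name: correct / total for name, (total, correct) in reported_counts.items()
}

for name, precision in precision_by_index.items():
    total, correct = reported_counts[name]
    print(f"{name}: {correct}/{total} = {precision:.1%}")
```

The inverted index comes out at 49/128, about 38%, which falls within the 20-40% range given in the original report. The default index shows the same count as the inverted one, which is consistent with the reporter's suspicion that the default scalar index in v2.4 is INVERTED, though the counts alone do not prove it.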

longjiquan commented 1 week ago

Thanks for reporting the bug, @ghallsimpsons. It is already fixed in https://github.com/milvus-io/milvus/pull/32858

ghallsimpsons commented 1 week ago

Very nice, thanks for the quick fix! I'll give it a go again when 2.4.2 is released.