milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0

[Bug]: Vector in the DB not being found when searching on the exact vector. #27581

Closed: OverlordRon closed this issue 1 year ago

OverlordRon commented 1 year ago

Is there an existing issue for this?

  • [x] I have searched the existing issues

Environment

- Milvus version: 2.3.1
- Deployment mode(standalone or cluster): standalone
- MQ type(rocksmq, pulsar or kafka):    
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): Ubuntu 20.04
- CPU/Memory: Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz / 128 GB Memory
- GPU: 
- Others:

Current Behavior

When I upsert a collection of vectors into the DB and then search for one of the vectors I just upserted, the search results do not include the exact vector I am looking for.

# Prepare index params
index_params = {
    "metric_type": "L2",  # cos(ip) Euclidean distance
    "index_type": "FLAT",  # for floating point vectors
    "params": {"nlist": 1024},  # parameters specific to index
    # "nlist": IVF_FLAT divides vector data into nlist cluster units
}

# Prepare search params
search_params = {
    "metric_type": "L2",  # Euclidean distance, ARvind uses Inner Product (may require normalization of vectors)
    "offset": 10,  # Retrieve 20 closest vectors (+/- 5)
    "ignore_growing": False,
    "params": {"nprobe": 10},  # number of cluster units to search, must be < nlist
}

# Set search vector as a single vector example from the upserted data
vec = data[:][9][500]

# Search
results = collection.search(
    data=[vec],
    anns_field="vector",  # name of the field to search on
    # the sum of offset in param and limit
    # should be less than 16384.
    param=search_params,
    limit=2,
    expr=None,
    # set the names of the fields you want to
    # retrieve from the search result.
    output_fields=['company_name', 'plaintext', 'vector'],
    # consistency_level="Strong"
)

print(results[0].ids)

Output: [479, 571]

The output should contain [500] because that is the exact vector being searched.

Expected Behavior

The expected results[0].ids should contain [500], but it does not. Vector 500 was the exact vector being searched for in the DB.

Steps To Reproduce

No response

Milvus Log

No response

Anything else?

No response

xiaofan-luan commented 1 year ago

You can try increasing nprobe to 64 to see if it works.
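For context on what nlist and nprobe control, here is a toy pure-Python sketch of an IVF-style search. It is not Milvus code: `build_ivf` and `ivf_search` are hypothetical helpers, the data is random, and the centroids are sampled from the stored vectors instead of being trained with k-means as a real index would do.

```python
import math
import random


def build_ivf(vectors, nlist, rng):
    """Pick nlist stored vectors as centroids and assign every vector to its nearest one."""
    centroids = rng.sample(vectors, nlist)
    buckets = [[] for _ in range(nlist)]
    for idx, vec in enumerate(vectors):
        nearest = min(range(nlist), key=lambda c: math.dist(vec, centroids[c]))
        buckets[nearest].append(idx)
    return centroids, buckets


def ivf_search(vectors, centroids, buckets, query, limit, nprobe):
    """Scan only the nprobe buckets whose centroids are closest to the query."""
    order = sorted(range(len(centroids)), key=lambda c: math.dist(query, centroids[c]))
    candidates = [idx for c in order[:nprobe] for idx in buckets[c]]
    candidates.sort(key=lambda idx: math.dist(vectors[idx], query))
    return candidates[:limit]


rng = random.Random(0)
# Hypothetical data: 1000 random 8-dimensional vectors; the list index is the id.
vectors = [[rng.random() for _ in range(8)] for _ in range(1000)]
centroids, buckets = build_ivf(vectors, nlist=16, rng=rng)

query = vectors[500]  # search for a vector that is already stored
print(ivf_search(vectors, centroids, buckets, query, limit=2, nprobe=1))
print(ivf_search(vectors, centroids, buckets, query, limit=2, nprobe=16))
```

A larger nprobe scans more buckets, which can recover near neighbours that sit in adjacent clusters. In this toy version a query that equals a stored vector always lands in that vector's own bucket, so a small nprobe by itself does not hide an exact match.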

xiaofan-luan commented 1 year ago

search_params = {
    "metric_type": "L2",  # Euclidean distance, ARvind uses Inner Product (may require normalization of vectors)
    "offset": 10,  # Retrieve 20 closest vectors (+/- 5)
    "ignore_growing": False,
    "params": {"nprobe": 10},  # number of cluster units to search, must be < nlist
}

Why do you set offset to 10?

xiaofan-luan commented 1 year ago

By setting offset to 10, we skip the top 10 most similar vectors. The exact match is the most similar vector, so it is among the ones skipped.
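A minimal pure-Python sketch of that behavior, for illustration only. It is not Milvus code: `brute_force_search` and the random data are hypothetical, and it assumes that offset simply drops the first `offset` hits of the ranked list before `limit` is applied.

```python
import math
import random


def brute_force_search(vectors, query, limit, offset=0):
    """Rank all vectors by L2 distance to `query`, skip `offset` hits, return `limit` ids."""
    ranked = sorted(range(len(vectors)), key=lambda i: math.dist(vectors[i], query))
    return ranked[offset:offset + limit]


random.seed(0)
# Hypothetical data: 1000 random 8-dimensional vectors; the list index is the id.
vectors = [[random.random() for _ in range(8)] for _ in range(1000)]
query = vectors[500]  # search for a vector that is already stored

with_offset = brute_force_search(vectors, query, limit=2, offset=10)
without_offset = brute_force_search(vectors, query, limit=2, offset=0)

print(500 in with_offset)   # the exact match ranks first, so offset=10 skips it
print(without_offset[0])    # with offset=0 the exact match is the first hit
```

In this sketch offset behaves like pagination: offset=10 with limit=2 returns the 11th and 12th closest vectors, which is why the stored vector itself never shows up.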

OverlordRon commented 1 year ago

Yes! That is the reason. Thank you, @xiaofan-luan. That was a big help.
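For anyone landing here later, a sketch of the adjusted call, assuming the same collection and field names as in the original post. `search_top_k` is a hypothetical helper, and `collection` is assumed to be a loaded pymilvus `Collection`; the only change from the original parameters is the `offset` value.

```python
def search_top_k(collection, vec, k=2):
    """Return the ids of the k nearest neighbours of `vec`, starting from the best hit."""
    search_params = {
        "metric_type": "L2",
        "offset": 0,  # start at the best hit; the original post used 10
        "ignore_growing": False,
        "params": {"nprobe": 10},
    }
    results = collection.search(
        data=[vec],
        anns_field="vector",  # name of the field to search on
        param=search_params,
        limit=k,
        expr=None,
        output_fields=["company_name", "plaintext", "vector"],
    )
    # One query vector was sent, so results[0] holds its hits.
    return list(results[0].ids)
```

Called as `search_top_k(collection, vec)` with the vector from the original post, the ids returned should now start from the closest match instead of the 11th.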