milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0

[Bug]: search results are fewer than topk even when using nprobe=nlist #32618

Closed: yanliang567 closed this issue 4 months ago

yanliang567 commented 4 months ago

Is there an existing issue for this?

Environment

- Milvus version: master-20240425-f06509bf-amd64
- Deployment mode(standalone or cluster): standalone
- MQ type(rocksmq, pulsar or kafka):    
- SDK version(e.g. pymilvus v2.0.0rc2):

Current Behavior

search results are fewer than topk when topk is 1000

Expected Behavior

the number of search results equals topk

Steps To Reproduce

1. create a collection
2. insert 10000 entities and flush
3. build an index (IVF_FLAT, IVF_SQ8, or HNSW)
4. load the collection
5. search with topk=1000 and high search params (nprobe=nlist for IVF* indexes)

Milvus Log

Reproduction code:

        # create
        name = cf.gen_unique_str(prefix)
        t0 = time.time()
        collection_w = self.init_collection_wrap(name=name, active_trace=True)
        tt = time.time() - t0
        assert collection_w.name == name

        # insert
        for _ in range(5):
            data = cf.gen_default_list_data()
            t0 = time.time()
            _, res = collection_w.insert(data)
            tt = time.time() - t0
            log.info(f"assert insert: {tt}")
            assert res

        # flush
        t0 = time.time()
        _, check_result = collection_w.flush(timeout=180)
        assert check_result
        assert collection_w.num_entities == len(data[0]) * 5
        tt = time.time() - t0
        entities = collection_w.num_entities
        log.info(f"assert flush: {tt}, entities: {entities}")

        # index
        # index_params = {"index_type": "HNSW", "params": {"M":32, "efConstruction": 360}, "metric_type": "L2"}
        index_params = {"index_type": "IVF_FLAT", "params": {"nlist": 64}, "metric_type": "L2"}
        t0 = time.time()
        index, _ = collection_w.create_index(field_name=ct.default_float_vec_field_name,
                                             index_params=index_params,
                                             index_name=cf.gen_unique_str())
        index, _ = collection_w.create_index(field_name=ct.default_string_field_name,
                                             index_params={},
                                             index_name=cf.gen_unique_str())
        tt = time.time() - t0
        log.info(f"assert index: {tt}")
        assert len(collection_w.indexes) == 2

        entities = collection_w.num_entities
        log.info(f"assert create collection: {tt}, init_entities: {entities}")

        # load
        collection_w.load()

        # search
        search_vectors = cf.gen_vectors(1, ct.default_dim)
        # search_params = {"metric_type": "L2", "params": {"ef": 2000}}
        search_params = {"metric_type": "L2", "params": {"nprobe": 64}}
        t0 = time.time()
        res_1, _ = collection_w.search(data=search_vectors,
                                       anns_field=ct.default_float_vec_field_name,
                                       param=search_params, limit=1000)
        tt = time.time() - t0
        log.info(f"assert search: {tt}")
        assert len(res_1[0]) == 1000

Anything else?

No response

yanliang567 commented 4 months ago

/assign @liliu-z /unassign

yanliang567 commented 4 months ago

it reproduces even with the FLAT index :(

xiaofan-luan commented 4 months ago

Could this be related to the interim index?

  1. For some of the index types, we should not use the interim index.
  2. The interim index has a fixed nprobe, but it seems it should be configurable?
  3. Maybe rethink the implementation of the interim index?
liliu-z commented 4 months ago

Replying to @xiaofan-luan's points above:

  1. The interim index is disabled for FLAT.
  2. The interim index's params are tunable through milvus.yaml (see the config sketch below).
  3. Yes, we are on it.
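
For reference, a minimal sketch of the relevant milvus.yaml section is below. The exact key names and defaults are assumptions on my part, so verify them against the milvus.yaml that ships with your Milvus version:

# Sketch only: the interimIndex key names and defaults are assumptions,
# check them against the milvus.yaml of your release.
queryNode:
  segcore:
    interimIndex:
      enableIndex: true   # build a temporary index on segments that are not yet fully indexed
      nlist: 128          # nlist used for the interim IVF-style index
      nprobe: 16          # fixed nprobe used when searching the interim index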

Will take a look at this issue

liliu-z commented 4 months ago

/assign @zhengbuqian

zhengbuqian commented 4 months ago

The data were generated by 5 calls to cf.gen_default_list_data(). This method generates identical int/float field data, so the pk fields of all 5 inserts were identical: 0 to 1999.

Modifying the insertion part of the script as follows makes the test case pass:

        ttl = 10000
        batch = 200
        # insert in 50 batches of 200, each starting at a new offset so the pks never repeat
        for x in range(ttl // batch):
            data = cf.gen_default_list_data(nb=batch, start=batch * x)
            t0 = time.time()
            _, res = collection_w.insert(data)
            tt = time.time() - t0
            log.info(f"assert insert: {tt}")
            assert res
zhengbuqian commented 4 months ago

That being said, we should still have sufficient rows for a top-1000 query; still looking.

zhengbuqian commented 4 months ago

In the original script, we inserted all 10k rows in 5 batches and then called flush, so all 10k rows ended up in the same sealed segment. Only the 2k rows in the last batch were valid; the previous 8k rows were invalidated by the duplicate primary keys.

I checked at search time: the only knowhere index of the only sealed segment contains all 10k rows, but the bitset passed in did not filter out a single row.

So knowhere returned the top 1k of all 10k rows, and Milvus then removed some of those 1k results because they belong to the first 8k (overwritten) rows.

Milvus failed to properly set up the bitset before sending the request to knowhere.

The other issue is that insert should fail on duplicate primary keys; overwriting should only happen for upsert.
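
To make the effect concrete, here is a rough, self-contained simulation in plain Python (not Milvus internals; the sizes just mirror the numbers above): the index returns the top 1000 over all 10k rows, and dropping the hits that point at overwritten (duplicate-pk) rows leaves far fewer than 1000 results.

import random

NB_PER_BATCH = 2000   # pk range 0..1999 reused by each of the 5 inserts
BATCHES = 5
TOTAL = NB_PER_BATCH * BATCHES   # 10k rows in the sealed segment
TOPK = 1000

# Only the last insert "wins" for each pk, so the first 8k rows are overwritten.
valid_rows = set(range(TOTAL - NB_PER_BATCH, TOTAL))

# Pretend the segment index returned topk row offsets over all 10k rows,
# because the bitset did not mask out the overwritten rows.
index_hits = random.sample(range(TOTAL), TOPK)

# Reduction then drops hits that point at overwritten rows, shrinking the result set.
survivors = [row for row in index_hits if row in valid_rows]
print(f"{len(index_hits)} hits from the index, "
      f"{len(survivors)} left after dropping overwritten rows")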

zhengbuqian commented 4 months ago

/assign @yanliang567

zhengbuqian commented 4 months ago

/unassign

zhengbuqian commented 4 months ago

Chatted with @liliu-z: it is currently expected behavior to get fewer than k results when duplicate pks are present.

liliu-z commented 4 months ago

Looks like the root cause is duplicated pks. We have two ways to treat this:

  1. If duplicated pks are expected, then we should not reduce on them.
  2. If duplicated pks are not expected, we need to mark this as a bug until it gets fixed.

/unassign

yanliang567 commented 4 months ago

Mmm...let me check my script today.

yanliang567 commented 4 months ago

My mistake in the test script. :(