milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0

[Bug]: Inconsistent Query Results with Identical Filters in Milvus #33350

Closed HantaoCai closed 1 month ago

HantaoCai commented 2 months ago

Is there an existing issue for this?

Environment

- Milvus version: V2.3.15
- Deployment mode (standalone or cluster): standalone
- MQ type (rocksmq, pulsar or kafka): pulsar

Current Behavior

Hello,

I've encountered an issue while performing queries in Milvus. I noticed that using the same filtering criteria results in different outcomes. Attached is a video demonstrating this behavior:

video file

Additionally, the backup file exported with the backup tool is too large to upload directly. Could you provide an email address I can send it to?

Your assistance in investigating the cause of this inconsistency would be greatly appreciated.

Thank you.

Expected Behavior

No response

Steps To Reproduce

No response

Milvus Log

No response

Anything else?

No response

yanliang567 commented 2 months ago

@HantaoCai please send it to my email: yanliang.qiao@zilliz.com. If you could also provide the Milvus logs, that would be perfect.

yanliang567 commented 2 months ago

/assign @HantaoCai

HantaoCai commented 2 months ago

The document has been sent, please check your inbox.

yanliang567 commented 2 months ago

@zhuwenxing is trying to reproduce the issue with your data

zhuwenxing commented 2 months ago

Yes, it can be reproduced.

Reproduce script:

from pymilvus import connections, Collection

connections.connect()
c = Collection(name="gemini_library_v5_bak")
print(c.describe())

map = {}

# First query: record the tags value of every primary key as the baseline.
res = c.query(expr="file_id == 3058", output_fields=["*"])
pk_list = [r['index_id'] for r in res]
for i in range(1):
    res = c.query(expr="file_id == 3058", output_fields=["*"])
    print(len(res))
    assert set(pk_list) == set([r['index_id'] for r in res])
    for r in res:
        if r['index_id'] not in map:
            map[r['index_id']] = r['tags']
        else:
            map[r['index_id']] += r['tags']
print("first time query, then compare following time query result with first time query result")

# Re-run the same query ten times and compare each result against the baseline.
for i in range(10):
    diff_cnt = 0
    tmp = {}
    res = c.query(expr="file_id == 3058", output_fields=["*"])
    print(len(res))
    assert set(pk_list) == set([r['index_id'] for r in res])
    for r in res:
        tmp[r['index_id']] = r['tags']
        if map[r['index_id']] != r['tags']:
            diff_cnt += 1
            # print(f"diff found: {r['index_id']}, {map[r['index_id']]} != {r['tags']}")
    print(f"in compare time  {i}, find diff count: {diff_cnt}")

Result: for the same index_id, the value of tags is sometimes ["1"] and sometimes [].

diff found: ff386894910459ec6cc058b0e395015f, ['1'] != []
diff found: ff42158f2e5623c5fe07d9cff4cd9965, ['1'] != []
diff found: ff68acd76bfbdfc189e6d87631810a40, ['1'] != []
diff found: ff84615dbc0a42a8509dbc11b0b7fac5, ['1'] != []
diff found: ff8c617c8c4633a5598968f661f7343c, ['1'] != []
diff found: ffd0810e0b8993435e8e8792204d1a65, ['1'] != []
diff found: ffd1b8e0759fd12991a255f4d2f522fb, ['1'] != []
diff found: ffe59056e7a3235628c80d2150b13dbd, ['1'] != []
diff found: fff08735118cc834861d4b9a892cbb6b, ['1'] != []
diff found: fff89ffe055b52cd804a21fa0e11475e, ['1'] != []
diff found: fff9e71c0e45270462cee2d3fe80c6d5, ['1'] != []
{'collection_name': 'gemini_library_v5_bak', 'auto_id': False, 'num_shards': 1, 'description': 'gemini矢量表', 'fields': [{'field_id': 100, 'name': 'index_id', 'description': '', 'type': <DataType.VARCHAR: 21>, 'params': {'max_length': 110}, 'is_primary': True}, {'field_id': 101, 'name': 'vector', 'description': '', 'type': <DataType.FLOAT_VECTOR: 101>, 'params': {'dim': 1536}}, {'field_id': 102, 'name': 'partition_id', 'description': '', 'type': <DataType.INT64: 5>, 'params': {}}, {'field_id': 103, 'name': 'file_id', 'description': '', 'type': <DataType.INT64: 5>, 'params': {}}, {'field_id': 104, 'name': 'chunk_id', 'description': '', 'type': <DataType.INT64: 5>, 'params': {}}, {'field_id': 105, 'name': 'tags', 'description': '', 'type': <DataType.ARRAY: 22>, 'params': {'max_length': 200, 'max_capacity': 1024}, 'element_type': <DataType.VARCHAR: 21>}], 'aliases': [], 'collection_id': 450111991800541421, 'consistency_level': 2, 'properties': {}, 'num_partitions': 65, 'enable_dynamic_field': True}
2414
first time query, then compare following time query result with first time query result
2414
in compare time  0, find diff count: 2414
2414
in compare time  1, find diff count: 0
2414
in compare time  2, find diff count: 0
2414
in compare time  3, find diff count: 0
2414
in compare time  4, find diff count: 0
2414
in compare time  5, find diff count: 0
2414
in compare time  6, find diff count: 0
2414
in compare time  7, find diff count: 0
2414
in compare time  8, find diff count: 0
2414
in compare time  9, find diff count: 2414
xiaofan-luan commented 2 months ago

/assign @longjiquan

zhuwenxing commented 2 months ago

Added a step to check the count for each PK and found that each PK has two entities.

So the same PK was inserted twice, with different data.

# For each primary key collected above, count how many entities share it.
for k, v in map.items():
    res = c.query(expr=f"index_id == '{k}'", output_fields=["count(*)"])
    print(f"{k} {res}")

Output (excerpt):

ffd1b8e0759fd12991a255f4d2f522fb data: ["{'count(*)': 2}"] ..., extra_info: {'cost': 0}
ffe59056e7a3235628c80d2150b13dbd data: ["{'count(*)': 2}"] ..., extra_info: {'cost': 0}
fff08735118cc834861d4b9a892cbb6b data: ["{'count(*)': 2}"] ..., extra_info: {'cost': 0}
fff89ffe055b52cd804a21fa0e11475e data: ["{'count(*)': 2}"] ..., extra_info: {'cost': 0}
fff9e71c0e45270462cee2d3fe80c6d5 data: ["{'count(*)': 2}"] ..., extra_info: {'cost': 0}
yanliang567 commented 2 months ago

@HantaoCai this proves that the data you inserted contains duplicate primary keys, which is what causes the query results to change. Please de-duplicate the data first. /assign @HantaoCai /unassign @zhuwenxing @longjiquan

HantaoCai commented 2 months ago

I previously inquired about why multiple records with the same primary key could be inserted, and the response I received was that the data retrieved would be the new record with the same primary key. This behavior is different from what I am currently experiencing. Should this be considered a bug?

We are long-time users of Milvus, and version 2.2 did not have an upsert feature, so our historical code may still contain Insert calls. Additionally, we expect, as with traditional databases, that primary keys are unique. From a user's perspective the Insert method is therefore not very meaningful, and we believe upsert should be favored instead.
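
For illustration, here is a minimal sketch of what an upsert-based write could look like, assuming pymilvus 2.3 or later (where Collection.upsert is available) and a client that accepts row-based dicts; the field names follow the describe() output above, but all values are hypothetical placeholders.

from pymilvus import connections, Collection

connections.connect()
c = Collection(name="gemini_library_v5_bak")

# Hypothetical row; field names follow the collection schema shown above.
row = {
    "index_id": "example-pk-0001",   # primary key (VARCHAR)
    "vector": [0.0] * 1536,          # placeholder embedding
    "partition_id": 1,
    "file_id": 3058,
    "chunk_id": 0,
    "tags": ["1"],
}

# upsert removes any existing entity with this primary key before writing the
# new one, so the PK stays unique; insert would simply add a second entity.
c.upsert([row])
c.flush()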

zhuwenxing commented 2 months ago

the response I received was that the data retrieved would be the new record with the same primary key.

Yes, the expected behavior should be like this.

We are investigating whether restoring or importing data that contains identical primary keys produces identical timestamps, which would lead to the current issue.

xiaofan-luan commented 2 months ago

@HantaoCai I think there is no clear way to deduplicate across all PKs; doing so under filtering and search is not feasible. Even without upsert, you can delete the old data and then insert the new record to avoid duplication.
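
For reference, a rough sketch of the delete-then-insert pattern mentioned above, for code paths that cannot switch to upsert yet; the primary key and field values are hypothetical placeholders.

from pymilvus import connections, Collection

connections.connect()
c = Collection(name="gemini_library_v5_bak")

pk = "example-pk-0001"  # hypothetical primary key
new_row = {
    "index_id": pk,
    "vector": [0.0] * 1536,  # placeholder embedding
    "partition_id": 1,
    "file_id": 3058,
    "chunk_id": 1,
    "tags": ["1"],
}

# Delete whatever already exists under this primary key ...
c.delete(expr=f"index_id == '{pk}'")
# ... then insert the fresh record, so only one entity carries the PK.
c.insert([new_row])
c.flush()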

HantaoCai commented 2 months ago

We will be cleaning up our data and updating all historical code to use the upsert method.

We would like queries either to return all records that share a PK, or to guarantee that only the latest record for a given PK is returned. Either behavior would help us quickly identify or prevent these issues. However, we also recognize that returning only the latest record for a PK may not always be the best approach.

In any case, thank you for your assistance in investigating the root cause of the issue.

HantaoCai commented 2 months ago

When I set out to clean up the data, I encountered a problem. It appears that this is not a simple task; I am unable to delete data using the PK as a reference.

Is there a way for me to filter out the data that has the same primary key but is older in terms of the timestamp?

HantaoCai commented 2 months ago

Regarding this issue: in Attu, my current goal is to delete the records with duplicate primary keys, removing the older ones. After comparing against our scalar database, I have identified which records need to be deleted. However, during a test deletion in Attu, I found that the delete expression Attu generates is based on the primary key rather than on the filter criteria I provided, so all records sharing the same primary key were deleted. For the product's deletion feature, I would expect the delete expression to be generated from my filter criteria, not from the primary key, because the primary key is not unique at the moment. @xiaofan-luan @yanliang567
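
One possible cleanup approach under this constraint, sketched only as an illustration: since a delete expression on the primary key removes every entity that shares it, each duplicated PK can be deleted entirely and the single record to keep re-inserted afterwards. records_to_keep is a hypothetical mapping (PK -> full row) assembled from the scalar database mentioned above; all field values other than the PK are placeholders.

from pymilvus import connections, Collection

connections.connect()
c = Collection(name="gemini_library_v5_bak")

records_to_keep = {
    "ff386894910459ec6cc058b0e395015f": {
        "index_id": "ff386894910459ec6cc058b0e395015f",
        "vector": [0.0] * 1536,   # placeholder; use the real embedding
        "partition_id": 1,
        "file_id": 3058,
        "chunk_id": 0,
        "tags": ["1"],
    },
    # ... one entry per duplicated primary key
}

for pk, row in records_to_keep.items():
    c.delete(expr=f"index_id == '{pk}'")  # removes all entities sharing this PK
    c.insert([row])                       # write back exactly one copy
c.flush()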

stale[bot] commented 1 month ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.