milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0

[Bug]: Search result length is not equal to the limit(topK) value after reinstallation #24613

Closed zhuwenxing closed 1 year ago

zhuwenxing commented 1 year ago

Is there an existing issue for this?

Environment

- Milvus version:2.2.0-20230601-5710752f
- Deployment mode(standalone or cluster):cluster
- MQ type(rocksmq, pulsar or kafka): kafka   
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

[2023-06-01T13:05:03.722Z] 2023-06-01 13:05:03.302 | INFO     | MainThread |utils:load_and_search:206 - collection name: task_2_IVF_PQ

[2023-06-01T13:05:03.722Z] 2023-06-01 13:05:03.302 | INFO     | MainThread |utils:load_and_search:207 - load collection

[2023-06-01T13:05:03.722Z] 2023-06-01 13:05:03.309 | INFO     | MainThread |utils:load_and_search:211 - load time: 0.0070

[2023-06-01T13:05:03.722Z] 2023-06-01 13:05:03.320 | INFO     | MainThread |utils:load_and_search:225 - {'metric_type': 'L2', 'params': {'nprobe': 10}}

[2023-06-01T13:05:03.722Z] 2023-06-01 13:05:03.320 | INFO     | MainThread |utils:load_and_search:228 - 

[2023-06-01T13:05:03.722Z] Search...

[2023-06-01T13:05:03.722Z] 2023-06-01 13:05:03.327 | INFO     | MainThread |utils:load_and_search:239 - hit: id: 930, distance: 28.98775291442871, entity: {'count': 930, 'random_value': -13.0}

[2023-06-01T13:05:03.722Z] 2023-06-01 13:05:03.327 | INFO     | MainThread |utils:load_and_search:239 - hit: id: 2343, distance: 31.38789176940918, entity: {'count': 2343, 'random_value': -16.0}

[2023-06-01T13:05:03.722Z] 2023-06-01 13:05:03.327 | INFO     | MainThread |utils:load_and_search:239 - hit: id: 1325, distance: 31.5164852142334, entity: {'count': 1325, 'random_value': -15.0}

[2023-06-01T13:05:03.722Z] 2023-06-01 13:05:03.327 | INFO     | MainThread |utils:load_and_search:239 - hit: id: 2867, distance: 32.024906158447266, entity: {'count': 2867, 'random_value': -18.0}

[2023-06-01T13:05:03.722Z] Traceback (most recent call last):

[2023-06-01T13:05:03.722Z]   File "scripts/action_after_reinstall.py", line 47, in <module>

[2023-06-01T13:05:03.722Z]     task_2(data_size, host)

[2023-06-01T13:05:03.722Z]   File "scripts/action_after_reinstall.py", line 29, in task_2

[2023-06-01T13:05:03.722Z]     load_and_search(prefix)

[2023-06-01T13:05:03.722Z]   File "/home/jenkins/agent/workspace/tests/python_client/deploy/scripts/utils.py", line 241, in load_and_search

[2023-06-01T13:05:03.722Z]     assert len(ids) == topK, f"get {len(ids)} results, but topK is {topK}"

[2023-06-01T13:05:03.722Z] AssertionError: get 4 results, but topK is 5

Expected Behavior

len(ids) == topK

Steps To Reproduce

No response

Milvus Log

failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/deploy_test_kafka_for_release_cron/detail/deploy_test_kafka_for_release_cron/993/pipeline

log:

artifacts-kafka-cluster-reinstall-993-server-first-deployment-logs.tar.gz

artifacts-kafka-cluster-reinstall-993-server-second-deployment-logs.tar.gz

artifacts-kafka-cluster-reinstall-993-pytest-logs.tar.gz

Anything else?

No response

yanliang567 commented 1 year ago

/assign @jiaoew1991 /unassign

xiaofan-luan commented 1 year ago

/assign @chyezh

chyezh commented 1 year ago

It seems that there is no data loss after reinstallation. All data has been flushed, so the problem cannot be caused by growing segments.

The problem may arise in the computational logic for some special input. I will try to reproduce it.

zhuwenxing commented 1 year ago

version: 2.2.0-20230612-ae2fe478

failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/deploy_test_kafka_for_release_cron/detail/deploy_test_kafka_for_release_cron/1044/pipeline

[2023-06-12T13:05:57.358Z] 2023-06-12 13:05:57.198 | INFO     | MainThread |utils:load_and_search:206 - collection name: task_1_IVF_FLAT

[2023-06-12T13:05:57.358Z] 2023-06-12 13:05:57.198 | INFO     | MainThread |utils:load_and_search:207 - load collection

[2023-06-12T13:05:57.358Z] 2023-06-12 13:05:57.203 | INFO     | MainThread |utils:load_and_search:211 - load time: 0.0050

[2023-06-12T13:05:57.358Z] 2023-06-12 13:05:57.216 | INFO     | MainThread |utils:load_and_search:225 - {'metric_type': 'L2', 'params': {'nprobe': 10}}

[2023-06-12T13:05:57.358Z] 2023-06-12 13:05:57.216 | INFO     | MainThread |utils:load_and_search:228 - 

[2023-06-12T13:05:57.358Z] Search...

[2023-06-12T13:05:57.358Z] 2023-06-12 13:05:57.220 | INFO     | MainThread |utils:load_and_search:239 - hit: id: 976, distance: 29.795345306396484, entity: {'count': 976, 'random_value': -15.0}

[2023-06-12T13:05:57.358Z] 2023-06-12 13:05:57.221 | INFO     | MainThread |utils:load_and_search:239 - hit: id: 766, distance: 30.546741485595703, entity: {'count': 766, 'random_value': -11.0}

[2023-06-12T13:05:57.358Z] 2023-06-12 13:05:57.221 | INFO     | MainThread |utils:load_and_search:239 - hit: id: 2403, distance: 31.58251953125, entity: {'count': 2403, 'random_value': -17.0}

[2023-06-12T13:05:57.358Z] 2023-06-12 13:05:57.221 | INFO     | MainThread |utils:load_and_search:239 - hit: id: 2486, distance: 32.51908874511719, entity: {'count': 2486, 'random_value': -12.0}

[2023-06-12T13:05:57.358Z] Traceback (most recent call last):

[2023-06-12T13:05:57.358Z]   File "scripts/action_after_reinstall.py", line 46, in <module>

[2023-06-12T13:05:57.358Z]     task_1(data_size, host)

[2023-06-12T13:05:57.358Z]   File "scripts/action_after_reinstall.py", line 14, in task_1

[2023-06-12T13:05:57.358Z]     load_and_search(prefix)

[2023-06-12T13:05:57.358Z]   File "/home/jenkins/agent/workspace/tests/python_client/deploy/scripts/utils.py", line 241, in load_and_search

[2023-06-12T13:05:57.358Z]     assert len(ids) == topK, f"get {len(ids)} results, but topK is {topK}"

[2023-06-12T13:05:57.358Z] AssertionError: get 4 results, but topK is 5

log:

artifacts-kafka-standalone-reinstall-1044-pytest-logs.tar.gz

[Uploading artifacts-kafka-standalone-reinstall-1044-server-first-deployment-logs.tar.gz…]()

artifacts-kafka-standalone-reinstall-1044-server-second-deployment-logs.tar.gz

zhuwenxing commented 1 year ago

/assign @congqixia Please take a look. The search or query result is partial.

zhuwenxing commented 1 year ago

It reproduced again with image tag 2.2.0-20230707-511173a0.

failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/deploy_test_kafka_for_release_cron/detail/deploy_test_kafka_for_release_cron/1179/pipeline

log:

artifacts-kafka-cluster-reinstall-1179-pytest-logs.tar.gz

artifacts-kafka-cluster-reinstall-1179-server-first-deployment-logs.tar.gz

artifacts-kafka-cluster-reinstall-1179-server-second-deployment-logs.tar.gz

chyezh commented 1 year ago

Setup

Debug

No segments lost here

The difference between the two search operations: the index is used after reinstalling, but not before reinstalling.

Is it possible that, with IVF_FLAT, 10 vectors were recalled from the 10 probed clusters in the IVF index, but 6 of them were then filtered out by the expr count > 500? The search vector is [1,1,1,1,....], which lies at a corner of the vector space.
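To make the hypothesis above concrete, here is a toy sketch of how an IVF-style search that probes only some clusters and applies a scalar filter could return fewer than topK hits. It is pure Python and does not reflect Milvus's actual implementation. The function name `ivf_search`, the two centroids, the sample vectors, and the `count` values are all made up for illustration. Whether Milvus applies the filter before or after candidate selection is not established in this thread.

```python
import math

def ivf_search(vectors, scalars, centroids, query, nprobe, top_k, predicate):
    """Toy IVF search (hypothetical, not Milvus code).

    Probes the nprobe clusters nearest to the query, keeps only rows whose
    scalar passes the predicate, and returns up to top_k row ids by distance.
    """
    # Assign every vector to its nearest centroid (the "inverted lists").
    clusters = {c: [] for c in range(len(centroids))}
    for row_id, vec in enumerate(vectors):
        nearest = min(range(len(centroids)),
                      key=lambda c: math.dist(vec, centroids[c]))
        clusters[nearest].append(row_id)

    # Only the nprobe clusters closest to the query are scanned.
    probed = sorted(range(len(centroids)),
                    key=lambda c: math.dist(query, centroids[c]))[:nprobe]

    # Rows failing the filter never become candidates.
    candidates = [r for c in probed for r in clusters[c] if predicate(scalars[r])]
    candidates.sort(key=lambda r: math.dist(query, vectors[r]))
    return candidates[:top_k]

# Made-up data: four rows near (0, 0), three rows near (10, 10).
centroids = [(0.0, 0.0), (10.0, 10.0)]
vectors = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10)]
counts = [100, 600, 200, 700, 800, 900, 950]
query = (1, 1)

def passes(count):
    return count > 500  # stands in for the expr "count > 500"

short = ivf_search(vectors, counts, centroids, query, 1, 3, passes)
full = ivf_search(vectors, counts, centroids, query, 2, 3, passes)
print(short)  # [3, 1]: only two rows in the probed cluster pass the filter
print(full)   # [3, 1, 4]: probing both clusters fills top_k
```

In this toy model the short result disappears once more clusters are probed, so raising nprobe would be one way to test whether this explanation applies.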

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.

yanliang567 commented 1 year ago

@zhuwenxing @chyezh any updates?

zhuwenxing commented 1 year ago

image: 2.3.0-20230918-dde27711-amd64

[2023-09-18T13:38:15.243Z] 2023-09-18 13:38:15.140 | INFO     | MainThread |utils:load_and_search:259 - ###########

[2023-09-18T13:38:15.243Z] 2023-09-18 13:38:15.143 | INFO     | MainThread |utils:load_and_search:206 - collection name: task_2_IVF_FLAT

[2023-09-18T13:38:15.243Z] 2023-09-18 13:38:15.143 | INFO     | MainThread |utils:load_and_search:207 - load collection

[2023-09-18T13:38:19.400Z] 2023-09-18 13:38:19.232 | INFO     | MainThread |utils:load_and_search:211 - load time: 4.0887

[2023-09-18T13:38:19.400Z] 2023-09-18 13:38:19.243 | INFO     | MainThread |utils:load_and_search:225 - {'metric_type': 'L2', 'params': {'nprobe': 10}}

[2023-09-18T13:38:19.400Z] 2023-09-18 13:38:19.243 | INFO     | MainThread |utils:load_and_search:228 - 

[2023-09-18T13:38:19.400Z] Search...

[2023-09-18T13:38:19.655Z] 2023-09-18 13:38:19.423 | INFO     | MainThread |utils:load_and_search:239 - hit: id: 764, distance: 30.432262420654297, entity: {'count': 764, 'random_value': -18.0}

[2023-09-18T13:38:19.655Z] 2023-09-18 13:38:19.423 | INFO     | MainThread |utils:load_and_search:239 - hit: id: 2455, distance: 31.647565841674805, entity: {'count': 2455, 'random_value': -17.0}

[2023-09-18T13:38:19.655Z] 2023-09-18 13:38:19.423 | INFO     | MainThread |utils:load_and_search:239 - hit: id: 2424, distance: 32.878353118896484, entity: {'count': 2424, 'random_value': -17.0}

[2023-09-18T13:38:19.655Z] 2023-09-18 13:38:19.423 | INFO     | MainThread |utils:load_and_search:239 - hit: id: 2737, distance: 33.31123733520508, entity: {'count': 2737, 'random_value': -14.0}

[2023-09-18T13:38:19.655Z] Traceback (most recent call last):

[2023-09-18T13:38:19.655Z]   File "scripts/action_after_reinstall.py", line 47, in <module>

[2023-09-18T13:38:19.655Z]     task_2(data_size, host)

[2023-09-18T13:38:19.655Z]   File "scripts/action_after_reinstall.py", line 33, in task_2

[2023-09-18T13:38:19.655Z]     load_and_search(prefix)

[2023-09-18T13:38:19.655Z]   File "/home/jenkins/agent/workspace/tests/python_client/deploy/scripts/utils.py", line 241, in load_and_search

[2023-09-18T13:38:19.655Z]     assert len(ids) == topK, f"get {len(ids)} results, but topK is {topK}"

[2023-09-18T13:38:19.655Z] AssertionError: get 4 results, but topK is 5

failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/deploy_test_kafka_for_release_cron/detail/deploy_test_kafka_for_release_cron/1446/pipeline

log:

artifacts-kafka-standalone-reinstall-1450-pytest-logs.tar.gz

artifacts-kafka-standalone-reinstall-1450-server-first-deployment-logs.tar.gz

artifacts-kafka-standalone-reinstall-1450-server-second-deployment-logs.tar.gz

zhuwenxing commented 1 year ago

Failed again.

failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/deploy_test_kafka_for_release_cron/detail/deploy_test_kafka_for_release_cron/1446/pipeline

log:

artifacts-kafka-standalone-reinstall-1446-pytest-logs (1).tar.gz

artifacts-kafka-standalone-reinstall-1446-server-first-deployment-logs (1).tar.gz

artifacts-kafka-standalone-reinstall-1446-server-second-deployment-logs (1).tar.gz

[2023-09-18T13:38:15.243Z] 2023-09-18 13:38:15.140 | INFO     | MainThread |utils:load_and_search:257 - query latency: 0.0047s

[2023-09-18T13:38:15.243Z] 2023-09-18 13:38:15.140 | INFO     | MainThread |utils:load_and_search:259 - ###########

[2023-09-18T13:38:15.243Z] 2023-09-18 13:38:15.143 | INFO     | MainThread |utils:load_and_search:206 - collection name: task_2_IVF_FLAT

[2023-09-18T13:38:15.243Z] 2023-09-18 13:38:15.143 | INFO     | MainThread |utils:load_and_search:207 - load collection

[2023-09-18T13:38:19.400Z] 2023-09-18 13:38:19.232 | INFO     | MainThread |utils:load_and_search:211 - load time: 4.0887

[2023-09-18T13:38:19.400Z] 2023-09-18 13:38:19.243 | INFO     | MainThread |utils:load_and_search:225 - {'metric_type': 'L2', 'params': {'nprobe': 10}}

[2023-09-18T13:38:19.400Z] 2023-09-18 13:38:19.243 | INFO     | MainThread |utils:load_and_search:228 - 

[2023-09-18T13:38:19.400Z] Search...

[2023-09-18T13:38:19.655Z] 2023-09-18 13:38:19.423 | INFO     | MainThread |utils:load_and_search:239 - hit: id: 764, distance: 30.432262420654297, entity: {'count': 764, 'random_value': -18.0}

[2023-09-18T13:38:19.655Z] 2023-09-18 13:38:19.423 | INFO     | MainThread |utils:load_and_search:239 - hit: id: 2455, distance: 31.647565841674805, entity: {'count': 2455, 'random_value': -17.0}

[2023-09-18T13:38:19.655Z] 2023-09-18 13:38:19.423 | INFO     | MainThread |utils:load_and_search:239 - hit: id: 2424, distance: 32.878353118896484, entity: {'count': 2424, 'random_value': -17.0}

[2023-09-18T13:38:19.655Z] 2023-09-18 13:38:19.423 | INFO     | MainThread |utils:load_and_search:239 - hit: id: 2737, distance: 33.31123733520508, entity: {'count': 2737, 'random_value': -14.0}

[2023-09-18T13:38:19.655Z] Traceback (most recent call last):

[2023-09-18T13:38:19.655Z]   File "scripts/action_after_reinstall.py", line 47, in <module>

[2023-09-18T13:38:19.655Z]     task_2(data_size, host)

[2023-09-18T13:38:19.655Z]   File "scripts/action_after_reinstall.py", line 33, in task_2

[2023-09-18T13:38:19.655Z]     load_and_search(prefix)

[2023-09-18T13:38:19.655Z]   File "/home/jenkins/agent/workspace/tests/python_client/deploy/scripts/utils.py", line 241, in load_and_search

[2023-09-18T13:38:19.655Z]     assert len(ids) == topK, f"get {len(ids)} results, but topK is {topK}"

[2023-09-18T13:38:19.655Z] AssertionError: get 4 results, but topK is 5

chyezh commented 1 year ago

I have reproduced the same problem with rocksmq in a no-chaos environment.

In these test cases, 3000 new vectors are always inserted after reinstallation with the same primary keys (field count) as the existing vectors.
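The key collision described here can be sketched in a few lines. This is a hypothetical stand-in for the test's data generator, not the actual code in `utils.py`: `make_rows` and `duplicated_keys` are invented names, and the assumption that the first round also inserted 3000 rows is mine. It shows a second insert round that reuses the same `count` range next to one that continues from an offset.

```python
from collections import Counter

def make_rows(n, start=0):
    """Build n rows whose primary key `count` runs from start to start + n - 1."""
    return [{"count": pk} for pk in range(start, start + n)]

def duplicated_keys(rows):
    """Return the sorted primary keys that appear more than once."""
    seen = Counter(row["count"] for row in rows)
    return sorted(pk for pk, times in seen.items() if times > 1)

# Assumption: the first deployment also inserted 3000 rows with keys 0..2999.
first_round = make_rows(3000)

# Second round reuses keys 0..2999, so every key collides.
colliding = first_round + make_rows(3000)

# Second round continues at 3000, so no key collides.
disjoint = first_round + make_rows(3000, start=3000)

print(len(duplicated_keys(colliding)))  # 3000
print(len(duplicated_keys(disjoint)))   # 0
```

Starting each insert round at the current row count is one simple way to keep keys unique; an auto-generated primary key would be another.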

At search time there is one segment. Some vectors with the same primary key in the IVF index were returned from this segment and were deduplicated at reduce time. This is the expected behavior under the current Milvus implementation, not a bug. Please modify the test case to avoid duplicate primary keys in these tests.
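The effect of deduplicating by primary key at reduce time can be illustrated with a small sketch. This is a simplified model, not Milvus's reduce code: `reduce_hits` is a hypothetical helper, and the keys and distances are made up. A segment returns topK = 5 hits, two of which share a primary key, so only 4 remain after deduplication.

```python
def reduce_hits(hits, top_k):
    """Keep the closest hit per primary key, then return at most top_k hits.

    `hits` is a list of (primary_key, distance) pairs. Simplified model of a
    reduce step, not the actual Milvus implementation.
    """
    best = {}
    for pk, distance in hits:
        if pk not in best or distance < best[pk]:
            best[pk] = distance
    # Sort the surviving hits by ascending distance (L2: smaller is closer).
    return sorted(best.items(), key=lambda item: item[1])[:top_k]

# Made-up segment result: primary key 2 appears twice among the 5 hits.
segment_hits = [(1, 0.10), (2, 0.20), (2, 0.25), (3, 0.30), (4, 0.40)]

reduced = reduce_hits(segment_hits, top_k=5)
print(reduced)       # [(1, 0.1), (2, 0.2), (3, 0.3), (4, 0.4)]
print(len(reduced))  # 4, although top_k is 5
```

Because the segment only contributes topK candidates, each duplicate it returns removes one slot from the final result. That is consistent with the "get 4 results, but topK is 5" assertion in the logs.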

/assign @zhuwenxing /unassign