milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0

[Bug]: Query results length is not as expected after upgrading #20510

Closed zhuwenxing closed 1 year ago

zhuwenxing commented 1 year ago

Is there an existing issue for this?

Environment

- Milvus version: 2.1.4 --> master-20221110-174310a1
- Deployment mode(standalone or cluster): cluster
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

The query code:

        # `c` is a pymilvus Collection that has already been loaded; `time` is imported earlier in the script.
        t0 = time.time()
        expr = "count in [2,4,6,8]"
        output_fields = ["count", "random_value"]
        res = c.query(expr, output_fields, timeout=120)
        sorted_res = sorted(res, key=lambda k: k['count'])
        for r in sorted_res:
            print(r)
        t1 = time.time()
        assert len(res) == 4

The query result length should be 4, but the actual result contains only 2 entries:

[2022-11-10T09:43:09.896Z] collection name: task_5_IVF_PQ
[2022-11-10T09:43:09.896Z] load collection
[2022-11-10T09:43:09.896Z] release collection before load if replicas > 1
[2022-11-10T09:43:09.896Z] release time: 2.4116
[2022-11-10T09:43:09.896Z] load time: 3.5330
[2022-11-10T09:43:09.896Z] Replica groups:
[2022-11-10T09:43:09.896Z] - Group: <group_id:437275234056798363>, <group_nodes:(22, 23)>, <shards:[Shard: <channel_name:by-dev-rootcoord-dml_130_437275061354758145v0>, <shard_leader:22>, <shard_nodes:[22, 22, 22, 22, 22]>, Shard: <channel_name:by-dev-rootcoord-dml_131_437275061354758145v1>, <shard_leader:23>, <shard_nodes:[23, 22, 22, 22, 22]>]>
[2022-11-10T09:43:09.896Z] - Group: <group_id:437275234056798364>, <group_nodes:(21,)>, <shards:[Shard: <channel_name:by-dev-rootcoord-dml_130_437275061354758145v0>, <shard_leader:21>, <shard_nodes:[21, 21, 21, 21, 21]>, Shard: <channel_name:by-dev-rootcoord-dml_131_437275061354758145v1>, <shard_leader:21>, <shard_nodes:[21, 21, 21, 21, 21]>]>
[2022-11-10T09:43:09.896Z] {'metric_type': 'L2', 'params': {'nprobe': 10}}
[2022-11-10T09:43:09.896Z] 
[2022-11-10T09:43:09.896Z] Search...
[2022-11-10T09:43:09.896Z] (distance: 30.237220764160156, id: 851) -20.0
[2022-11-10T09:43:09.896Z] (distance: 30.29398536682129, id: 2753) -15.0
[2022-11-10T09:43:09.896Z] (distance: 30.85041618347168, id: 816) -17.0
[2022-11-10T09:43:09.896Z] (distance: 31.285930633544922, id: 1850) -20.0
[2022-11-10T09:43:09.896Z] (distance: 31.405452728271484, id: 595) -14.0
[2022-11-10T09:43:09.896Z] [851, 2753, 816, 1850, 595]
[2022-11-10T09:43:09.896Z] search latency: 0.0098s
[2022-11-10T09:43:09.896Z] {'random_value': -14.0, 'count': 2}
[2022-11-10T09:43:09.896Z] {'random_value': -12.0, 'count': 4}
[2022-11-10T09:43:09.896Z] Traceback (most recent call last):
[2022-11-10T09:43:09.896Z]   File "scripts/action_after_upgrade.py", line 112, in <module>
[2022-11-10T09:43:09.896Z]     task_5(data_size, host)
[2022-11-10T09:43:09.896Z]   File "scripts/action_after_upgrade.py", line 89, in task_5
[2022-11-10T09:43:09.896Z]     load_and_search(prefix, replicas=NUM_REPLICAS)
[2022-11-10T09:43:09.896Z]   File "/home/jenkins/agent/workspace/tests/python_client/deploy/scripts/utils.py", line 217, in load_and_search
[2022-11-10T09:43:09.896Z]     assert len(res) == 4
[2022-11-10T09:43:09.896Z] AssertionError
script returned exit code 1

Expected Behavior

[2022-11-10T09:43:09.896Z] {'count': 2, 'random_value': -18.0}
[2022-11-10T09:43:09.896Z] {'count': 4, 'random_value': -17.0}
[2022-11-10T09:43:09.896Z] {'count': 6, 'random_value': -11.0}
[2022-11-10T09:43:09.896Z] {'count': 8, 'random_value': -14.0}
[2022-11-10T09:43:09.896Z] query latency: 0.0111s

Steps To Reproduce

No response

Milvus Log


failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/deploy_test/detail/deploy_test/623/pipeline/307
logs: artifacts-pulsar-cluster-upgrade-623-server-logs.tar.gz, artifacts-pulsar-cluster-upgrade-623-pytest-logs.tar.gz

Anything else?

No response

zhuwenxing commented 1 year ago

[screenshot of the failed pipeline step]

It failed at the step in the red block

yanliang567 commented 1 year ago

/assign @soothing-rain /unassign

xiaofan-luan commented 1 year ago

Might be related to #20534.

soothing-rain commented 1 year ago

/assign @MrPresent-Han

sre-ci-robot commented 1 year ago

@soothing-rain: GitHub didn't allow me to assign the following users: MrPresent-Han.

Note that only milvus-io members, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time. For more information please see the contributor guide

In response to [this](https://github.com/milvus-io/milvus/issues/20510#issuecomment-1314714192):

> /assign @MrPresent-Han

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

xiaocai2333 commented 1 year ago

MqilcfQzPd

zhuwenxing commented 1 year ago

> MqilcfQzPd

What does this mean?

xiaocai2333 commented 1 year ago

> MqilcfQzPd
>
> What does this mean?

clue, debugging...

czs007 commented 1 year ago

@zhuwenxing

Please try again, it may have been fixed on the master branch.

zhuwenxing commented 1 year ago

> @zhuwenxing
>
> Please try again, it may have been fixed on the master branch.

Please link the fix PR!

zhuwenxing commented 1 year ago

> @zhuwenxing
>
> Please try again, it may have been fixed on the master branch.

It is not a stably reproducible issue, so it is hard to verify.

MrPresent-Han commented 1 year ago

This issue may be due to the time gap between the query operation sent to the query node and the load segments operation on that node. See the screenshots below: the query request arrived at the node at 09:43:07.202, while the replica on this node only completed loading at 09:43:07.733. As a result, the query request saw an empty segment list, as shown in the screenshot.

[Screenshots: 2022-11-21 15:54:38 and 2022-11-21 15:55:49]

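For reference, a minimal client-side sketch of how a test could guard against this kind of race by waiting for loading to report complete before querying. This is only an illustration (the endpoint and replica count are assumptions, and the actual defect later turned out to be server-side), not the test's real code:

```python
from pymilvus import connections, utility, Collection

# Illustrative connection parameters, not taken from the failing job.
connections.connect(host="127.0.0.1", port="19530")

collection_name = "task_5_IVF_PQ"
c = Collection(collection_name)
c.load(replica_number=2)

# Block until all segments report loaded before issuing the query.
utility.wait_for_loading_complete(collection_name, timeout=120)

res = c.query("count in [2,4,6,8]", ["count", "random_value"], timeout=120)
assert len(res) == 4
```
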
MrPresent-Han commented 1 year ago

The reason the earlier search request got a correct, complete result is that when the search request arrived at query node-21 at 09:43:07.190, segments 2546 and 4728 inside replica *8364 had already been fully loaded at 09:43:05.701 and 09:43:06.008. According to the roundRobinPolicy running on the proxy, the subsequent query was routed to replica_8363, which hit the delayed-loading case described above.

[Screenshots: 2022-11-21 16:26:26 and 2022-11-21 16:27:10]

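For illustration only, a toy Python model of the round-robin routing described above (the real proxy logic lives in Milvus's Go code; names here are hypothetical). It shows why the search and the following query can land on different replica groups:

```python
from itertools import cycle

class RoundRobinPolicy:
    """Toy model of proxy-side replica selection: each request is routed
    to the next replica group in turn."""

    def __init__(self, replica_groups):
        self._iter = cycle(replica_groups)

    def pick(self):
        return next(self._iter)

# Two replica groups, as in the log above (ids shortened for readability).
policy = RoundRobinPolicy(["replica_8364", "replica_8363"])

print(policy.pick())  # search -> replica_8364 (segments already loaded)
print(policy.pick())  # query  -> replica_8363 (still loading -> empty result)
```
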
yanliang567 commented 1 year ago

@MrPresent-Han if loading is not completed, how do search requests return success instead of returning an error?

MrPresent-Han commented 1 year ago

> @MrPresent-Han if loading is not completed, how do search requests return success instead of returning an error?

This issue is related to replicas, query, balance, and load/reduce, so it is relatively complicated. I will upload a document illustrating this problem in more detail later.

MrPresent-Han commented 1 year ago

This is an error caused by improper removal of segments triggered by the leader observer on querycoord, which did not take the collection's next target into account. This defect has been fixed by commit eaa5cfdcb5e9461d779d21b13d20e21389e26d2 on the master branch. I have also added some logs to trace load and remove actions on querycoord to make debugging easier.
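As a rough sketch of the idea behind the fix described above (a hypothetical Python illustration, not the actual querycoord code): a segment should only be released when it appears in neither the current target nor the next target.

```python
def segments_to_release(leader_view, current_target, next_target):
    """Toy model of the observer's release decision: keep any segment that is
    still needed by either the current or the next target. The reported defect
    was effectively ignoring next_target in this check."""
    needed = set(current_target) | set(next_target)
    return [seg for seg in leader_view if seg not in needed]

# Example with the segment ids from the comment above: segment 4728 is still
# in the next target, so it must not be removed.
print(segments_to_release(leader_view=[2546, 4728],
                          current_target=[2546],
                          next_target=[2546, 4728]))  # -> []
```
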

zhuwenxing commented 1 year ago

Not reproduced