milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0
30.36k stars 2.91k forks source link

[Bug]: Search and query latency is large after upgrade v2.2.3 to 2.2.0-20230317-bbc21fe8 #22830

Closed zhuwenxing closed 1 year ago

zhuwenxing commented 1 year ago

Is there an existing issue for this?

Environment

- Milvus version:2.2.0-20230317-bbc21fe8
- Deployment mode(standalone or cluster):cluster
- MQ type(rocksmq, pulsar or kafka):kafka    
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

image

[2023-03-17T08:20:04.593Z] 2023-03-17 08:20:00.135 | INFO     | MainThread |utils:load_and_search:246 - ###########

[2023-03-17T08:20:04.593Z] 2023-03-17 08:20:00.144 | INFO     | MainThread |utils:load_and_search:195 - collection name: task_3_IVF_FLAT

[2023-03-17T08:20:04.593Z] 2023-03-17 08:20:00.144 | INFO     | MainThread |utils:load_and_search:196 - load collection

[2023-03-17T08:20:04.593Z] 2023-03-17 08:20:00.148 | INFO     | MainThread |utils:load_and_search:200 - load time: 0.0044

[2023-03-17T08:20:04.593Z] 2023-03-17 08:20:00.165 | INFO     | MainThread |utils:load_and_search:214 - {'metric_type': 'L2', 'params': {'nprobe': 10}}

[2023-03-17T08:20:04.593Z] 2023-03-17 08:20:00.165 | INFO     | MainThread |utils:load_and_search:217 - 

[2023-03-17T08:20:04.593Z] Search...

[2023-03-17T08:20:04.593Z] 2023-03-17 08:20:00.174 | INFO     | MainThread |utils:load_and_search:229 - ['(distance: 32.55498504638672, id: 2523)', '(distance: 32.64250946044922, id: 2168)', '(distance: 33.395904541015625, id: 1586)', '(distance: 33.427040100097656, id: 2810)', '(distance: 33.531219482421875, id: 984)']

[2023-03-17T08:20:04.593Z] 2023-03-17 08:20:00.175 | INFO     | MainThread |utils:load_and_search:229 - ['(distance: 32.55498504638672, id: 2523)', '(distance: 32.64250946044922, id: 2168)', '(distance: 33.395904541015625, id: 1586)', '(distance: 33.427040100097656, id: 2810)', '(distance: 33.531219482421875, id: 984)']

[2023-03-17T08:20:04.593Z] 2023-03-17 08:20:00.175 | INFO     | MainThread |utils:load_and_search:229 - ['(distance: 32.55498504638672, id: 2523)', '(distance: 32.64250946044922, id: 2168)', '(distance: 33.395904541015625, id: 1586)', '(distance: 33.427040100097656, id: 2810)', '(distance: 33.531219482421875, id: 984)']

[2023-03-17T08:20:04.593Z] 2023-03-17 08:20:00.175 | INFO     | MainThread |utils:load_and_search:229 - ['(distance: 32.55498504638672, id: 2523)', '(distance: 32.64250946044922, id: 2168)', '(distance: 33.395904541015625, id: 1586)', '(distance: 33.427040100097656, id: 2810)', '(distance: 33.531219482421875, id: 984)']

[2023-03-17T08:20:04.593Z] 2023-03-17 08:20:00.175 | INFO     | MainThread |utils:load_and_search:229 - ['(distance: 32.55498504638672, id: 2523)', '(distance: 32.64250946044922, id: 2168)', '(distance: 33.395904541015625, id: 1586)', '(distance: 33.427040100097656, id: 2810)', '(distance: 33.531219482421875, id: 984)']

[2023-03-17T08:20:04.593Z] 2023-03-17 08:20:00.176 | INFO     | MainThread |utils:load_and_search:232 - [2523, 2168, 1586, 2810, 984]

[2023-03-17T08:20:04.593Z] 2023-03-17 08:20:00.176 | INFO     | MainThread |utils:load_and_search:234 - search latency: 0.0088s

[2023-03-17T08:20:04.593Z] 2023-03-17 08:20:00.181 | INFO     | MainThread |utils:load_and_search:241 - {'count': 2, 'random_value': -17.0}

[2023-03-17T08:20:04.593Z] 2023-03-17 08:20:00.181 | INFO     | MainThread |utils:load_and_search:241 - {'count': 4, 'random_value': -11.0}

[2023-03-17T08:20:04.593Z] 2023-03-17 08:20:00.181 | INFO     | MainThread |utils:load_and_search:241 - {'count': 6, 'random_value': -14.0}

[2023-03-17T08:20:04.593Z] 2023-03-17 08:20:00.181 | INFO     | MainThread |utils:load_and_search:241 - {'count': 8, 'random_value': -13.0}

[2023-03-17T08:20:04.593Z] 2023-03-17 08:20:00.181 | INFO     | MainThread |utils:load_and_search:244 - query latency: 0.0054s

[2023-03-17T08:20:04.593Z] 2023-03-17 08:20:00.181 | INFO     | MainThread |utils:load_and_search:246 - ###########

[2023-03-17T08:20:04.593Z] 2023-03-17 08:20:00.186 | INFO     | MainThread |utils:load_and_search:195 - collection name: task_3_ANNOY

[2023-03-17T08:20:04.593Z] 2023-03-17 08:20:00.186 | INFO     | MainThread |utils:load_and_search:196 - load collection

[2023-03-17T08:20:04.593Z] 2023-03-17 08:20:00.191 | INFO     | MainThread |utils:load_and_search:200 - load time: 0.0046

[2023-03-17T08:20:04.594Z] 2023-03-17 08:20:00.214 | INFO     | MainThread |utils:load_and_search:214 - {'metric_type': 'L2', 'params': {'search_k': 1000}}

[2023-03-17T08:20:04.594Z] 2023-03-17 08:20:00.214 | INFO     | MainThread |utils:load_and_search:217 - 

[2023-03-17T08:20:04.594Z] Search...

[2023-03-17T08:21:40.961Z] 2023-03-17 08:21:29.411 | INFO     | MainThread |utils:load_and_search:229 - ['(distance: 5.619114398956299, id: 2266)', '(distance: 5.627915859222412, id: 1309)', '(distance: 5.637307167053223, id: 1179)', '(distance: 5.658819198608398, id: 2299)', '(distance: 5.674744129180908, id: 2642)']

[2023-03-17T08:21:40.961Z] 2023-03-17 08:21:29.412 | INFO     | MainThread |utils:load_and_search:229 - ['(distance: 5.619114398956299, id: 2266)', '(distance: 5.627915859222412, id: 1309)', '(distance: 5.637307167053223, id: 1179)', '(distance: 5.658819198608398, id: 2299)', '(distance: 5.674744129180908, id: 2642)']

[2023-03-17T08:21:40.961Z] 2023-03-17 08:21:29.412 | INFO     | MainThread |utils:load_and_search:229 - ['(distance: 5.619114398956299, id: 2266)', '(distance: 5.627915859222412, id: 1309)', '(distance: 5.637307167053223, id: 1179)', '(distance: 5.658819198608398, id: 2299)', '(distance: 5.674744129180908, id: 2642)']

[2023-03-17T08:21:40.961Z] 2023-03-17 08:21:29.412 | INFO     | MainThread |utils:load_and_search:229 - ['(distance: 5.619114398956299, id: 2266)', '(distance: 5.627915859222412, id: 1309)', '(distance: 5.637307167053223, id: 1179)', '(distance: 5.658819198608398, id: 2299)', '(distance: 5.674744129180908, id: 2642)']

[2023-03-17T08:21:40.961Z] 2023-03-17 08:21:29.413 | INFO     | MainThread |utils:load_and_search:229 - ['(distance: 5.619114398956299, id: 2266)', '(distance: 5.627915859222412, id: 1309)', '(distance: 5.637307167053223, id: 1179)', '(distance: 5.658819198608398, id: 2299)', '(distance: 5.674744129180908, id: 2642)']

[2023-03-17T08:21:40.961Z] 2023-03-17 08:21:29.413 | INFO     | MainThread |utils:load_and_search:232 - [2266, 1309, 1179, 2299, 2642]

[2023-03-17T08:21:40.961Z] 2023-03-17 08:21:29.413 | INFO     | MainThread |utils:load_and_search:234 - search latency: 89.1968s

[2023-03-17T08:23:32.362Z] RPC error: [query], <MilvusException: (code=1, message=rpc deadline exceeded: Retry timeout: 120s)>, <Time:{'RPC start': '2023-03-17 08:21:29.413533', 'RPC error': '2023-03-17 08:23:29.415039'}>

[2023-03-17T08:23:32.362Z] Traceback (most recent call last):

[2023-03-17T08:23:32.362Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 50, in handler

[2023-03-17T08:23:32.362Z]     return func(self, *args, **kwargs)

[2023-03-17T08:23:32.362Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 945, in query

[2023-03-17T08:23:32.362Z]     response = future.result()

[2023-03-17T08:23:32.362Z]   File "/usr/local/lib/python3.8/dist-packages/grpc/_channel.py", line 744, in result

[2023-03-17T08:23:32.362Z]     raise self

[2023-03-17T08:23:32.362Z] grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:

[2023-03-17T08:23:32.362Z]  status = StatusCode.DEADLINE_EXCEEDED

[2023-03-17T08:23:32.362Z]  details = "Deadline Exceeded"

[2023-03-17T08:23:32.362Z]  debug_error_string = "{"created":"@1679041409.413984704","description":"Deadline Exceeded","file":"src/core/ext/filters/deadline/deadline_filter.cc","file_line":81,"grpc_status":4}"

[2023-03-17T08:23:32.362Z] >

[2023-03-17T08:23:32.362Z] 

[2023-03-17T08:23:32.362Z] The above exception was the direct cause of the following exception:

[2023-03-17T08:23:32.362Z] 

[2023-03-17T08:23:32.362Z] Traceback (most recent call last):

[2023-03-17T08:23:32.362Z]   File "scripts/action_after_upgrade.py", line 111, in <module>

[2023-03-17T08:23:32.362Z]     task_3(data_size, host)

[2023-03-17T08:23:32.362Z]   File "scripts/action_after_upgrade.py", line 54, in task_3

[2023-03-17T08:23:32.362Z]     load_and_search(prefix)

[2023-03-17T08:23:32.362Z]   File "/home/jenkins/agent/workspace/tests/python_client/deploy/scripts/utils.py", line 238, in load_and_search

[2023-03-17T08:23:32.362Z]     res = c.query(expr, output_fields, timeout=120)

[2023-03-17T08:23:32.362Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/orm/collection.py", line 744, in query

[2023-03-17T08:23:32.362Z]     res = conn.query(self._name, expr, output_fields, partition_names,

[2023-03-17T08:23:32.362Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 109, in handler

[2023-03-17T08:23:32.362Z]     raise e

[2023-03-17T08:23:32.362Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 105, in handler

[2023-03-17T08:23:32.362Z]     return func(*args, **kwargs)

[2023-03-17T08:23:32.362Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 136, in handler

[2023-03-17T08:23:32.362Z]     ret = func(self, *args, **kwargs)

[2023-03-17T08:23:32.362Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 64, in handler

[2023-03-17T08:23:32.362Z]     raise MilvusException(message=f"rpc deadline exceeded: {timeout_msg}") from e

[2023-03-17T08:23:32.362Z] pymilvus.exceptions.MilvusException: <MilvusException: (code=1, message=rpc deadline exceeded: Retry timeout: 120s)>

Expected Behavior

No response

Steps To Reproduce

No response

Milvus Log

milvus mode: cluster deploy task: upgrade old image tag: v2.2.3 new image tag: 2.2.0-20230317-bbc21fe8 failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/deploy_test_kafka_for_release_cron/detail/deploy_test_kafka_for_release_cron/589/pipeline log: artifacts-kafka-cluster-upgrade-589-server-first-deployment-logs.tar.gz artifacts-kafka-cluster-upgrade-589-server-second-deployment-logs.tar.gz artifacts-kafka-cluster-upgrade-589-pytest-logs.tar.gz

Anything else?

No response

zhuwenxing commented 1 year ago

It is not a stable issue. The success one: image

xiaofan-luan commented 1 year ago

/assign @longjiquan

yanliang567 commented 1 year ago

/unassign

longjiquan commented 1 year ago

image

This log indicates that a query on sealed segment times out.

longjiquan commented 1 year ago

image

1, The query on growing segment was done; 2, The query on sealed segment entered the queue;

longjiquan commented 1 year ago

image

Line 16772 is the place where the query on growing segment happened. After 16772, there is no more log which is related to both tsafe and this collection.

longjiquan commented 1 year ago

So I guess the timetick on the delta channel was not updated in time.

longjiquan commented 1 year ago

image

TimeTick in datanode lags a lot.

longjiquan commented 1 year ago

image

However, the metrics in rootcoord, the producer of mq, is normal.

longjiquan commented 1 year ago

So I guess this issue is just caused by the unstable network issues. Slow network caused datanodes, the consumer of mq, consumed the message slowly.

zhuwenxing commented 1 year ago

The search latency is expected when the new image tag is 2.2.0-20230321-bc8fca80

[2023-03-21T12:56:27.470Z] 2023-03-21 12:56:26.840 | INFO     | MainThread |utils:load_and_search:234 - search latency: 0.0079s
[2023-03-21T12:56:27.471Z] 2023-03-21 12:56:26.884 | INFO     | MainThread |utils:load_and_search:234 - search latency: 0.0069s
[2023-03-21T12:56:27.471Z] 2023-03-21 12:56:26.922 | INFO     | MainThread |utils:load_and_search:234 - search latency: 0.0072s
[2023-03-21T12:56:27.471Z] 2023-03-21 12:56:26.962 | INFO     | MainThread |utils:load_and_search:234 - search latency: 0.0082s
[2023-03-21T12:56:27.472Z] 2023-03-21 12:56:27.004 | INFO     | MainThread |utils:load_and_search:234 - search latency: 0.0077s
[2023-03-21T12:56:27.472Z] 2023-03-21 12:56:27.047 | INFO     | MainThread |utils:load_and_search:234 - search latency: 0.0073s
[2023-03-21T12:57:10.545Z] 2023-03-21 12:57:09.922 | INFO     | MainThread |utils:load_and_search:234 - search latency: 0.0092s
[2023-03-21T12:57:15.164Z] 2023-03-21 12:57:14.875 | INFO     | MainThread |utils:load_and_search:234 - search latency: 0.0146s
[2023-03-21T12:57:19.574Z] 2023-03-21 12:57:18.912 | INFO     | MainThread |utils:load_and_search:234 - search latency: 0.0081s
[2023-03-21T12:57:21.712Z] 2023-03-21 12:57:21.473 | INFO     | MainThread |utils:load_and_search:234 - search latency: 0.0079s
[2023-03-21T12:57:26.469Z] 2023-03-21 12:57:25.974 | INFO     | MainThread |utils:load_and_search:234 - search latency: 0.0104s
[2023-03-21T12:57:29.733Z] 2023-03-21 12:57:29.551 | INFO     | MainThread |utils:load_and_search:234 - search latency: 0.0091s
[2023-03-21T12:57:29.989Z] 2023-03-21 12:57:29.950 | INFO     | MainThread |utils:load_and_search:234 - search latency: 0.0083s
[2023-03-21T12:57:30.243Z] 2023-03-21 12:57:30.008 | INFO     | MainThread |utils:load_and_search:234 - search latency: 0.0100s
[2023-03-21T12:57:30.244Z] 2023-03-21 12:57:30.051 | INFO     | MainThread |utils:load_and_search:234 - search latency: 0.0080s
[2023-03-21T12:57:30.244Z] 2023-03-21 12:57:30.089 | INFO     | MainThread |utils:load_and_search:234 - search latency: 0.0071s
[2023-03-21T12:57:30.244Z] 2023-03-21 12:57:30.134 | INFO     | MainThread |utils:load_and_search:234 - search latency: 0.0099s
[2023-03-21T12:57:30.245Z] 2023-03-21 12:57:30.177 | INFO     | MainThread |utils:load_and_search:234 - search latency: 0.0081s
[2023-03-21T12:58:17.811Z] 2023-03-21 12:58:16.617 | INFO     | MainThread |utils:load_and_search:234 - search latency: 0.0081s
[2023-03-21T12:58:21.070Z] 2023-03-21 12:58:20.725 | INFO     | MainThread |utils:load_and_search:234 - search latency: 0.0121s
[2023-03-21T12:58:24.334Z] 2023-03-21 12:58:23.827 | INFO     | MainThread |utils:load_and_search:234 - search latency: 0.0098s
[2023-03-21T12:58:28.961Z] 2023-03-21 12:58:28.755 | INFO     | MainThread |utils:load_and_search:234 - search latency: 0.0086s
[2023-03-21T12:58:32.221Z] 2023-03-21 12:58:31.850 | INFO     | MainThread |utils:load_and_search:234 - search latency: 0.0108s
[2023-03-21T12:58:35.486Z] 2023-03-21 12:58:34.925 | INFO     | MainThread |utils:load_and_search:234 - search latency: 0.0114s
[2023-03-21T12:58:35.487Z] 2023-03-21 12:58:35.163 | INFO     | MainThread |utils:load_and_search:234 - search latency: 0.0350s
[2023-03-21T12:58:35.487Z] 2023-03-21 12:58:35.219 | INFO     | MainThread |utils:load_and_search:234 - search latency: 0.0097s
[2023-03-21T12:58:35.487Z] 2023-03-21 12:58:35.289 | INFO     | MainThread |utils:load_and_search:234 - search latency: 0.0121s
[2023-03-21T12:58:37.995Z] 2023-03-21 12:58:37.733 | INFO     | MainThread |utils:load_and_search:234 - search latency: 0.0088s
[2023-03-21T12:58:37.996Z] 2023-03-21 12:58:37.805 | INFO     | MainThread |utils:load_and_search:234 - search latency: 0.0073s
[2023-03-21T12:58:37.996Z] 2023-03-21 12:58:37.848 | INFO     | MainThread |utils:load_and_search:234 - search latency: 0.0092s
[2023-03-21T12:59:24.314Z] 2023-03-21 12:59:23.642 | INFO     | MainThread |utils:load_and_search:234 - search latency: 0.0079s
[2023-03-21T12:59:26.823Z] 2023-03-21 12:59:26.810 | INFO     | MainThread |utils:load_and_search:234 - search latency: 0.0084s
[2023-03-21T12:59:32.869Z] 2023-03-21 12:59:32.642 | INFO     | MainThread |utils:load_and_search:234 - search latency: 0.0100s
[2023-03-21T12:59:36.130Z] 2023-03-21 12:59:35.773 | INFO     | MainThread |utils:load_and_search:234 - search latency: 0.0145s
[2023-03-21T12:59:39.389Z] 2023-03-21 12:59:38.874 | INFO     | MainThread |utils:load_and_search:234 - search latency: 0.0082s
[2023-03-21T12:59:44.476Z] 2023-03-21 12:59:44.216 | INFO     | MainThread |utils:load_and_search:234 - search latency: 0.0109s
[2023-03-21T12:59:47.733Z] 2023-03-21 12:59:47.555 | INFO     | MainThread |utils:load_and_search:234 - search latency: 0.0087s
[2023-03-21T12:59:50.989Z] 2023-03-21 12:59:50.752 | INFO     | MainThread |utils:load_and_search:234 - search latency: 0.0090s
[2023-03-21T12:59:54.246Z] 2023-03-21 12:59:53.893 | INFO     | MainThread |utils:load_and_search:234 - search latency: 0.0085s
[2023-03-21T12:59:56.757Z] 2023-03-21 12:59:56.476 | INFO     | MainThread |utils:load_and_search:234 - search latency: 0.0094s
[2023-03-21T13:00:00.914Z] 2023-03-21 13:00:00.584 | INFO     | MainThread |utils:load_and_search:234 - search latency: 0.0087s
[2023-03-21T13:00:04.175Z] 2023-03-21 13:00:03.676 | INFO     | MainThread |utils:load_and_search:234 - search latency: 0.0079s
[2023-03-21T13:00:24.215Z] 2023-03-21 13:00:23.707 | INFO     | MainThread |utils:load_and_search:234 - search latency: 0.0079s
[2023-03-21T13:00:29.359Z] 2023-03-21 13:00:28.810 | INFO     | MainThread |utils:load_and_search:234 - search latency: 0.0069s
[2023-03-21T13:00:41.794Z] 2023-03-21 13:00:41.600 | INFO     | MainThread |utils:load_and_search:234 - search latency: 0.0082s
[2023-03-21T13:00:50.214Z] 2023-03-21 13:00:49.703 | INFO     | MainThread |utils:load_and_search:234 - search latency: 0.0078s
[2023-03-21T13:00:58.619Z] 2023-03-21 13:00:57.809 | INFO     | MainThread |utils:load_and_search:234 - search latency: 0.0078s
[2023-03-21T13:01:04.143Z] 2023-03-21 13:01:03.709 | INFO     | MainThread |utils:load_and_search:234 - search latency: 0.0099s
[2023-03-21T13:01:17.471Z] 2023-03-21 13:01:16.695 | INFO     | MainThread |utils:load_and_search:234 - search latency: 0.0063s
[2023-03-21T13:01:20.012Z] 2023-03-21 13:01:19.769 | INFO     | MainThread |utils:load_and_search:234 - search latency: 0.0072s
[2023-03-21T13:01:23.272Z] 2023-03-21 13:01:22.861 | INFO     | MainThread |utils:load_and_search:234 - search latency: 0.0071s
[2023-03-21T13:01:25.781Z] 2023-03-21 13:01:25.443 | INFO     | MainThread |utils:load_and_search:234 - search latency: 0.0067s
[2023-03-21T13:01:29.038Z] 2023-03-21 13:01:28.511 | INFO     | MainThread |utils:load_and_search:234 - search latency: 0.0069s
[2023-03-21T13:01:32.294Z] 2023-03-21 13:01:31.592 | INFO     | MainThread |utils:load_and_search:234 - search latency: 0.0077s
[2023-03-21T13:01:58.114Z] 2023-03-21 13:01:57.601 | INFO     | MainThread |utils:load_and_search:234 - search latency: 0.0076s
[2023-03-21T13:02:03.301Z] 2023-03-21 13:02:02.682 | INFO     | MainThread |utils:load_and_search:234 - search latency: 0.0080s
[2023-03-21T13:02:08.489Z] 2023-03-21 13:02:08.447 | INFO     | MainThread |utils:load_and_search:234 - search latency: 0.0072s
[2023-03-21T13:02:14.373Z] 2023-03-21 13:02:13.731 | INFO     | MainThread |utils:load_and_search:234 - search latency: 0.0067s
[2023-03-21T13:02:20.138Z] 2023-03-21 13:02:19.621 | INFO     | MainThread |utils:load_and_search:234 - search latency: 0.0066s
[2023-03-21T13:02:25.283Z] 2023-03-21 13:02:24.709 | INFO     | MainThread |utils:load_and_search:234 - search latency: 0.0075s

search.log

job name: https://qa-jenkins.milvus.io/blue/organizations/jenkins/deploy_test_kafka_for_release_cron/detail/deploy_test_kafka_for_release_cron/610/pipeline/338