milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0

[Bug]: The first 4 search requests respond very slowly after reinstalling Milvus after 2+ days #14690

Closed — yanliang567 closed this 2 years ago

yanliang567 commented 2 years ago

Is there an existing issue for this?

Environment

- Milvus version:master-20211225-0b77c44
- Deployment mode(standalone or cluster): cluster
- SDK version(e.g. pymilvus v2.0.0rc2): rc9

mil-multi-0b77c44-milvus-datacoord-5fd6b48774-7hwmm             1/1     Running     0          51m
mil-multi-0b77c44-milvus-datanode-6cf7fb599c-j859z              1/1     Running     0          51m
mil-multi-0b77c44-milvus-indexcoord-6ddcf67955-dvtcq            1/1     Running     0          51m
mil-multi-0b77c44-milvus-indexnode-7bd4fcfcb-qmq6z              1/1     Running     0          51m
mil-multi-0b77c44-milvus-proxy-7677ff8947-9kv56                 1/1     Running     0          51m
mil-multi-0b77c44-milvus-querycoord-bc767b9d4-mhfpl             1/1     Running     0          51m
mil-multi-0b77c44-milvus-querynode-97d687cf4-jcjbx              1/1     Running     0          51m
mil-multi-0b77c44-milvus-querynode-97d687cf4-khp5p              1/1     Running     0          51m
mil-multi-0b77c44-milvus-rootcoord-b964b754b-8j9v2              1/1     Running     0          51m

Current Behavior

The first 4 searches become very slow after reinstalling 48+ hours later, for both the existing collection and newly created collections. For the existing collection:

12/31/2021 06:57:29 AM - INFO - search perf test....
12/31/2021 06:57:29 AM - INFO - search_param: {'metric_type': 'L2', 'params': {'nprobe': 8}}
12/31/2021 06:57:29 AM - INFO - assert insert_nb50000_shards2_threads1_per500_t flushed num_entities 25000000: 0.006
12/31/2021 06:57:29 AM - INFO - assert load insert_nb50000_shards2_threads1_per500_t: 0.004
12/31/2021 06:57:29 AM - INFO - Start search nq1_top10_1threads_per2
12/31/2021 07:09:23 AM - INFO - assert search thread0 round0: 713.953
12/31/2021 07:14:49 AM - INFO - assert search thread0 round1: 325.974
12/31/2021 07:14:49 AM - INFO - Compete search nq1_top10_1threads_per2, cost 1039.9270505905151, QPS: 0.002, vectors_throughput: 0.002
12/31/2021 07:14:49 AM - INFO - assert load insert_nb50000_shards2_threads1_per500_t: 0.006
12/31/2021 07:14:49 AM - INFO - Start search nq1_top10_1threads_per2
12/31/2021 07:18:35 AM - INFO - assert search thread0 round0: 225.503
12/31/2021 07:18:44 AM - INFO - assert search thread0 round1: 9.941
12/31/2021 07:18:44 AM - INFO - Compete search nq1_top10_1threads_per2, cost 235.44476890563965, QPS: 0.008, vectors_throughput: 0.008
12/31/2021 07:18:45 AM - INFO - assert load insert_nb50000_shards2_threads1_per500_t: 0.005
12/31/2021 07:18:45 AM - INFO - Start search nq1_top10_1threads_per2
12/31/2021 07:18:45 AM - INFO - assert search thread0 round0: 0.968
12/31/2021 07:18:47 AM - INFO - assert search thread0 round1: 1.089
12/31/2021 07:18:47 AM - INFO - Compete search nq1_top10_1threads_per2, cost 2.057671546936035, QPS: 0.972, vectors_throughput: 0.972
12/31/2021 07:18:47 AM - INFO - assert load insert_nb50000_shards2_threads1_per500_t: 0.006
12/31/2021 07:18:47 AM - INFO - Start search nq1_top10_1threads_per2
12/31/2021 07:18:48 AM - INFO - assert search thread0 round0: 0.981
12/31/2021 07:18:49 AM - INFO - assert search thread0 round1: 1.011
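For context, the "nq1_top10" searches above correspond to nq=1 and top_k=10 with the logged search_param. A minimal pymilvus sketch of an equivalent call is below; the connection address, vector field name, and dimension are assumptions, and keyword details may differ slightly in pymilvus 2.0.0rc9:

```python
from pymilvus import connections, Collection

# Assumed proxy address; adjust to the deployed service.
connections.connect(host="127.0.0.1", port="19530")

# Collection name taken from the logs above.
collection = Collection("insert_nb50000_shards2_threads1_per500_t")
collection.load()

# nq=1 query vector; dim=128 is an assumption for illustration.
query_vectors = [[0.0] * 128]

results = collection.search(
    data=query_vectors,
    anns_field="float_vector",  # assumed vector field name
    param={"metric_type": "L2", "params": {"nprobe": 8}},  # matches the logged search_param
    limit=10,  # top_k=10
)
print(results[0].ids, results[0].distances)
```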

Besides search, for a newly created collection (timestamps in UTC+8):

  1. insert became slower (from 1.5s to 14.6s), and stayed slow even after searches against the existing collection returned to ~1s
  2. flush became slower (from 2.6s to 1017.4s), and returned to 2.6s after searches against the existing collection returned to ~1s
    [2021-12-31 15:01:15,519 - INFO - ci_test]: [setup_method] Start setup test case test_milvus_default. (client_base.py:52)
    [2021-12-31 15:01:16,878 - INFO - ci_test]: assert create collection: 1.3129212856292725, init_entities: 0 (test_e2e.py:24)
    [2021-12-31 15:01:31,619 - INFO - ci_test]: assert insert: 14.64408016204834 (test_e2e.py:31)
    [2021-12-31 15:18:29,129 - INFO - ci_test]: assert flush: 1017.4635231494904, entities: 3000 (test_e2e.py:39)
    [2021-12-31 15:19:10,685 - INFO - ci_test]: assert search: 39.90831804275513 (test_e2e.py:50)
    [2021-12-31 15:19:29,505 - INFO - ci_test]: assert index entities: 6000 (test_e2e.py:57)
    [2021-12-31 15:19:34,986 - INFO - ci_test]: assert index: 5.480480909347534 (test_e2e.py:64)
    [2021-12-31 15:19:44,836 - INFO - ci_test]: assert load: 9.83054518699646 (test_e2e.py:71)
    [2021-12-31 15:19:45,604 - INFO - ci_test]: assert search: 0.7667279243469238 (test_e2e.py:78)
    [2021-12-31 15:19:46,220 - INFO - ci_test]: assert query result 3: 0.6155099868774414 (test_e2e.py:85)

Expected Behavior

Search responds as fast as it did before the uninstall and reinstall.

Steps To Reproduce

1. Deploy a Milvus cluster
2. Create collection A and insert 10 million entities
3. Load the collection
4. Search (responds in 0.5-2 seconds)
5. Uninstall Milvus, keeping the PVCs
6. Reinstall the Milvus cluster with the same config after 2+ days
7. Run the same search (see the sketch after this list)
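A minimal pymilvus sketch of the SDK-side steps (2-4 and 7) follows; field names, dimension, and the batch size are placeholders, and the uninstall/reinstall (steps 5-6) happens outside the SDK, e.g. via helm, with the PVCs kept:

```python
import random

from pymilvus import (
    Collection, CollectionSchema, DataType, FieldSchema, connections,
)

DIM = 128      # assumed vector dimension
NB = 50_000    # one batch for illustration; the real test inserted 10M+ entities in batches

connections.connect(host="127.0.0.1", port="19530")  # assumed proxy address

# Step 2: create collection A and insert entities.
fields = [
    FieldSchema(name="pk", dtype=DataType.INT64, is_primary=True),
    FieldSchema(name="float_vector", dtype=DataType.FLOAT_VECTOR, dim=DIM),
]
collection = Collection("collection_A", CollectionSchema(fields))
ids = list(range(NB))
vectors = [[random.random() for _ in range(DIM)] for _ in range(NB)]
collection.insert([ids, vectors])

# Step 3: load the collection into the query nodes.
collection.load()

# Step 4 (and step 7 after the reinstall): the same search, to be timed.
results = collection.search(
    data=[[random.random() for _ in range(DIM)]],  # nq=1
    anns_field="float_vector",
    param={"metric_type": "L2", "params": {"nprobe": 8}},
    limit=10,                                      # top_k=10
)

# Steps 5-6 are done outside the SDK: uninstall Milvus while keeping the PVCs,
# wait 2+ days, then reinstall the cluster with the same configuration.
```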

Anything else?

Potential questions behind this issue:

  1. Why is insert into a newly created collection very slow, even after searches against the existing collection return to ~1s?
  2. Why is flush of a newly created collection very slow? Is it impacted by the existing collection, even though there are no insert/flush ops against the existing collection?
  3. If the latency of the 1st search against the existing collection was caused by loading, why do the 2nd, 3rd, and 4th searches stay slow?
yanliang567 commented 2 years ago

Based on the tests above, the recovery time of Milvus seems to be (more tests needed to verify): 600s (tt to latest) + the time to load the existing collection(s).
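(If that estimate holds, the first-search latency above roughly decomposes as 713.953s ≈ 600s of "tt to latest" catch-up plus ~114s of loading/warm-up for the existing collection; this is only arithmetic on the logged numbers, not a measured breakdown.)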

xiaofan-luan commented 2 years ago

The larger the Milvus cluster, the faster the recovery should be. For standalone it's going to be slow if we have a lot of data in memory. For etcd lease notification and retrieving data from the log, I thought it should take less than 1 minute.

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.