qdrant / vector-db-benchmark

Framework for benchmarking vector search engines
https://qdrant.tech/benchmarks/
Apache License 2.0

backoff strategy should be used for rate-limited errors on milvus, or batch_size config should be reduced #54

Open filipecosta90 opened 1 year ago

filipecosta90 commented 1 year ago

It's common to see the following type of error on non-local setups:

pymilvus.exceptions.MilvusException: <MilvusException: (code=49, message=Retry run out of 10 retry times, message=request is rejected by grpc RateLimiter middleware, please retry later, req: /milvus.proto.milvus.MilvusService/Insert)>

Full traceback:

Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/home/ubuntu/vector-db-benchmark/engine/base_client/upload.py", line 90, in _upload_batch
    cls.upload_batch(ids, vectors, metadata)
  File "/home/ubuntu/vector-db-benchmark/engine/clients/milvus/upload.py", line 68, in upload_batch
    cls.upload_with_backoff(field_values, ids, vectors)
  File "/usr/local/lib/python3.10/dist-packages/backoff/_sync.py", line 105, in retry
    ret = target(*args, **kwargs)
  File "/home/ubuntu/vector-db-benchmark/engine/clients/milvus/upload.py", line 75, in upload_with_backoff
    cls.collection.insert([ids, vectors] + field_values)
  File "/usr/local/lib/python3.10/dist-packages/pymilvus/orm/collection.py", line 443, in insert
    res = conn.batch_insert(self._name, entities, partition_name,
  File "/usr/local/lib/python3.10/dist-packages/pymilvus/decorators.py", line 109, in handler
    raise e
  File "/usr/local/lib/python3.10/dist-packages/pymilvus/decorators.py", line 105, in handler
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/pymilvus/decorators.py", line 136, in handler
    ret = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/pymilvus/decorators.py", line 80, in handler
    raise MilvusException(e.code, f"{timeout_msg}, message={e.message}") from e
pymilvus.exceptions.MilvusException: <MilvusException: (code=49, message=Retry run out of 10 retry times, message=request is rejected by grpc RateLimiter middleware, please retry later, req: /milvus.proto.milvus.MilvusService/Insert)>
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):

  File "/home/ubuntu/vector-db-benchmark/run.py", line 79, in <module>
    app()

  File "/home/ubuntu/vector-db-benchmark/run.py", line 74, in run
    raise e

  File "/home/ubuntu/vector-db-benchmark/run.py", line 52, in run
    client.run_experiment(dataset, skip_upload, skip_search)

  File "/home/ubuntu/vector-db-benchmark/engine/base_client/client.py", line 70, in run_experiment
    upload_stats = self.uploader.upload(

  File "/home/ubuntu/vector-db-benchmark/engine/base_client/upload.py", line 56, in upload
    latencies = list(

  File "/usr/lib/python3.10/multiprocessing/pool.py", line 873, in next
    raise value

pymilvus.exceptions.MilvusException: <MilvusException: (code=49, message=Retry run out of 10 retry times, message=request is rejected by grpc RateLimiter middleware, please retry later, req: /milvus.proto.milvus.MilvusService/Insert)>

Given that the Milvus configs don't specify the batch_size, we're using 64 vectors per batch, which seems to constantly trigger the error shown above. I suggest we either respect API rate limits with a backoff strategy or reduce the batch size.
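For illustration, here is a minimal sketch of the backoff option using the backoff library that the Milvus upload path already pulls in. The helper names, the RateLimiter string check, and the 120-second cap are assumptions made for this sketch, not the repository's actual implementation:

# Illustrative sketch: retry rate-limited inserts with exponential backoff.
# Helper names and the rate-limit detection heuristic are assumptions.
import backoff
from pymilvus import Collection, MilvusException


def _is_rate_limited(exc: Exception) -> bool:
    # Zilliz Cloud rejects throttled requests via the gRPC RateLimiter middleware.
    return isinstance(exc, MilvusException) and "RateLimiter" in str(exc)


@backoff.on_exception(
    backoff.expo,                             # wait 1s, 2s, 4s, ... between attempts
    MilvusException,
    max_time=120,                             # give up after two minutes of retrying
    giveup=lambda e: not _is_rate_limited(e), # only retry rate-limit rejections
)
def insert_with_backoff(collection: Collection, ids, vectors, field_values):
    collection.insert([ids, vectors] + field_values)

Reducing batch_size in the engine config would achieve a similar result with less code, at the cost of more round trips.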

KShivendu commented 8 months ago

errors on non-local setups

Are you testing Zilliz's cloud offering?

wangting0128 commented 2 months ago

Hello, are you running the test on zilliz cloud? Can you provide the instance specifications you used?

filipecosta90 commented 2 months ago

Hello, are you running the test on zilliz cloud?

@wangting0128 yes.

Can you provide the instance specifications you used?

Sure. I've used the Dedicated Performance Optimized CU size 1 (the issue happens on larger CUs as well). I confirmed yesterday that it still happens:

MILVUS_USER="db_admin" MILVUS_PASS="<...>" MILVUS_PORT=<...> python3 run.py  --engines milvus-m-* --datasets gist-960-euclidean --host <...>

(...)
(...)
Running experiment: milvus-m-16-ef-64 - gist-960-euclidean
established connection
/home/ubuntu/vector-db-benchmark/datasets/gist-960-euclidean/gist-960-euclidean.hdf5 already exists
Experiment stage: Configure
Experiment stage: Upload
644800it [09:51, 1120.07it/s][batch_insert] retry:8, cost: 3.00s, reason: <_MultiThreadedRendezvous: StatusCode.UNAVAILABLE, Broken pipe>
649664it [09:55, 1204.76it/s][batch_insert] retry:9, cost: 3.00s, reason: <_MultiThreadedRendezvous: StatusCode.UNAVAILABLE, Broken pipe>
1000000it [15:16, 1090.80it/s]
Upload time: 919.8683542869985
Total import time: 1126.062087302911
Experiment stage: Search
(...)

Notice that after around 10 minutes of ingestion Zilliz Cloud "breaks" and we need 8 and 9 retries to complete those batch inserts. I've preserved the full log of all variations in case we need it in the future.

@wangting0128 note that I've added a backoff strategy capability to the tool to ensure we can properly handle these issues and benchmark under the correct conditions. I'll open a PR just for the Zilliz Cloud benchmarking later today.

wangting0128 commented 1 month ago

Hi, sorry for only replying to your message now.

Based on your problem description, I have some information to share with you~:

  1. Index and search parameters used by Zilliz Cloud: you can refer to the official documentation. Correspondingly, we have provided a PR with a set of configurations; please review it. PR: The provided configuration file only contains one set of milvus-cloud configuration, because I don't know whether you are running the same configuration on all datasets :>
  2. There are insertion limits for instances of different specifications in Zilliz Cloud, so you may encounter insertion errors. For the specific limit values, please refer to the documentation. (A minimal sketch of lowering the upload batch size to stay under such limits follows this list.)
  3. Based on the instance specifications described in the blog, we recommend that you use the Dedicated Performance Optimized CU size 4 of Zilliz Cloud :>
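
As an illustration only, lowering the upload batch size and parallelism for a milvus-cloud configuration might look like the snippet below. The file path, key names, and values are assumptions about how this repository's experiment configs are typically laid out, not the contents of the PR mentioned above:

# Hypothetical helper: shrink the upload batch size in an experiment config
# so inserts stay under the Zilliz Cloud rate limits for a given CU size.
import json
from pathlib import Path

config_path = Path("experiments/configurations/milvus-cloud.json")  # hypothetical path
experiments = json.loads(config_path.read_text())

for experiment in experiments:
    upload = experiment.setdefault("upload_params", {})
    upload["batch_size"] = 32  # smaller batches reduce the chance of rate-limit rejections
    upload["parallel"] = 4     # fewer concurrent upload workers has a similar effect

config_path.write_text(json.dumps(experiments, indent=2))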

If you have any further questions, please feel free to contact us. Thank you very much~