lancedb / lance

Modern columnar data format for ML and LLMs implemented in Rust. Convert from Parquet in 2 lines of code for 100x faster random access, vector index, and data versioning. Compatible with Pandas, DuckDB, Polars, PyArrow, and PyTorch with more integrations coming.
https://lancedb.github.io/lance/
Apache License 2.0
3.99k stars 230 forks

Expose `with_retry` in storage options? #3182

Open oceanusxiv opened 1 week ago

oceanusxiv commented 1 week ago

The underlying `object_store` crate supports a `with_retry` configuration, which is useful for exponential backoff and jitter during temporary network outages. It should be exposed to the user via the `storage_options` API (or some other API) so that it can be set. As it stands, I don't think there's any exponential backoff on the download retries?

westonpace commented 1 week ago

Correct, there is no exponential backoff on the download retries (though it depends on your definition of exponential). However, I'm not sure that `object_store` is the place to configure retry backoffs for network outages. What sort of max duration are you looking for?

If you are looking for something over 5 minutes then you will encounter this warning from `object_store`:

> As requests are retried without renewing credentials or regenerating request payloads, this number should be kept below 5 minutes to avoid errors due to expired credentials and/or request payloads

If you are looking for something less than 5 minutes then you can probably get there by exposing `with_retry` in some way. It should be a fairly straightforward change. Probably the simplest thing to expose would be `init_backoff`. I'd advise anyone working on this to read up on the actual algorithm used, which is "decorrelated jitter" and not "classic exponential growth". It is designed to avoid waves of concurrent requests, not to solve network outages. Its growth is sub-linear.
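For anyone reading up on it, here is a minimal sketch of the general decorrelated jitter idea in Python. It is not a transcription of the `object_store` code: the function name and the parameters `init_backoff`, `max_backoff`, and `base` are illustrative assumptions, and the values in the usage note are arbitrary. Each delay is drawn at random between the initial backoff and a multiple of the previous delay, then capped.

```python
import random


def decorrelated_jitter(init_backoff, max_backoff, base, attempts, rng=None):
    """Return a list of `attempts` delays (seconds) using decorrelated jitter.

    Hypothetical helper for illustration only; parameter names are assumptions.
    """
    rng = rng or random.Random()
    delays = []
    current = init_backoff
    for _ in range(attempts):
        delays.append(current)
        # Draw the next delay uniformly between the floor and a multiple
        # of the current delay, so concurrent clients drift apart.
        upper = max(init_backoff, current * base)
        current = min(max_backoff, rng.uniform(init_backoff, upper))
    return delays
```

Calling `decorrelated_jitter(0.1, 15.0, 2.0, 10)` a few times shows the point: the sequence is randomized on every run and can shrink as well as grow, so it spreads out concurrent retries rather than guaranteeing a long wait.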

We do have an outer retry loop that we use in most places, which can be configured with `download_retry_count` (sadly not documented), but this only applies to the download of the data and not to the initial transmission of headers.

So, if we want a retry loop for intermittent network timeouts, it probably needs to be a new retry loop. I'd be open to the idea but also slightly cautious, as this feels like something not all users will need, and the users who do can build their own retry loop outside of Lance.
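As a rough illustration of what a retry loop outside of Lance could look like, here is a minimal Python sketch. Everything in it is an assumption made for the example: the helper name `retry_with_backoff`, its parameters, the choice of `OSError` as the retryable exception type, and the doubling-with-jitter schedule. None of it is part of Lance's API.

```python
import random
import time


def retry_with_backoff(operation, max_attempts=5, init_backoff=0.5,
                       max_backoff=30.0, retryable=(OSError,),
                       sleep=time.sleep):
    """Call `operation()` until it succeeds or attempts run out.

    Hypothetical helper; `retryable` should list whichever exception
    types your workload treats as transient.
    """
    delay = init_backoff
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except retryable:
            if attempt == max_attempts:
                raise  # out of attempts, surface the last error
            # Sleep a random amount up to the current delay, then
            # double the delay up to the cap.
            sleep(random.uniform(0, delay))
            delay = min(max_backoff, delay * 2)
```

The operation would be a closure around whatever read needs protecting, for example `lambda: lance.dataset(uri).to_table()`. Because each attempt starts the call from scratch, this covers failures during the initial request as well as during the data download, at the cost of redoing work that had already succeeded.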