lancedb / lance

Modern columnar data format for ML and LLMs implemented in Rust. Convert from Parquet in 2 lines of code for 100x faster random access, vector index, and data versioning. Compatible with Pandas, DuckDB, Polars, PyArrow, with more integrations coming.
https://lancedb.github.io/lance/
Apache License 2.0

Limit parallelism in `Dataset.cleanup_old_versions` #2805

Open tonyf opened 2 weeks ago

tonyf commented 2 weeks ago

I'm running into S3 rate limits when trying to clean up a very large dataset with `dataset.cleanup_old_versions`. I can't seem to control this via `LANCE_IO_THREADS`.

wjones127 commented 2 weeks ago

This could be due to hard-coded concurrency in object_store: https://github.com/apache/arrow-rs/blob/a937869f892dc12c4730189e216bf3bd48c2561d/object_store/src/aws/mod.rs#L252

We might need to make this controllable upstream somehow.
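For illustration only, the kind of knob being asked for can be sketched with a semaphore that caps the number of in-flight delete requests. This is a generic pattern in Python, not the code in `object_store` or Lance; `delete_all`, `delete_one`, and `max_concurrency` are hypothetical names.

```python
import asyncio


async def delete_all(paths, delete_one, max_concurrency=8):
    """Delete every path with at most `max_concurrency` requests in flight.

    `delete_one` is a hypothetical async callable that deletes one object;
    it stands in for whatever the storage client actually provides.
    """
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(path):
        # Each task waits here until one of the slots is free.
        async with semaphore:
            await delete_one(path)

    # All tasks are scheduled up front, but the semaphore bounds how many
    # are inside `delete_one` at the same time.
    await asyncio.gather(*(guarded(path) for path in paths))
```

Lowering `max_concurrency` trades cleanup speed for a lower request rate, which is the trade-off that matters when the store is rate limiting.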

tonyf commented 2 weeks ago

Hm, is there any way to temporarily monkeypatch Rust-level code from Python?

westonpace commented 2 weeks ago

Actually, we don't use delete_stream (mainly by chance) so we probably don't need to worry about object_store. I suspect this is fixed in 0.17.0b9 (released yesterday) via https://github.com/lancedb/lance/pull/2773

We were previously using num_cpus::get and now are using LANCE_IO_THREADS.
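To make the difference concrete, here is a rough Python sketch of resolving a parallelism limit from `LANCE_IO_THREADS` rather than always taking the CPU count. It is not Lance's implementation: the function name, the validation, and the fallback to the CPU count when the variable is unset or invalid are all assumptions.

```python
import os


def resolve_io_parallelism(env=None, default=None):
    """Return a parallelism limit, preferring LANCE_IO_THREADS when usable.

    Hypothetical helper: the fallback rules here are assumptions, not
    behavior confirmed for Lance.
    """
    env = os.environ if env is None else env
    raw = env.get("LANCE_IO_THREADS")
    if raw is not None:
        try:
            value = int(raw)
        except ValueError:
            value = 0  # treat a non-numeric value as unset
        if value > 0:
            return value
    # Old-style behavior: size the pool from the machine, not from config.
    return default or os.cpu_count() or 1
```

With a limit derived only from the CPU count, a large machine issues more concurrent requests no matter what the store can tolerate, which is why an explicit override helps with rate limits.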

wjones127 commented 2 weeks ago

> Actually, we don't use delete_stream (mainly by chance) so we probably don't need to worry about object_store.

What makes you say that? I see us call remove_stream here:

https://github.com/lancedb/lance/blob/2f25fc473dd69c8bc298c4f4e171b81f87660656/rust/lance/src/dataset/cleanup.rs#L289

Which dispatches to delete_stream here:

https://github.com/lancedb/lance/blob/2f25fc473dd69c8bc298c4f4e171b81f87660656/rust/lance-io/src/object_store.rs#L614-L615

tonyf commented 2 weeks ago

I'm now getting

OSError: LanceError(IO): Generic S3 error: Got invalid DeleteObjects response: unknown variant `Code`, expected `Deleted` or `Error`, /root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/futures-util-0.3.30/src/fns.rs:368:13

Maybe this is happening because a previous cleanup operation failed without marking the version as deleted, so the delete is now getting a "not found" response? I'm not sure how to work around this.

westonpace commented 2 weeks ago

> What makes you say that? I see us call remove_stream here:

Ah, I was just searching for delete_stream and saw the parallelism on old_manifests and assumed that was it. My mistake.

westonpace commented 2 weeks ago

> OSError: LanceError(IO): Generic S3 error: Got invalid DeleteObjects response: unknown variant `Code`, expected `Deleted` or `Error`, /root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/futures-util-0.3.30/src/fns.rs:368:13

That's a new one for me. Seems almost like a malformed S3 response.