lancedb / lance

Modern columnar data format for ML and LLMs implemented in Rust. Convert from Parquet in 2 lines of code for 100x faster random access, a vector index, and data versioning. Compatible with Pandas, DuckDB, Polars, PyArrow, and PyTorch, with more integrations coming.
https://lancedb.github.io/lance/
Apache License 2.0

perf bug: Inserting data is O(num versions) #2318

Open wjones127 opened 6 months ago

wjones127 commented 6 months ago

It appears the time to write data scales linearly with the number of versions. This is not great. On my local machine, a write starts at about 10 ms and grows to about 30 ms after a few thousand versions. For a higher-latency store, I suspect the effect is more dramatic: one user reported a latency of 1.5 sec after 8k versions.

My best guess is that this happens because, to load the latest version, we list every file in the versions directory. We might have to implement the first part of #1362 to fix this.
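
For illustration, here is a minimal sketch of why a listing-based lookup is O(num versions). The directory layout and helper name below are assumptions for the example, not the actual Lance implementation:

```python
import os

def latest_version_naive(dataset_uri: str) -> int:
    """Hypothetical lookup: find the newest manifest by listing every file
    under a versions directory. The listing itself touches every version,
    so any write that needs the latest version pays an O(num versions) cost."""
    versions_dir = os.path.join(dataset_uri, "_versions")
    manifests = os.listdir(versions_dir)  # scans all version files
    return max(int(name.split(".")[0]) for name in manifests)
```

On an object store, this listing turns into paginated LIST requests, which is why the latency grows so much more on high-latency stores than on a local file system.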

Reproduce this

```python
from datetime import timedelta
import time

import pyarrow as pa
import lance

data = pa.table({'a': pa.array([1])})

# Uncomment this part to reset and see that once we delete versions, the
# latency goes back down.
# ds = lance.dataset("test_data")
# ds.cleanup_old_versions(older_than=timedelta(seconds=1), delete_unverified=True)

for i in range(10000):
    start = time.monotonic()
    # Use overwrite to eliminate the possibility that it is O(num files)
    lance.write_dataset(data, 'test_data', mode='overwrite')
    print(time.monotonic() - start)
```

wjones127 commented 6 months ago

This should be substantially mitigated by #2396. However, there is still an optimization available for stores that support list start-after (GCS, S3). That optimization won't help other stores, such as local file systems or Azure, so it's unclear whether it is worthwhile. It may be more worthwhile to invest in auto-cleanup so users don't accumulate so many versions in the first place.
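
Until auto-cleanup exists, one workaround is for the writer to trim old versions periodically, using the same `cleanup_old_versions` call shown in the repro above. This is just a sketch; the every-100-writes cadence and the 1-second retention window are arbitrary choices for illustration:

```python
from datetime import timedelta

import pyarrow as pa
import lance

data = pa.table({'a': pa.array([1])})

for i in range(10000):
    lance.write_dataset(data, 'test_data', mode='overwrite')
    # Periodically delete old versions so the versions directory stays small
    # and the per-write listing cost stops growing.
    if i % 100 == 0:
        ds = lance.dataset("test_data")
        ds.cleanup_old_versions(older_than=timedelta(seconds=1),
                                delete_unverified=True)
```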