lancedb / lance

Modern columnar data format for ML and LLMs implemented in Rust. Convert from Parquet in 2 lines of code for 100x faster random access, a vector index, and data versioning. Compatible with Pandas, DuckDB, Polars, PyArrow, and PyTorch, with more integrations coming.
https://lancedb.github.io/lance/
Apache License 2.0
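For context on the "convert from Parquet in 2 lines of code" claim above, a minimal sketch (the file paths are placeholders):

    import lance
    import pyarrow.dataset as ds

    # Read an existing Parquet dataset and write it out in Lance format
    parquet = ds.dataset("data.parquet")        # placeholder input path
    lance.write_dataset(parquet, "data.lance")  # placeholder output path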

Grabbing whole dataset from s3 currently slow (possibly other cloud blob storage) #1215

Open JSpenced opened 1 year ago

JSpenced commented 1 year ago

We use Metaflow for orchestration of many of our ML tasks, and it has a feature where, if you specify a path in S3, it grabs the data extremely fast because the downloads run in parallel. It can't do a lot of the things Lance can, like filtering or selecting specific columns, but I think the download method on the Lance backend could change depending on what you are trying to do. This is the code normally used with Metaflow; if no filtering clause is specified, I think you could implement something similar:

    # Download every object under the S3 prefix in parallel, then read the files locally
    with S3(s3root=f"{source_s3_base_path}") as s3:
        files = s3.get_all()
        return pd.concat([pd.read_parquet(file.path) for file in files])

Just pointing it out because if you are appending newly written fragments to the dataset daily, grabbing the whole dataset is quite slow, whereas effectively the same operation using the code above is 20x-30x faster. This would get tricky with filtering or even with specifying columns, I think. Anyway, I can point further at the above code, but it's a possible future optimization, especially for anyone who wants to pull down the whole dataset from blob storage. One caveat: this approach downloads all those chunks in parallel to disk and then loads them in, so you would need ample disk space.
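For reference, the Lance side of that comparison is just a full, unfiltered scan straight from object storage; a minimal sketch, with a placeholder S3 URI:

    import lance

    # Open the dataset directly on S3 and pull every fragment into memory
    dataset = lance.dataset("s3://my-bucket/my-dataset.lance")  # placeholder URI
    df = dataset.to_table().to_pandas()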

wjones127 commented 1 year ago

Thanks for sharing that. Bookmarking that class in case I need to take a closer look :) https://github.com/Netflix/metaflow/blob/246591595f21f68959386f65c84092ab6b0c2885/metaflow/plugins/datatools/s3/s3.py#L441

We'll probably be most interested in optimizing our own read paths to be just as scalable as that. Part of that will be using larger block sizes (#1094) when we are doing scans, and part of it is figuring out how to do a better job of making more parallel requests without hitting rate limits.
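As a generic illustration of that throughput-versus-rate-limit tradeoff (a sketch of the general technique, not Lance's internals), bounding the number of in-flight S3 requests with a thread pool is the usual knob; the bucket, keys, and worker count below are placeholders:

    import os
    from concurrent.futures import ThreadPoolExecutor

    import boto3

    def download_all(bucket: str, keys: list[str], dest_dir: str, max_workers: int = 32):
        """Fetch many S3 objects concurrently, bounded by max_workers."""
        s3 = boto3.client("s3")  # boto3 low-level clients are safe to share across threads

        def fetch(key: str) -> str:
            local = os.path.join(dest_dir, os.path.basename(key))
            s3.download_file(bucket, key, local)  # one GET per object
            return local

        # max_workers caps concurrent requests: raise it for throughput,
        # lower it to stay under the object store's rate limits
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            return list(pool.map(fetch, keys))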

JSpenced commented 1 year ago

Yup, all of the above makes sense. Metaflow comes from Netflix originally, so on the AWS side they usually know how to set limits and call AWS optimally 👍.

Thanks for all the responses!