Open JSpenced opened 1 year ago
Thanks for sharing that. Bookmarking that class in case I need to take a closer look :) https://github.com/Netflix/metaflow/blob/246591595f21f68959386f65c84092ab6b0c2885/metaflow/plugins/datatools/s3/s3.py#L441
We'll probably be most interested in optimizing our own read paths to be just as scalable as that. Part of that will be using larger block sizes (#1094) when we are doing scans. And part of that is figuring out how we can do a better job of making more parallel requests without hitting rate limits.
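Roughly the kind of thing I have in mind (just a sketch with boto3, not Lance's actual read path; the bucket/key names, the 16 MiB block size, and the concurrency cap are all made-up placeholders):

```python
# Sketch only: issue larger ranged GETs in parallel while capping the number
# of in-flight requests so we stay under S3 rate limits.
import concurrent.futures

import boto3

s3 = boto3.client("s3")
BLOCK_SIZE = 16 * 1024 * 1024  # larger blocks -> fewer requests per scan
MAX_CONCURRENCY = 32           # cap in-flight requests to avoid throttling


def read_block(bucket: str, key: str, offset: int, length: int) -> bytes:
    # Ranged GET for one block of the object.
    resp = s3.get_object(
        Bucket=bucket,
        Key=key,
        Range=f"bytes={offset}-{offset + length - 1}",
    )
    return resp["Body"].read()


def read_object_parallel(bucket: str, key: str) -> bytes:
    # Split the object into fixed-size blocks and fetch them concurrently.
    size = s3.head_object(Bucket=bucket, Key=key)["ContentLength"]
    offsets = range(0, size, BLOCK_SIZE)
    with concurrent.futures.ThreadPoolExecutor(MAX_CONCURRENCY) as pool:
        blocks = pool.map(
            lambda off: read_block(bucket, key, off, min(BLOCK_SIZE, size - off)),
            offsets,
        )
        return b"".join(blocks)
```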
Yup, all of the above makes sense. Metaflow comes from Netflix originally, so on the AWS side they usually know how to set limits and call AWS optimally 👍.
Thanks for all the responses!
We use metaflow for orchestration of many of our ML tasks, and they have something that, if you specify a path in S3, will grab the data extremely fast because it's done in parallel. It can't do a lot of the things Lance can, like filtering or selecting specific columns, but I think the download method on the Lance backend could change depending on what you are trying to do? This is the code normally used for metaflow, and I think if a filtering clause isn't specified you could implement something similar:

Just pointing it out because if you are outputting daily new fragments to append to the data, it's quite slow grabbing the whole dataset, where effectively the same operation using the code above is 20x~30x faster. This would get tricky with filtering or even specifying columns, I think. Anyway, I can further point to the above code, but it's just a possible future optimization, especially if someone wants to pull down the whole dataset from blob storage. A caveat is that this downloads all those chunks in parallel to disk and then loads them in, so you would need ample disk space.