Open akansha1812 opened 3 weeks ago
I am not sure what data flux is. Is it related to storage? All List RPCs today comply with https://google.aip.dev/158 and https://google.aip.dev/client-libraries/4233. Both are based on page tokens: you have to fetch one page to get the token for the next page of results.
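For context, the page-token pattern described above can be sketched as follows. This is a minimal in-memory illustration, not the storage client's API: `page`, `fetchPage`, and `listAll` are hypothetical names, and the token here happens to encode an offset, whereas a real token is opaque to the caller. It shows why listing is inherently sequential: each request needs the token returned by the previous one.

```go
package main

import (
	"errors"
	"strconv"
)

// page is a hypothetical stand-in for one List response.
type page struct {
	items         []string
	nextPageToken string // empty when there are no more pages
}

// fetchPage is a hypothetical in-memory List call. Callers must treat the
// token as opaque; in this sketch it simply encodes the next offset.
func fetchPage(all []string, pageSize int, token string) (page, error) {
	if pageSize <= 0 {
		return page{}, errors.New("pageSize must be positive")
	}
	start := 0
	if token != "" {
		n, err := strconv.Atoi(token)
		if err != nil || n < 0 || n > len(all) {
			return page{}, errors.New("invalid page token")
		}
		start = n
	}
	end := start + pageSize
	if end > len(all) {
		end = len(all)
	}
	p := page{items: all[start:end]}
	if end < len(all) {
		p.nextPageToken = strconv.Itoa(end)
	}
	return p, nil
}

// listAll walks every page in order. The next request cannot be issued
// until the current response has supplied its token.
func listAll(all []string, pageSize int) ([]string, error) {
	var out []string
	token := ""
	for {
		p, err := fetchPage(all, pageSize, token)
		if err != nil {
			return nil, err
		}
		out = append(out, p.items...)
		if p.nextPageToken == "" {
			return out, nil
		}
		token = p.nextPageToken
	}
}
```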
@codyoss this will be a sub-package for storage similar to transfer manager, but focused on a few new features for AI/ML workloads.
Listing a large dataset in a GCS bucket sequentially takes a long time. If we can list objects in parallel, the listing will complete much faster.
Dataflux fast-listing will list the objects in a bucket in parallel using a worksteal algorithm. It supports storage.Query for filtering the objects in a bucket and returns objects in batches. The user can provide the bucket, a storage.Query, the number of parallel workers, and the batch size.
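To make the idea concrete, here is a rough sketch of parallel listing over split ranges. It is not the Dataflux implementation and does not call the storage package: it uses a shared queue with range splitting instead of per-worker work stealing, and it lists from an in-memory sorted slice where index ranges stand in for object-name ranges. `span` and `parallelList` are hypothetical names; `workers` and `batchSize` mirror the parameters mentioned above.

```go
package main

import (
	"sort"
	"sync"
)

// span is a half-open index range [lo, hi) over the sorted names. In a real
// bucket this would be a range of object names (an assumption of this sketch).
type span struct{ lo, hi int }

// parallelList lists names using several workers. Each worker takes a range,
// lists one batch from it, then returns the remainder to the shared queue,
// split in two when it is large enough so that idle workers can pick it up.
func parallelList(names []string, workers, batchSize int) []string {
	if workers < 1 {
		workers = 1
	}
	if batchSize < 1 {
		batchSize = 1
	}
	var (
		mu      sync.Mutex
		wg      sync.WaitGroup
		queue   = []span{{0, len(names)}}
		pending = 1 // ranges that are queued or being processed
		results []string
	)
	cond := sync.NewCond(&mu)
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for {
				mu.Lock()
				for len(queue) == 0 && pending > 0 {
					cond.Wait() // idle: wait for another worker to share work
				}
				if pending == 0 {
					mu.Unlock()
					return // nothing queued and nothing in progress
				}
				r := queue[len(queue)-1]
				queue = queue[:len(queue)-1]
				mu.Unlock()

				// "List" one batch from the front of the range.
				end := r.lo + batchSize
				if end > r.hi {
					end = r.hi
				}
				batch := names[r.lo:end]
				rest := span{end, r.hi}

				mu.Lock()
				results = append(results, batch...)
				switch n := rest.hi - rest.lo; {
				case n == 0:
					pending-- // this range is finished
				case n > batchSize:
					mid := rest.lo + n/2
					queue = append(queue, span{rest.lo, mid}, span{mid, rest.hi})
					pending++ // one range became two
				default:
					queue = append(queue, rest)
				}
				cond.Broadcast()
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
	sort.Strings(results) // batches arrive in nondeterministic order
	return results
}
```

Calling `parallelList(names, 3, 2)` on a sorted slice returns the same names in sorted order; only the order in which batches were fetched varies between runs.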
There are several existing implementations of the worksteal algorithm. After benchmarking them, the Dataflux implementation came out faster than the others.