googleapis / google-cloud-go

Google Cloud Client Libraries for Go.
https://cloud.google.com/go/docs/reference
Apache License 2.0

storage: implement dataflux fast listing #10731

Open akansha1812 opened 3 weeks ago

akansha1812 commented 3 weeks ago

Listing a large dataset in a GCS bucket sequentially takes a long time. If we can list objects in parallel, listing will complete much faster.

Dataflux fast-listing will list the objects in a bucket in parallel using a work-stealing algorithm. It supports storage.Query to filter objects in a bucket and returns objects in batches. The user can provide the bucket, a storage.Query, the number of parallel workers, and the batch size.
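To make the idea of parallel listing concrete, here is a minimal, stdlib-only sketch. It is not the Dataflux implementation and it does not do true work stealing: it simply splits the object-name space into ranges and lets a fixed pool of workers pull ranges from a shared queue. Everything in it (fakeBucket, nameRange, listPage, splitRanges, listParallel) is a hypothetical stand-in for illustration; a real version would issue list calls against GCS with a storage.Query rather than search an in-memory slice.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// nameRange is a half-open range [Start, End) of object names.
// An empty End means "no upper bound". Hypothetical type for this sketch.
type nameRange struct{ Start, End string }

// fakeBucket stands in for a GCS bucket: a sorted slice of object names.
type fakeBucket struct{ names []string }

// listPage returns up to pageSize names inside r, beginning at r.Start,
// plus the name to resume from ("" when the range is exhausted).
func (b *fakeBucket) listPage(r nameRange, pageSize int) (page []string, next string) {
	i := sort.SearchStrings(b.names, r.Start)
	for ; i < len(b.names) && len(page) < pageSize; i++ {
		if r.End != "" && b.names[i] >= r.End {
			return page, ""
		}
		page = append(page, b.names[i])
	}
	if i < len(b.names) && (r.End == "" || b.names[i] < r.End) {
		return page, b.names[i]
	}
	return page, ""
}

// splitRanges turns sorted boundaries into contiguous ranges that cover
// the whole name space, e.g. ["h","p"] -> ["","h"), ["h","p"), ["p","").
func splitRanges(boundaries []string) []nameRange {
	ranges := make([]nameRange, 0, len(boundaries)+1)
	start := ""
	for _, b := range boundaries {
		ranges = append(ranges, nameRange{Start: start, End: b})
		start = b
	}
	return append(ranges, nameRange{Start: start})
}

// listParallel lists every range using a pool of workers that pull ranges
// from a shared queue, then returns all names in sorted order.
func listParallel(b *fakeBucket, ranges []nameRange, workers, pageSize int) []string {
	if workers < 1 {
		workers = 1
	}
	if pageSize < 1 {
		pageSize = 1
	}
	work := make(chan nameRange, len(ranges))
	for _, r := range ranges {
		work <- r
	}
	close(work)

	var (
		mu  sync.Mutex
		all []string
		wg  sync.WaitGroup
	)
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for r := range work {
				for {
					page, next := b.listPage(r, pageSize)
					mu.Lock()
					all = append(all, page...)
					mu.Unlock()
					if next == "" {
						break
					}
					r.Start = next // resume within the same range
				}
			}
		}()
	}
	wg.Wait()
	sort.Strings(all)
	return all
}

func main() {
	b := &fakeBucket{names: []string{"a1", "a2", "h1", "m5", "p0", "z9"}}
	got := listParallel(b, splitRanges([]string{"h", "p"}), 2, 2)
	fmt.Println(len(got), got)
}
```

A fixed split like this is only balanced when object names are evenly spread across the ranges. Work stealing addresses that by letting an idle worker take over part of a busy worker's remaining range at run time.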

There are several implementations of the work-stealing algorithm, and after benchmarking them, the Dataflux implementation came out fastest.

codyoss commented 3 weeks ago

I am not sure what Dataflux is; is it related to storage? All List RPCs today are compliant with https://google.aip.dev/158 and https://google.aip.dev/client-libraries/4233. This is all based on page_tokens, where you need to do a fetch to get the next result.
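For readers unfamiliar with the page-token pattern mentioned here, the following stdlib-only sketch shows why such listing is inherently sequential: each fetch returns the token needed for the next one. The pagedLister type, its fetch method, and the use of a numeric offset as the token are all assumptions made for illustration; real services treat the token as an opaque string.

```go
package main

import (
	"fmt"
	"strconv"
)

// pagedLister is a hypothetical in-memory service that lists items in pages.
type pagedLister struct{ items []string }

// fetch returns one page starting at pageToken and the token for the next
// page. An empty returned token means there are no more results.
func (l *pagedLister) fetch(pageToken string, pageSize int) ([]string, string, error) {
	if pageSize <= 0 {
		return nil, "", fmt.Errorf("page size must be positive, got %d", pageSize)
	}
	start := 0
	if pageToken != "" {
		n, err := strconv.Atoi(pageToken)
		if err != nil || n < 0 || n > len(l.items) {
			return nil, "", fmt.Errorf("invalid page token %q", pageToken)
		}
		start = n
	}
	end := start + pageSize
	if end >= len(l.items) {
		return l.items[start:], "", nil
	}
	return l.items[start:end], strconv.Itoa(end), nil
}

// listAll drains every page. Each iteration depends on the token returned
// by the previous fetch, so the calls cannot be issued concurrently.
func listAll(l *pagedLister, pageSize int) (all []string, fetches int, err error) {
	token := ""
	for {
		page, next, err := l.fetch(token, pageSize)
		if err != nil {
			return nil, fetches, err
		}
		fetches++
		all = append(all, page...)
		if next == "" {
			return all, fetches, nil
		}
		token = next
	}
}

func main() {
	l := &pagedLister{items: []string{"a", "b", "c", "d", "e"}}
	all, fetches, err := listAll(l, 2)
	fmt.Println(all, fetches, err)
}
```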

tritone commented 3 weeks ago

@codyoss this will be a sub-package for storage similar to transfer manager, but focused on a few new features for AI/ML workloads.