iterative / dvc

🦉 Data Versioning and ML Experiments
https://dvc.org
Apache License 2.0

cloud/fs concurrency for large files #9893

Open dberenbaum opened 1 year ago

dberenbaum commented 1 year ago

@shcheklein in that case our concurrency level will be `jobs * jobs`, which is generally going to be way too high in the default case. I also considered splitting `jobs` between the two (so `batch_size = max(1, jobs // 2)` and the same for `max_concurrency`), but that would make us perform a lot worse when pushing a large number of files that are smaller than the chunk size.

I think it will be worth revisiting this to properly determine what level of concurrency we should use at both the file and the chunk level, but that depends on the number of files being transferred, their total size, and the chunk size for the given cloud. This is all work we can do at some point; in the short term I prioritized fixing the worst-case scenario for Azure (pushing a single large file).
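A hypothetical sketch of the balancing act described above: split a total worker budget (`jobs`) between file-level parallelism (`batch_size`) and per-file chunk parallelism (`max_concurrency`), based on how many files are big enough to be chunked at all. None of these names or heuristics are DVC's; this is illustration only.

```python
def split_concurrency(jobs: int, file_sizes: list[int], chunk_size: int) -> tuple[int, int]:
    """Return (batch_size, max_concurrency) with batch_size * max_concurrency <= jobs."""
    large_files = sum(1 for size in file_sizes if size > chunk_size)

    if large_files == 0 or large_files >= jobs:
        # Either nothing gets chunked, or there are enough large files to keep
        # every worker busy at the file level: spend the whole budget on files.
        return jobs, 1

    # Few large files: run fewer files in parallel and give each one a share of
    # the budget for concurrent chunk uploads.
    batch_size = large_files
    max_concurrency = max(1, jobs // batch_size)
    return batch_size, max_concurrency


# e.g. jobs=8 with one 5 GiB file and many tiny files -> (1, 8): one file at a
# time, eight concurrent chunks for that file.
```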

Also, any work we do on this right now would only benefit Azure, since adlfs is currently the only underlying fsspec implementation that actually does concurrent chunked/multipart uploads and downloads. It would be better for us to contribute upstream to add chunk/multipart concurrency to the s3/gcs/etc. implementations first, before we try to make DVC balance file-level and chunk-level concurrency.

Originally posted by @pmrowla in https://github.com/iterative/dvc-objects/issues/218#issuecomment-1672435786

Tasks

- [x] S3: https://github.com/fsspec/s3fs/pull/848
- [ ] GCS

pmrowla commented 8 months ago

Regarding GCS, the default method of uploading files in chunks requires that chunks be uploaded sequentially, in order, which is what gcsfs does now (see chunked/streaming uploads in the GCS docs: https://cloud.google.com/storage/docs/performing-resumable-uploads#chunked-upload).
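A minimal sketch of that resumable (chunked) upload flow, using plain `requests` against the endpoint documented at the link above. The point is structural: each PUT's `Content-Range` must start exactly where the previous chunk ended, so chunks cannot be sent concurrently. The bucket, object name, and bearer token here are placeholders, and error handling is reduced to the essentials.

```python
import os

import requests


def resumable_upload(path, bucket, name, token, chunk_size=8 * 256 * 1024):
    """Upload `path` via a GCS resumable session, one chunk after another."""
    total = os.path.getsize(path)

    # 1. Initiate the session; the session URI comes back in the Location header.
    init = requests.post(
        f"https://storage.googleapis.com/upload/storage/v1/b/{bucket}/o",
        params={"uploadType": "resumable", "name": name},
        headers={"Authorization": f"Bearer {token}"},
    )
    init.raise_for_status()
    session_uri = init.headers["Location"]

    # 2. PUT chunks strictly in order: every Content-Range continues from the
    #    previous chunk's last byte, so there is nothing to parallelize here.
    offset = 0
    with open(path, "rb") as fobj:
        while offset < total:
            chunk = fobj.read(chunk_size)
            end = offset + len(chunk) - 1
            resp = requests.put(
                session_uri,
                data=chunk,
                headers={"Content-Range": f"bytes {offset}-{end}/{total}"},
            )
            # 308 means "resume incomplete" (more chunks expected); 200/201 means done.
            if resp.status_code not in (200, 201, 308):
                resp.raise_for_status()
            offset = end + 1
```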

There is a separate XML API that allows concurrent uploads (like S3 multipart uploads), but it introduces some limitations/constraints on the GCS bucket and requires a different set of IAM permissions than regular uploads (see https://cloud.google.com/storage/docs/multipart-uploads). The main limitation is that MD5s are no longer available for those objects, but ETags are still guaranteed to change whenever an object's content or metadata changes, which is sufficient for DVC's purposes.
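For contrast, a rough sketch of the XML multipart flow (the S3-style API from the docs linked above): initiate an upload, PUT the parts concurrently, then complete with the list of part ETags. Authentication/signing is elided and `headers` is a placeholder, so treat the exact request shapes as an approximation rather than a drop-in client.

```python
from concurrent.futures import ThreadPoolExecutor
import re

import requests

BASE = "https://storage.googleapis.com"


def multipart_upload(bucket, name, parts, headers, max_concurrency=4):
    """Upload pre-split `parts` (a list of bytes) as one object via the XML API."""
    # 1. Initiate: the response body contains an <UploadId>.
    init = requests.post(f"{BASE}/{bucket}/{name}", params={"uploads": ""}, headers=headers)
    upload_id = re.search(r"<UploadId>(.+?)</UploadId>", init.text).group(1)

    # 2. Upload parts concurrently -- the step a sequential resumable upload
    #    cannot do. Keep the ETag returned for each part.
    def put_part(args):
        number, data = args
        resp = requests.put(
            f"{BASE}/{bucket}/{name}",
            params={"partNumber": number, "uploadId": upload_id},
            data=data,
            headers=headers,
        )
        return number, resp.headers["ETag"]

    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        etags = sorted(pool.map(put_part, enumerate(parts, start=1)))

    # 3. Complete: send the ordered part list; the final object has an ETag but no MD5.
    body = "<CompleteMultipartUpload>" + "".join(
        f"<Part><PartNumber>{n}</PartNumber><ETag>{etag}</ETag></Part>" for n, etag in etags
    ) + "</CompleteMultipartUpload>"
    requests.post(f"{BASE}/{bucket}/{name}", params={"uploadId": upload_id}, data=body, headers=headers)
```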

None of the bucket limitations should affect DVC directly, so we could contribute support for this as optional behavior in gcsfs, but I'm not sure it's worth it for DVC given that it adds the requirement for non-standard IAM roles and that users may not be expecting the bucket limitations on files uploaded by DVC.

cc @dberenbaum

dberenbaum commented 8 months ago

Agreed that we can limit support to AWS and Azure for now.

pmrowla commented 8 months ago

The s3 changes have been merged upstream and will be available in the next s3fs release. I'll leave this issue open since this is still not addressed for Google Cloud Storage, but it should no longer be a p1.
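A usage sketch only, assuming the option added in fsspec/s3fs#848 is exposed as `max_concurrency` and controls how many parts of a single multipart upload run in parallel; the bucket and paths are made up, and DVC would pick this up through dvc-s3 once it pins the new s3fs release rather than being configured this way directly.

```python
import s3fs

# Concurrent part uploads within each large file (assumed option name).
fs = s3fs.S3FileSystem(max_concurrency=8)
fs.put_file("data/big.bin", "my-bucket/data/big.bin")
```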