nextflow-io / nextflow

A DSL for data-driven computational pipelines
http://nextflow.io
Apache License 2.0

Copy method does not check object size and chunksize #3113

Closed cjw85 closed 1 year ago

cjw85 commented 2 years ago

Multipart uploads may have up to 10000 parts: https://docs.aws.amazon.com/AmazonS3/latest/userguide/qfacts.html

The code here: S3FileSystemProvider.java#L549-L576 does not appear to check that the number of parts will exceed this limit given the known object size and chunk size.

Either the method should immediately throw an exception, or choose a suitable chunk size such that the copy can proceed without error.
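The two options above can be sketched as follows. This is a minimal illustration, not the actual `S3FileSystemProvider` code; the class and method names (`MultipartLimitCheck`, `checkOrFail`, `minChunkSize`) are hypothetical, and only the `MAX_PARTS = 10,000` limit comes from the S3 documentation linked above.

```java
import java.io.IOException;

class MultipartLimitCheck {
    // Hard S3 limit on the number of parts in a multipart upload/copy
    static final long MAX_PARTS = 10_000;

    // Number of parts needed to transfer objectSize bytes at chunkSize (ceiling division)
    static long partCount(long objectSize, long chunkSize) {
        return (objectSize + chunkSize - 1) / chunkSize;
    }

    // Option 1: fail fast, before any bytes are transferred
    static void checkOrFail(long objectSize, long chunkSize) throws IOException {
        long parts = partCount(objectSize, chunkSize);
        if (parts > MAX_PARTS)
            throw new IOException("Multipart copy would require " + parts
                    + " parts; S3 allows at most " + MAX_PARTS);
    }

    // Option 2: derive the smallest chunk size that keeps the part count within the limit
    static long minChunkSize(long objectSize) {
        return (objectSize + MAX_PARTS - 1) / MAX_PARTS;
    }
}
```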

pditommaso commented 2 years ago

Unfortunately, this cannot be done, because this setting is defined at the level of the transfer manager for all uploads.

https://github.com/nextflow-io/nextflow/blob/master/plugins/nf-amazon/src/main/com/upplication/s3fs/AmazonS3Client.java#L483-L483

Not aware of a way to define it independently for each transfer.

However, an early check ahead of the transfer, raising a clear error, looks like a good suggestion.

cjw85 commented 2 years ago

😬 Wow, that's bad!

So we simply have to set the chunk size to the maximum file size we might expect, divided by 10,000? The new default of 100 MB means we can transfer almost 1 TB, but at the expense of no multipart uploads for smaller files.
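The "almost 1 TB" figure follows directly from the fixed part limit: the largest object a fixed chunk size can cover is `chunkSize × 10,000`. A quick check of that arithmetic (class name hypothetical):

```java
class CopyCeiling {
    // Hard S3 limit on multipart parts
    static final long MAX_PARTS = 10_000;

    // Largest object transferable at a fixed chunk size
    static long maxObjectSize(long chunkSizeBytes) {
        return chunkSizeBytes * MAX_PARTS;
    }
}
```

With a 100 MiB chunk this gives 100 × 1024 × 1024 × 10,000 = 1,048,576,000,000 bytes, i.e. just under 1 TiB.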

pditommaso commented 2 years ago

That's how the AWS SDK works.

bentsherman commented 2 years ago

I see there are two versions of S3FileSystemProvider, one in the nf-amazon plugin and another one in the nextflow-s3fs repo. But as far as I can tell the second one is being used. Paolo can you explain how this works? Want to make sure I understand correctly before I go make a PR on the other repo.

Also, I'm wondering whether this warning will also catch directory uploads, i.e. does the client also use this method to upload each file in a directory?

pditommaso commented 2 years ago

The https://github.com/nextflow-io/nextflow-s3fs repo has been incorporated into the nf-amazon plugin as of the 22.x series. Therefore the latter is the one to be modified.

Directory upload is essentially a directory listing, followed by each file upload.

bentsherman commented 2 years ago

Okay I understand. It looks like S3MultipartOptions already performs this check while computing the chunk size:

https://github.com/nextflow-io/nextflow/blob/903a082befa9fd1042f421b494be877432e040af/plugins/nf-amazon/src/main/com/upplication/s3fs/util/S3MultipartOptions.java#L81-L89

Seems kind of pointless to make this adjustment if it's going to fail anyway. I think we should move this check to the code that actually performs the multipart copy.
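The kind of adjustment referenced above (growing the chunk size until the object fits within the part limit) can be sketched like this. This is an illustration of the general pattern, not the actual `S3MultipartOptions` code; the class and method names are hypothetical.

```java
class ChunkSizeAdjust {
    // Hard S3 limit on multipart parts
    static final long MAX_PARTS = 10_000;

    // Grow the chunk size (doubling) until the object fits within the part limit
    static long adjustedChunkSize(long objectSize, long initialChunkSize) {
        long chunkSize = initialChunkSize;
        while ((objectSize + chunkSize - 1) / chunkSize > MAX_PARTS)
            chunkSize *= 2;
        return chunkSize;
    }
}
```

The point of the comment above is that performing this adjustment in one place while the transfer itself uses an unadjusted chunk size elsewhere would still let the copy fail at runtime.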

pditommaso commented 2 years ago

Not sure this is still relevant. In recent versions the upload is managed by this:

https://github.com/nextflow-io/nextflow/blob/0ca21170672f2e2e81a3f13dae35f8d2d399c381/plugins/nf-amazon/src/main/com/upplication/s3fs/AmazonS3Client.java#L561-L575

and the chunk size is declared here:

https://github.com/nextflow-io/nextflow/blob/0ca21170672f2e2e81a3f13dae35f8d2d399c381/plugins/nf-amazon/src/main/com/upplication/s3fs/AmazonS3Client.java#L483-L483

cjw85 commented 2 years ago

Do you not still have the issue that the transfer manager is constructed before the files are examined, so it cannot know what an appropriate chunk size should be?

bentsherman commented 2 years ago

I think I was looking at the code for multipart copy, which actually doesn't seem to have any problem because it just increases the chunk size if it's too small. I wonder if the transfer manager is doing the same thing under the hood for multipart upload.

@cjw85 I think if you try the latest edge release you won't see this problem anymore.

cjw85 commented 2 years ago

Ooh, thanks. Yeah, I was originally using an older version. It was Paolo's comment that seemed to suggest the chunk size was fixed. When I scanned the current repo I did see code that seemed to calculate the chunk size based on the file size, so I was a little confused.

pditommaso commented 1 year ago

This has been improved in version 22.10.x