nextflow-io / nextflow

A DSL for data-driven computational pipelines
http://nextflow.io
Apache License 2.0
2.71k stars 620 forks

Ability to publish to S3 buckets with Object Lock by including md5 checksum. #5347

Open robsyme opened 5 days ago

robsyme commented 5 days ago

New feature

AWS S3 Object Lock is a feature in Amazon Simple Storage Service (S3) that allows you to protect objects from being deleted or overwritten for a specified retention period. Some orgs will use Object Lock to meet data retention or regulatory requirements.

A PutObject call to a bucket with Object Lock requires an extra integrity header, either Content-MD5 or one of the x-amz-checksum- headers (docs). Currently, publishing an object to a bucket with Object Lock returns an error:

Error executing process > 'MYPROCESS (1)'

Caused by:
  Content-MD5 OR x-amz-checksum- HTTP header is required for Put Object requests with Object Lock parameters (Service: Amazon S3; Status Code: 400; Error Code: InvalidRequest; Request ID: XXXX; S3 Extended Request ID: XXXX; Proxy: null)
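
For reference, the Content-MD5 value is the base64 encoding of the raw 128-bit MD5 digest of the request body, not the hex string. Here is a minimal sketch of computing it for a local file in plain Python. The function name `content_md5` and the chunk size are my own choices for illustration, not anything from Nextflow or the AWS SDK:

```python
import base64
import hashlib


def content_md5(path, chunk_size=8 * 1024 * 1024):
    """Return the base64-encoded MD5 digest of a file (hypothetical helper).

    The file is read in chunks so large files do not have to fit in memory.
    """
    md5 = hashlib.md5()
    with open(path, "rb") as fh:
        # iter() with a sentinel keeps reading until read() returns b""
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            md5.update(chunk)
    # Content-MD5 carries the base64 of the binary digest, not the hex digest
    return base64.b64encode(md5.digest()).decode("ascii")
```

The returned string is what would go into the Content-MD5 header of the PutObject request.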

Usage scenario

I'd like to publish a file to a bucket that has Object Lock turned on.

Suggest implementation

For large multipart uploads (often the case with Nextflow published files), the ETag does not represent the MD5 digest of the whole object. Nextflow will therefore likely need to download and hash the object itself so that the Content-MD5 header can be attached to the publication request.
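
To illustrate why the multipart ETag can't stand in for the MD5: as I understand the AWS docs, S3 takes the MD5 of each part, concatenates the binary digests, takes the MD5 of that, and appends a dash plus the part count. A rough sketch, assuming fixed-size parts and non-empty data (the function name `multipart_etag` and the part size are hypothetical, and the real part size depends on the uploader's settings):

```python
import hashlib


def multipart_etag(data: bytes, part_size: int) -> str:
    """Approximate the ETag S3 reports for a multipart upload (illustrative only)."""
    # Split the payload the way an uploader with a fixed part size would
    parts = [data[i:i + part_size] for i in range(0, len(data), part_size)]
    # Concatenate the *binary* MD5 digests of the individual parts
    joined = b"".join(hashlib.md5(part).digest() for part in parts)
    # MD5 of the concatenation, then "-<number of parts>"
    return f"{hashlib.md5(joined).hexdigest()}-{len(parts)}"


def whole_object_md5(data: bytes) -> str:
    """Hex MD5 of the full payload, i.e. what a single-part ETag would be."""
    return hashlib.md5(data).hexdigest()
```

For the same bytes, `multipart_etag(data, part_size)` and `whole_object_md5(data)` give different values, and the first one changes with the part size, so the whole-object digest can't be recovered from it.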

bentsherman commented 4 days ago

What a mess. S3 needs to just accept the ETag for this

robsyme commented 4 days ago

100% agree. It's a little infuriating that a copy from one bucket to another requires a separate digest.