nextflow-io / nextflow

A DSL for data-driven computational pipelines
http://nextflow.io
Apache License 2.0

Recursively downloading S3 directory restored from Glacier doesn't work #4747

Open Mridul-Chaudhary opened 8 months ago

Mridul-Chaudhary commented 8 months ago

New feature

For the rnasplice pipeline, I wanted to provide my previous salmon output as an input, as suggested here.

So I prepared an input samplesheet like this:

sample,condition,salmon_results
xyz,case,s3://bucket/path/salmon/xyz

where the xyz directory is in the Glacier storage class but has been restored:

xyz/
├── aux_info
│   ├── ambig_info.tsv
│   ├── expected_bias.gz
│   ├── fld.gz
│   ├── meta_info.json
│   ├── observed_bias_3p.gz
│   └── observed_bias.gz
├── cmd_info.json
├── lib_format_counts.json
├── libParams
│   └── flenDist.txt
├── logs
│   └── salmon_quant.log
├── quant.genes.sf
└── quant.sf

Each of the files inside xyz has been individually restored and tested for download. Downloading them individually does not require the --force-glacier-transfer option.

But my samplesheet above forces the nxf_s3_download() function to attempt a recursive download of the xyz directory, which does not work without specifying the --force-glacier-transfer option.

Example: aws s3 cp s3://bucket/path/salmon/xyz ./xyz/ --recursive does not work and produces this error:

warning: Skipping file s3://bucket/path/salmon/xyz/aux_info/ambig_info.tsv. Object is of storage class GLACIER. Unable to perform download operations on GLACIER objects. You must restore the object to be able to perform the operation. See aws s3 download help for additional parameter options to ignore or force these transfers

aws s3 cp s3://bucket/path/salmon/xyz ./xyz/ --recursive --force-glacier-transfer works

Usage scenario

Provide restored glacier directory paths as input

Suggest implementation

Similar to a change made here, where --quiet was changed to --only-show-errors, the --force-glacier-transfer option could be made the default when the S3 object is not in the standard storage class.

Or, introduce an optional setting similar to aws.client.glacierAutoRetrieval, say aws.client.glacierForceTransfer, which when true would add --force-glacier-transfer to the nxf_s3_download() function.

bentsherman commented 8 months ago

Hi @ludirm93, we do not plan to support automatic Glacier restore, and we have removed the glacierAutoRetrieval option. See https://github.com/nextflow-io/nextflow/issues/4511#issuecomment-1811881428 for the discussion on this point.

Instead, I recommend that you write a custom process to restore any Glacier objects all at once, to minimize the time spent waiting for the Glacier restore.
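
For anyone looking for a starting point, a rough and untested sketch of such a process is shown below. The bucket, prefix, restore window, tier, and polling interval are all placeholder assumptions; pagination, error handling, and keys containing whitespace are not handled.

process RESTORE_FROM_GLACIER {
    input:
    tuple val(bucket), val(prefix)

    output:
    val(prefix)

    script:
    """
    # issue a restore request for every object under the prefix
    # (|| true skips objects that are not archived or are already being restored)
    for key in \$(aws s3api list-objects-v2 --bucket ${bucket} --prefix ${prefix} --query 'Contents[].Key' --output text); do
        aws s3api restore-object --bucket ${bucket} --key "\$key" \\
            --restore-request '{"Days":7,"GlacierJobParameters":{"Tier":"Standard"}}' || true
    done

    # poll until each restore reports completion (assumes every object was archived)
    for key in \$(aws s3api list-objects-v2 --bucket ${bucket} --prefix ${prefix} --query 'Contents[].Key' --output text); do
        until aws s3api head-object --bucket ${bucket} --key "\$key" --query Restore --output text | grep -q 'ongoing-request="false"'; do
            sleep 300
        done
    done
    """
}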

Mridul-Chaudhary commented 8 months ago

Hi @bentsherman, thank you for your response! Maybe I was not able to convey my request clearly. The S3 files have already been restored by me manually.

Nextflow is not able to download even the pre-restored files, because the recursive download in the nxf_s3_download() function would require the AWS CLI option --force-glacier-transfer.

Worst case, I would need to make a duplicate copy of my data in S3 and keep it in standard storage

bentsherman commented 8 months ago

That is strange, it should be able to download the S3 object like normal. How exactly did you restore it from Glacier?

Mridul-Chaudhary commented 8 months ago

For the salmon output folder tree shown above, I individually executed the restore-object command for each file:

aws s3api restore-object --bucket bucket --key path/salmon/xyz/aux_info/ambig_info.tsv --restore-request '{"Days":17,"GlacierJobParameters":{"Tier":"Standard"}}'

When I try to download the files outside of Nextflow, to test whether they are downloadable, I can download each of the files by using its complete key:

aws s3 cp s3://bucket/path/salmon/xyz/quant.sf ./

I need to provide the entire salmon output directory xyz as an input.

aws s3 cp s3://bucket/path/salmon/xyz/ ./ --recursive does not work.

AWS proposes adding the --force-glacier-transfer option for recursive downloads of a prefix.

The nxf_s3_download() function used to get these restored files is shown below:

nxf_s3_download() {
    local source=$1
    local target=$2
    local file_name=$(basename $1)
    local is_dir=$(/home/ec2-user/miniconda/bin/aws --region eu-west-1 s3 ls $source | grep -F "PRE ${file_name}/" -c)
    if [[ $is_dir == 1 ]]; then
        /home/ec2-user/miniconda/bin/aws --region eu-west-1 s3 cp --only-show-errors --recursive "$source" "$target"
    else
        /home/ec2-user/miniconda/bin/aws --region eu-west-1 s3 cp --only-show-errors "$source" "$target"
    fi
}

It recognizes my input path as a directory and thus uses the recursive download option, which in this case won't work because it needs --force-glacier-transfer.

I do not know if there is a way to pass this option to nxf_s3_download().
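
For reference, the change being requested here would essentially amount to adding the flag to the recursive branch of that function, roughly like this (an illustrative sketch, not the actual Nextflow code):

# sketch only: recursive branch of nxf_s3_download() with the proposed flag added
if [[ $is_dir == 1 ]]; then
    /home/ec2-user/miniconda/bin/aws --region eu-west-1 s3 cp --only-show-errors --recursive --force-glacier-transfer "$source" "$target"
else
    /home/ec2-user/miniconda/bin/aws --region eu-west-1 s3 cp --only-show-errors "$source" "$target"
fi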

Mridul-Chaudhary commented 8 months ago

Additionally, listing a restored directory recursively works without --force-glacier-transfer:

aws s3 ls s3://bucket/path/salmon/xyz/ --recursive

So another solution, without passing --force-glacier-transfer, could be: when is_dir is 1, list all of the files under the prefix and then download each of them individually.
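
For illustration, a rough and untested sketch of that idea, using the bucket and prefix from the example above (it assumes keys contain no whitespace and ignores pagination):

bucket=bucket
prefix=path/salmon/xyz/

# list every key under the prefix (listing works even for Glacier objects) ...
aws s3api list-objects-v2 --bucket "$bucket" --prefix "$prefix" \
    --query 'Contents[].Key' --output text | tr '\t' '\n' |
while read -r key; do
    # ... then fetch each object with a single-object copy, which succeeds on
    # restored objects without --force-glacier-transfer
    aws s3 cp "s3://$bucket/$key" "./xyz/${key#$prefix}"
done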

bentsherman commented 8 months ago

So this problem only happens when recursively downloading a directory, not with individual files.

Mridul-Chaudhary commented 8 months ago

Thanks for re-opening the issue. Yes, that is correct. This would affect pipelines or modules that need an entire directory as input. For me, it was the RNAsplice pipeline with salmon results as input.

bentsherman commented 8 months ago

It seems that if the AWS CLI can recursively list the directory but not download it, that is a bug in the AWS CLI.

In the meantime, you can work around this issue by specifying the files in the directory as inputs rather than the directory itself. It's slightly more verbose but much better in the long run.

MrOlm commented 6 months ago

@bentsherman - it's not quite a bug (it's how they intend it to behave: https://docs.aws.amazon.com/cli/latest/reference/s3/sync.html), but I agree it's a really strange implementation choice.

Just wanted to say I'm hitting the exact same issue as @Mridul-Chaudhary, but it's not quite clear to me how to switch from using directories to listing all the files in a directory and using those instead.

Thanks, Matt

bentsherman commented 6 months ago

There is a way, but it ain't pretty if the input directory is deeply nested. Instead of providing the directory as a single input, you can enumerate all of the files in the directory so that they are downloaded individually instead of with a single recursive download.
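
For the case where the directory is not produced by an upstream process (e.g. an S3 path coming from a samplesheet, as in the OP), a rough and untested way to build that file list is a recursive glob on the prefix (bucket and path taken from the example above):

// enumerate every object under the prefix instead of passing the directory itself,
// so each file is fetched with a single-object copy rather than a recursive one
salmon_files = Channel
    .fromPath('s3://bucket/path/salmon/xyz/**')
    .collect()

Note that staging those files as individual inputs flattens the original directory layout unless the consuming process declares them under subdirectories again, which is where the verbosity comes in.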

I can understand the AWS CLI not wanting to recursively restore Glacier objects, but I don't understand why they are restricting objects that are already restored. Those objects should behave like regular objects

MrOlm commented 6 months ago

Totally agree with you on the AWS CLI implementation, and I was as surprised as you are to find they implemented it this way, but it is what it is.

Do you have any example code you could maybe point me to around this workaround? I'm just not sure how it would work: 1) generating a list of these files in Nextflow code, and 2) passing this list of files of variable length and order to a Nextflow process.

Thanks! Matt

bentsherman commented 6 months ago

It's easiest if the directory is produced by a process, which I assume is the case.

Using the OP's directory structure, you would normally declare it like this:

output:
path('xyz')

Instead, you must declare it as a tuple of paths, one for each group of files:

output:
tuple path('xyz/aux_info/*'), path('xyz/libParams/*'), path('xyz/logs/*'), path('xyz/*')

Though even that might not be specific enough. Given the wide variety of file types, you might need to declare each file individually 🤢

In general I don't recommend declaring entire directories as a single output as shown in the first example, because it doesn't tell you anything about the directory contents. I realize it's more convenient though. We might be able to get around all of this with #4553, which would allow you to define a record type that enumerates all of those files; then you just declare the record as an input/output and the files are staged individually, which happens to solve this issue.

record Xyz {
  XyzAuxInfo aux_info
  Path cmd_info
  Path lib_format_counts
  // ...
}

record XyzAuxInfo {
  Path ambig_info
  Path expected_bias
  // ...
}

process foo {
  input:
  Xyz xyz

  script:
  """
  cat ${xyz.aux_info.ambig_info}
  """
}

This is aspirational, not something that can be done today, just wanted to note it here for future reference.

MrOlm commented 6 months ago

Thanks @bentsherman! That makes sense to me.

I also just want to point out to @Mridul-Chaudhary and others who may have this issue that there's another workaround that works as well. After restoring the files, you can remove them from Glacier entirely using a command like:

$ aws s3 cp s3://sonn-current/instrain_profile/v1/ s3://sonn-current/instrain_profile/v1/ --storage-class STANDARD --recursive --force-glacier-transfer

After running that command, the files are entirely removed from Glacier and put into standard storage, meaning the next time you want to download them you don't need the --force-glacier-transfer flag (which also means they'll work with Nextflow, which doesn't use that flag).

k1sauce commented 1 week ago

I am also not a fan of the way this was implemented in the AWS CLI (good luck getting any changes made there). If you follow the discussion in this thread (https://github.com/aws/aws-cli/issues/1699#issuecomment-171394505), it at least explains why they ended up implementing the --force-glacier-transfer flag.

In my case, I am running demultiplexing on an entire sequencing run data directory. The sequencing data sat in S3 Glacier storage for a while, and I restored it because I needed to run demux on it again. The way nxf_s3_download() is written means that I cannot download the directory, and I get the Glacier error.

I think that, given the design decision of the AWS CLI, the right thing to do here is to implement the --force-glacier-transfer flag as this feature request specifies.

Thanks

k1sauce commented 1 week ago

My preference is that it is implemented in the AWS scope. That way it is a simple configuration change. The reason I don't think it should be the default is the same reason the --force-glacier-transfer flag was implemented to begin with; that is, it requires a HeadObject request versus a GetObject request.

I like @Mridul-Chaudhary's suggestion here: "Introduce an optional setting similar to aws.client.glacierAutoRetrieval, say aws.client.glacierForceTransfer, which when true would add --force-glacier-transfer to the nxf_s3_download() function."
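
To make the proposal concrete, the configuration might look something like this; note that aws.client.glacierForceTransfer is hypothetical and does not exist in Nextflow today:

// nextflow.config -- hypothetical option, shown only to illustrate the proposal
aws {
    client {
        glacierForceTransfer = true   // would add --force-glacier-transfer to the aws s3 cp calls in nxf_s3_download()
    }
}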