nextflow-io / nextflow-s3fs

An S3 File System Provider for Java 7 (project archived)
Apache License 2.0

File access to public s3 bucket fails when checkIfExists: true #14

Open sb43 opened 5 years ago

sb43 commented 5 years ago

Also raised on: https://github.com/nextflow-io/nextflow/issues/1055

Bug report

Expected behavior and actual behavior

testfile = Channel.fromPath("s3://ref/test/genome.fa.fai", checkIfExists: true)
Mar-01 11:56:36.524 [PathVisitor-1] ERROR nextflow.Channel - null (Service: Amazon S3; Status Code: 403; Error Code: AccessDenied; Request ID:-XXXXXX -default; S3 Extended Request ID: 144c2bfe-default-default)

With a private S3 bucket, file access works with both checkIfExists: true and checkIfExists: false. With this public bucket, it only works when checkIfExists: false.

Steps to reproduce the problem

cat test.nf
testfile = Channel.fromPath("s3://ref/test/genome.fa.fai", checkIfExists: true)
nextflow run test.nf

Program output

Mar-01 12:28:14.810 [PathVisitor-1] ERROR nextflow.Channel - null (Service: Amazon S3; Status Code: 403; Error Code: AccessDenied; Request ID: tx000000000000003e8cd66-005c79255e-13d7a414-default; S3 Extended Request ID: 13d7a414-default-default)
com.amazonaws.services.s3.model.AmazonS3Exception: null (Service: Amazon S3; Status Code: 403; Error Code: AccessDenied; Request ID: tx000000000000003e8cd66-005c79255e-13d7a414-default; S3 Extended Request ID: 13d7a414-default-default)


pditommaso commented 5 years ago

Not sure I understand what the problem is here. The AccessDenied error signals a permission problem.

sb43 commented 5 years ago

When using a private s3 bucket, file access works with both:

Channel.fromPath("s3://ref/test/genome.fa.fai", checkIfExists: true)
Channel.fromPath("s3://ref/test/genome.fa.fai", checkIfExists: false)

However, when using a public bucket it only works with:

Channel.fromPath("s3://ref/test/genome.fa.fai", checkIfExists: false)

pditommaso commented 5 years ago

I see, thanks. Do you have a real public S3 URI to test with?

sb43 commented 5 years ago

Yes, the one in the example is a public URL: s3://ref/test/genome.fa.fai

pditommaso commented 5 years ago

Doesn't look like it:

$ aws s3 cp s3://ref/test/genome.fa.fai - 
download failed: s3://ref/test/genome.fa.fai to - An error occurred (403) when calling the HeadObject operation: Forbidden

sb43 commented 5 years ago

Sorry @pditommaso, please use the following endpoint, as this bucket is on COG S3 storage:

aws s3 cp s3://ref/test/genome.fa.fai . --endpoint-url=https://cog.sanger.ac.uk/

sb43 commented 5 years ago

To access it using Nextflow, you can add the following lines to the nextflow.config file:

aws {
    client {
        endpoint = "https://cog.sanger.ac.uk/"
        signerOverride = "S3SignerType"
    }
}

pditommaso commented 5 years ago

I'm getting this

$ aws s3 cp s3://ref/test/genome.fa.fai . --endpoint-url=https://cog.sanger.ac.uk/
fatal error: An error occurred (403) when calling the HeadObject operation: Forbidden

sb43 commented 5 years ago

Sorry for the delay in replying. You can access the public bucket using the following command:

aws s3 cp s3://ref/test/genome.fa.fai - --endpoint-url=https://cog.sanger.ac.uk/ --no-sign-request
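
For reference, the SDK-level analogue of --no-sign-request is an anonymous credentials provider. A minimal Groovy sketch against the AWS Java SDK v1 (assuming aws-java-sdk-s3 1.11+ on the classpath; this is not the nextflow-s3fs code itself):

import com.amazonaws.auth.AnonymousAWSCredentials
import com.amazonaws.auth.AWSStaticCredentialsProvider
import com.amazonaws.client.builder.AwsClientBuilder
import com.amazonaws.services.s3.AmazonS3ClientBuilder

// build an unsigned (anonymous) client pointed at the Sanger endpoint
def s3 = AmazonS3ClientBuilder.standard()
        .withCredentials(new AWSStaticCredentialsProvider(new AnonymousAWSCredentials()))
        .withEndpointConfiguration(
            // the signing region is a placeholder; the custom endpoint is what matters
            new AwsClientBuilder.EndpointConfiguration('https://cog.sanger.ac.uk/', 'us-east-1'))
        .withPathStyleAccessEnabled(true)   // often needed for non-AWS endpoints
        .build()

// the same HeadObject request that checkIfExists triggers, but unsigned --
// it succeeds on a public object where a badly-signed request gets a 403
assert s3.doesObjectExist('ref', 'test/genome.fa.fai')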

sb43 commented 5 years ago

We possibly need to add a new aws.client.noSignRequest flag in the client configuration: https://www.nextflow.io/docs/latest/config.html#config-aws
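
A hypothetical sketch of how that could look in nextflow.config (noSignRequest is only a proposed name here, not an existing option):

aws {
    client {
        endpoint = "https://cog.sanger.ac.uk/"
        noSignRequest = true   // proposed option, does not exist yet
    }
}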

shaze commented 4 years ago

I've found a similar issue. Here's my MWE. The bucket h3agwas is a public bucket. What I'm trying to do is have a public bucket so that a user of the workflow without Amazon credentials can run a quickstart demo.

inpat = "s3://h3agwas/sampleA"
datach = Channel.fromFilePairs("${inpat}.{bed,bim,fam}", size: 3, flat: true) { file -> file.baseName }
datach.subscribe { println it[1].exists() };
  1. If I run this with a valid ~/.aws/config it works fine.
  2. If I remove the config I get the permission error reported at the top.
  3. If I change the last line to datach.subscribe { println it[1] }, it works fine (that is, it works without AWS credentials).

It seems like there's a problem with the exists() function if you don't have credentials, even if the bucket is fully public. The file itself is accessible in the workflow. So, for example, replacing the last line with

process count {
  input:
    set val(x), file(bed), file(bim), file(fam) from datach
  output:
    stdout into see
  script:
    """
    wc -l ${bim}
    """
}

works fine without any credentials.

A viable, though not ideal, workaround for me is to change my workflow so that if an S3 bucket is specified as the source we don't check that the files exist (even though the check would work if there were credentials). Perhaps there is some config in the S3 bucket that exists() relies on?
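
A minimal sketch of that workaround (assuming fromFilePairs accepts the checkIfExists option like fromPath does; inpat and datach as in the MWE above):

// skip the existence check when the input comes from an S3 bucket
def isS3 = inpat.startsWith('s3://')
datach = Channel.fromFilePairs("${inpat}.{bed,bim,fam}",
                               size: 3, flat: true,
                               checkIfExists: !isS3) { file -> file.baseName }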