nextflow-io / nextflow

A DSL for data-driven computational pipelines
http://nextflow.io
Apache License 2.0

Can't list files from s3 #1128

Closed olgabot closed 5 years ago

olgabot commented 5 years ago

Bug report

Expected behavior and actual behavior

Hello, I'm trying to list the files in a particular S3 directory, but Nextflow and the awscli show different results. Here is what I see using the awscli:

$ aws s3 ls s3://tick-genome/dna/2018-06-28/
                           PRE adapter_trimmed/
                           PRE adapter_trimmed_test/
                           PRE assemble_masurca/
                           PRE assemble_spades_test_v1/
                           PRE corrected_reads_test_datasets/
                           PRE multiqc/
                           PRE pre-assembly_qc_filter_test/
                           PRE pre-assembly_qc_filter_test_v2/
                           PRE pre-assembly_qc_filter_test_v2_multiqc/
                           PRE pre-assembly_qc_filter_test_v3/
                           PRE pre-assembly_qc_filter_test_v3_multiqc/
                           PRE pre-assembly_qc_filter_test_v4/
                           PRE pre-assembly_qc_filter_test_v5/
                           PRE pre-assembly_qc_filter_test_v6/
                           PRE pre-assembly_qc_filter_test_v6_multiqc/
                           PRE pre-assembly_quality_control/
                           PRE pre-assembly_v3/
2018-07-12 15:09:15     248345 Undetermined_S0_R1_stdin_fastqc.html
2018-07-12 15:09:15     579828 Undetermined_S0_R1_stdin_fastqc.zip
2018-07-12 15:08:09     252133 Undetermined_S0_R2_stdin_fastqc.html
2018-07-12 15:08:09     585457 Undetermined_S0_R2_stdin_fastqc.zip
2018-07-16 11:35:37  361957623 adapter_trimmed
2018-07-11 14:45:53   69238757 tick_1_S1_R1_001_first1Mreads.fastq.gz
2019-01-29 08:25:47   70900884 tick_1_S1_R1_post-trimming_first1Mreads.fastq.gz
2018-07-11 21:20:37     264494 tick_1_S1_R2
2018-07-11 14:45:53   72211181 tick_1_S1_R2_001_first1Mreads.fastq.gz
2019-01-29 08:26:27   74471726 tick_1_S1_R2_post-trimming_first1Mreads.fastq.gz
2018-07-12 14:42:17     264494 tick_1_S1_R2_stdin_fastqc.html
2018-07-12 14:42:17     603876 tick_1_S1_R2_stdin_fastqc.zip

But running this workflow, which should recursively list all files in the directory, lists only the parent directory:

Channel
  .fromPath("s3://tick-genome/dna/2018-06-28/**", type: 'any')
  .println()

It produces only the output below, showing just the parent folder /tick-genome/dna/2018-06-28 when it should list all files recursively.

$ make scratch
nextflow run scratch.nf -e.process.executor=local \
        -dump-channels \
        -profile none \
        -e.aws.region=us-west-2
N E X T F L O W  ~  version 19.03.0-edge
Launching `scratch.nf` [dreamy_bose] - revision: a25a334eb9
/tick-genome/dna/2018-06-28
Completed at: 24-Apr-2019 08:47:42
Duration    : 2.4s
CPU hours   : (a few seconds)
Succeeded   : 0

Originally, I was trying to get the *{1,2}_001_first1Mreads.fastq.gz files but that channel was completely empty, i.e. adding these lines:

Channel
  .fromFilePairs("s3://tick-genome/dna/2018-06-28/*_R{1,2}_*.fastq.gz", type: 'any')
  .println()

Produces nearly the same output, though for some reason now complains about fastqc:

$ nextflow run scratch.nf
N E X T F L O W  ~  version 19.03.0-edge
Launching `scratch.nf` [compassionate_pike] - revision: a31f4e36c6
WARN: There's no process matching config selector: fastqc
/tick-genome/dna/2018-06-28
Completed at: 24-Apr-2019 08:55:12
Duration    : 2.3s
CPU hours   : (a few seconds)
Succeeded   : 0

Steps to reproduce the problem

Here is a self-contained nextflow script to reproduce the problem

Channel
  .fromFilePairs("s3://tick-genome/dna/2018-06-28/*_R{1,2}_*.fastq.gz", type: 'any')
  .println()

Program output

```
Apr-24 08:54:59.138 [main] DEBUG nextflow.cli.Launcher - $> nextflow run scratch.nf
Apr-24 08:54:59.336 [main] INFO nextflow.cli.CmdRun - N E X T F L O W ~ version 19.03.0-edge
Apr-24 08:54:59.350 [main] INFO nextflow.cli.CmdRun - Launching `scratch.nf` [compassionate_pike] - revision: a31f4e36c6
Apr-24 08:54:59.369 [main] DEBUG nextflow.config.ConfigBuilder - Found config home: /Users/olgabot/.nextflow/config
Apr-24 08:54:59.370 [main] DEBUG nextflow.config.ConfigBuilder - Found config local: /Users/olgabot/code/nf-large-assembly/nextflow.config
Apr-24 08:54:59.371 [main] DEBUG nextflow.config.ConfigBuilder - Parsing config file: /Users/olgabot/.nextflow/config
Apr-24 08:54:59.371 [main] DEBUG nextflow.config.ConfigBuilder - Parsing config file: /Users/olgabot/code/nf-large-assembly/nextflow.config
Apr-24 08:54:59.412 [main] DEBUG nextflow.config.ConfigBuilder - Applying config profile: `standard`
Apr-24 08:54:59.999 [main] DEBUG nextflow.config.ConfigBuilder - Applying config profile: `standard`
Apr-24 08:55:00.282 [main] DEBUG nextflow.Session - Session uuid: c47a1a86-3412-4e24-91a5-b71e85f74a59
Apr-24 08:55:00.282 [main] DEBUG nextflow.Session - Run name: compassionate_pike
Apr-24 08:55:00.283 [main] DEBUG nextflow.Session - Executor pool size: 8
Apr-24 08:55:10.321 [main] DEBUG nextflow.cli.CmdRun -
  Version: 19.03.0-edge build 5061
  Modified: 14-03-2019 17:26 UTC (10:26 PDT)
  System: Mac OS X 10.14.4
  Runtime: Groovy 2.5.6 on Java HotSpot(TM) 64-Bit Server VM 1.8.0_131-b11
  Encoding: UTF-8 (UTF-8)
  Process: 42996@Olgas-MacBook-Pro.local [192.168.1.16]
  CPUs: 8 - Mem: 16 GB (4.3 GB) - Swap: 4 GB (1.3 GB)
Apr-24 08:55:10.344 [main] DEBUG nextflow.Session - Work-dir: /Users/olgabot/code/nf-large-assembly/work [Mac OS X]
Apr-24 08:55:10.596 [main] DEBUG nextflow.Session - Session start invoked
Apr-24 08:55:10.600 [main] DEBUG nextflow.processor.TaskDispatcher - Dispatcher > start
Apr-24 08:55:10.600 [main] DEBUG nextflow.trace.TraceFileObserver - Flow starting -- trace file: /Users/olgabot/code/nf-large-assembly/test-output/pipeline_info/nf-core/nf-large-assembly_trace.txt
Apr-24 08:55:10.610 [main] DEBUG nextflow.script.ScriptRunner - > Script parsing
Apr-24 08:55:10.705 [main] WARN nextflow.Session - There's no process matching config selector: fastqc
Apr-24 08:55:10.706 [main] DEBUG nextflow.script.ScriptRunner - > Launching execution
Apr-24 08:55:10.754 [PathVisitor-1] DEBUG nextflow.file.FileHelper - Creating a file system instance for provider: S3FileSystemProvider
Apr-24 08:55:10.767 [PathVisitor-1] DEBUG nextflow.Global - Using AWS credential defined in `default` section in file: /Users/olgabot/.aws/credentials
Apr-24 08:55:10.772 [PathVisitor-1] DEBUG nextflow.file.FileHelper - AWS S3 config details: {secret_key=Zl6hSJ.., max_connections=20, upload_storage_class=INTELLIGENT_TIERING, storage_encryption=AES256, access_key=AKIAI2.., region=us-west-2, connection_timeout=10000}
Apr-24 08:55:10.822 [main] DEBUG nextflow.script.ScriptRunner - > Await termination
Apr-24 08:55:10.823 [main] DEBUG nextflow.Session - Session await
Apr-24 08:55:10.823 [main] DEBUG nextflow.Session - Session await > all process finished
Apr-24 08:55:10.823 [main] DEBUG nextflow.Session - Session await > all barriers passed
Apr-24 08:55:11.182 [PathVisitor-1] DEBUG nextflow.file.PathVisitor - files for syntax: glob; folder: /tick-genome/dna/2018-06-28/; pattern: **; options: [type:any]
Apr-24 08:55:11.182 [PathVisitor-2] DEBUG nextflow.file.PathVisitor - files for syntax: glob; folder: /tick-genome/dna/2018-06-28/; pattern: *{1,2}_001_first1Mreads.fastq.gz; options: [type:any]
Apr-24 08:55:11.186 [PathVisitor-1] DEBUG nextflow.file.FileHelper - Path matcher not defined by 'S3FileSystem' file system -- using default default strategy
Apr-24 08:55:11.186 [PathVisitor-2] DEBUG nextflow.file.FileHelper - Path matcher not defined by 'S3FileSystem' file system -- using default default strategy
Apr-24 08:55:11.833 [main] DEBUG nextflow.trace.StatsObserver - Workflow completed > WorkflowStats[succeedCount=0; failedCount=0; ignoredCount=0; cachedCount=0; succeedDuration=0ms; failedDuration=0ms; cachedDuration=0ms]
Apr-24 08:55:11.833 [main] DEBUG nextflow.trace.TraceFileObserver - Flow completing -- flushing trace file
Apr-24 08:55:11.836 [main] DEBUG nextflow.trace.ReportObserver - Flow completing -- rendering html report
Apr-24 08:55:11.840 [main] DEBUG nextflow.trace.ReportObserver - Execution report summary data: {}
Apr-24 08:55:12.856 [main] DEBUG nextflow.trace.TimelineObserver - Flow completing -- rendering html timeline
Apr-24 08:55:13.062 [main] DEBUG nextflow.CacheDB - Closing CacheDB done
Apr-24 08:55:13.068 [main] DEBUG nextflow.Session - AWS S3 uploader shutdown
Apr-24 08:55:13.095 [main] DEBUG nextflow.script.ScriptRunner - > Execution complete -- Goodbye
```

Environment

Additional context

There seems to be something wrong with this bucket or folder, as I'm able to list objects in other buckets. However, I've made this folder publicly viewable and the bucket policy is quite permissive:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AddPerm",
            "Effect": "Allow",
            "Principal": "*",
            "Action": [
                "s3:Get*",
                "s3:List*"
            ],
            "Resource": [
                "arn:aws:s3:::tick-genome/*",
                "arn:aws:s3:::tick-genome"
            ]
        }
    ]
}

This seems related to this: https://github.com/nextflow-io/nextflow/issues/1121

pditommaso commented 5 years ago

Listing the parent folder shows two 2018-06-28 entries, which looks suspicious:

/tick-genome/dna/2018-06-28
/tick-genome/dna/2018-06-28
/tick-genome/dna/2018-07-26-pacbio
/tick-genome/dna/2018-08-03-pacbio-raw
/tick-genome/dna/2018-08-16-sanger
/tick-genome/dna/2018-10-10-dovetail
/tick-genome/dna/2018-10-11_ise6_asm2.2
/tick-genome/dna/2018-12-03_IscaW1
/tick-genome/dna/2018-12-03_quast
/tick-genome/dna/tick_pacbio_20180813

olgabot commented 5 years ago

Ah I think I had accidentally saved an object as /tick-genome/dna/2018-06-28 (no final slash) instead of /tick-genome/dna/2018-06-28/ (with a final slash). Maybe that's causing the error?
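
For readers hitting the same symptom: S3 has no real directories, only keys in a flat namespace, so an object named `dna/2018-06-28` can coexist with keys under the prefix `dna/2018-06-28/`. The sketch below is a minimal, offline illustration of that situation; it mimics a delimiter-based listing over a plain list of keys and flags names that show up both as an object and as a prefix. The helper names `list_level` and `find_collisions` and the `reads.bam` key are hypothetical, and this is not a claim about how Nextflow or its S3 library lists files.

```python
# Hypothetical helpers: a rough, offline imitation of a delimiter-based S3 listing.

def list_level(keys, prefix, delimiter="/"):
    """Return (common_prefixes, objects) found directly under `prefix`."""
    prefixes, objects = set(), set()
    for key in keys:
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        if not rest:
            continue  # the key is the prefix itself; nothing to list
        head, sep, _ = rest.partition(delimiter)
        if sep:
            prefixes.add(head + delimiter)  # shown as "PRE head/" by `aws s3 ls`
        else:
            objects.add(head)               # shown as a plain object
    return sorted(prefixes), sorted(objects)


def find_collisions(keys, prefix, delimiter="/"):
    """Names that exist both as an object and as a prefix under `prefix`."""
    prefixes, objects = list_level(keys, prefix, delimiter)
    return sorted(name for name in objects if name + delimiter in prefixes)


keys = [
    "dna/2018-06-28",  # stray object saved without a trailing slash
    "dna/2018-06-28/tick_1_S1_R1_001_first1Mreads.fastq.gz",
    "dna/2018-06-28/tick_1_S1_R2_001_first1Mreads.fastq.gz",
    "dna/2018-07-26-pacbio/reads.bam",  # hypothetical key, for illustration only
]

print(find_collisions(keys, "dna/"))  # ['2018-06-28']
```

In practice the list of keys would come from a real bucket listing; any name the check reports is a candidate for the kind of offending object described here.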

olgabot commented 5 years ago

Yes, removing the offending object worked!

$ aws s3 ls s3://tick-genome/dna/
                           PRE 2018-06-28/
                           PRE 2018-07-26-pacbio/
                           PRE 2018-08-03-pacbio-raw/
                           PRE 2018-08-16-sanger/
                           PRE 2018-10-10-dovetail/
                           PRE 2018-10-11_ise6_asm2.2/
                           PRE 2018-12-03_IscaW1/
                           PRE 2018-12-03_quast/
                           PRE tick_pacbio_20180813/
2018-07-11 13:24:40      79376 2018-06-28
$ aws s3 rm --dryrun s3://tick-genome/dna/2018-06-28
(dryrun) delete: s3://tick-genome/dna/2018-06-28
$ aws s3 rm s3://tick-genome/dna/2018-06-28
delete: s3://tick-genome/dna/2018-06-28

Now this workflow:

Channel
  .fromPath("s3://tick-genome/dna/2018-06-28/*.fastq.gz", type: 'any')
  .println()

Produces this output:

$ make scratch
nextflow run scratch.nf -e.process.executor=local \
        -dump-channels \
        -profile none \
        -e.aws.region=us-west-2
N E X T F L O W  ~  version 19.04.0
Launching `scratch.nf` [irreverent_varahamihira] - revision: 883bf431da
/tick-genome/dna/2018-06-28/tick_1_S1_R1_001_first1Mreads.fastq.gz
/tick-genome/dna/2018-06-28/tick_1_S1_R1_post-trimming_first1Mreads.fastq.gz
/tick-genome/dna/2018-06-28/tick_1_S1_R2_001_first1Mreads.fastq.gz
/tick-genome/dna/2018-06-28/tick_1_S1_R2_post-trimming_first1Mreads.fastq.gz

pditommaso commented 5 years ago

OK. Basically, you had a file with the name of a directory path, right?

olgabot commented 5 years ago

Yep, that's correct. But it wasn't exactly the same name, as it did not end in a slash ("/").
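
As a hedged aside on why the missing slash might not help: the two names differ as raw strings, but code that treats keys as filesystem-style paths often normalizes away a trailing slash, at which point they compare equal. The snippet below shows that effect with Python's `pathlib`; it is only an analogy, and it is an assumption, not a confirmed description, that the S3 file system layer used by Nextflow behaves similarly.

```python
from pathlib import PurePosixPath

obj_key = "/tick-genome/dna/2018-06-28"    # the stray object
dir_key = "/tick-genome/dna/2018-06-28/"   # the "directory" prefix

# As raw strings (which is how S3 compares keys), the names are different.
same_as_strings = obj_key == dir_key

# As normalized POSIX paths, the trailing slash is dropped and they are equal.
same_as_paths = PurePosixPath(obj_key) == PurePosixPath(dir_key)

print(same_as_strings, same_as_paths)  # False True
```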

pditommaso commented 5 years ago

OK, closing this and opening a related issue in the S3 library project: https://github.com/nextflow-io/nextflow-s3fs/issues/16.