nextflow-io / nextflow

A DSL for data-driven computational pipelines
http://nextflow.io
Apache License 2.0

Nextflow creates an object ending with a slash when using publishDir #5074

Open adriannavarrobetrian opened 5 months ago

adriannavarrobetrian commented 5 months ago

Bug report

When using the publishDir directive to send outputs to an S3 location, it looks like Nextflow creates a zero-sized object with a key ending in a slash at the publishDir location. While technically allowed by S3, this creates issues when performing operations on the resulting publishDir location (like recursing over objects or counting the number of objects under a prefix). It also keeps empty prefixes around; the objects themselves cannot be seen in the console, and if you naively try to copy the prefix using the AWS CLI, it fails.

Steps to reproduce the problem

Minimal example to replicate:

process example {
    publishDir "s3://my-bucket/buy-why-nextflow/test"

    input:
    val sample

    output:
    path "*fastq.gz"

    script:
    """
    touch ${sample}.fastq.gz
    """
}

workflow {
    example(Channel.of("SAMP1", "SAMP2"))
}

$ aws s3 ls --recursive s3://my-bucket/test/
2024-06-13 15:15:04          0 test/
2024-06-13 15:15:04          0 test/SAMP1.fastq.gz
2024-06-13 15:15:03          0 test/SAMP2.fastq.gz
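
To make the counting problem concrete, here is a minimal sketch that parses the listing above and separates the zero-byte key ending in a slash from the published files. The helper names `parse_listing` and `is_dir_marker` are hypothetical and only for illustration; the listing text is copied from this report.

```python
# Toy example: parse `aws s3 ls --recursive` style output and separate real
# objects from zero-byte keys that end with a slash (folder placeholders).
LISTING = """\
2024-06-13 15:15:04          0 test/
2024-06-13 15:15:04          0 test/SAMP1.fastq.gz
2024-06-13 15:15:03          0 test/SAMP2.fastq.gz
"""


def parse_listing(text):
    """Return (size, key) tuples from the listing text (hypothetical helper)."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue
        # Columns: date, time, size, key (the key itself may contain spaces).
        _date, _time, size, key = line.split(None, 3)
        entries.append((int(size), key))
    return entries


def is_dir_marker(size, key):
    """True for a zero-byte object whose key ends with '/'."""
    return size == 0 and key.endswith("/")


entries = parse_listing(LISTING)
markers = [key for size, key in entries if is_dir_marker(size, key)]
files = [key for size, key in entries if not is_dir_marker(size, key)]

print(len(entries))  # naive object count, includes the placeholder
print(markers)       # keys that only act as folder placeholders
print(files)         # the files that were actually published
```

A naive count under the prefix reports three objects even though only two files were published, so any consumer has to filter the placeholder explicitly.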

Program output

$ aws s3 cp s3://my-bucket/test/ ./
download failed: s3://my-bucket/test/ to ./ [Errno 21] Is a directory: '/some_local_dir/.5beAaC30' -> '/some_local_dir/'

Environment

pditommaso commented 5 months ago

I don't understand what's supposed to be the prefix in your example.

I've used this process definition:

process example {
    publishDir "s3://nextflow-ci/buy-why-nextflow"

    input:
    val sample

    output:
    path "*fastq.gz"

    script:
    """
    touch ${sample}.fastq.gz
    """
}

I'm getting this result, which looks perfectly fine:

2024-06-18 18:16:52          0 
2024-06-18 18:16:52          0 SAMP1.fastq.gz
2024-06-18 18:16:52          0 SAMP2.fastq.gz

adriannavarrobetrian commented 5 months ago

Sorry, I copied the example wrong. It's a folder, I updated it to test.

pditommaso commented 5 months ago

It's essentially the same; I don't see why it should not work.

ewels commented 4 months ago

@pditommaso I think the problem reported is the top line of your output:

2024-06-18 18:16:52          0 

That's a zero-sized object. OP is asking whether Nextflow can avoid creating it.

pditommaso commented 4 months ago

Fascinating, I see it now

bentsherman commented 4 months ago

I think it happens because Nextflow proactively creates the base publish directory before publishing files. I assumed that the S3 filesystem would map mkdir to a no-op, but apparently it is creating an empty "prefix" object.
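
The mechanism described in this comment can be illustrated with a toy in-memory model. This is only a sketch under assumptions: `FlatObjectStore` and its methods are hypothetical names, not Nextflow's or the AWS SDK's API, and the `mkdir` behaviour simply mirrors what the comment describes (a directory emulated as an empty object whose key ends with a slash).

```python
class FlatObjectStore:
    """Toy stand-in for a bucket: a flat mapping of key -> bytes, no real folders."""

    def __init__(self):
        self.objects = {}

    def put(self, key, data=b""):
        self.objects[key] = data

    def mkdir(self, prefix):
        # Assumed behaviour: emulate a directory by writing a zero-byte
        # object whose key ends with a slash.
        self.put(prefix.rstrip("/") + "/", b"")

    def list(self, prefix=""):
        return sorted(key for key in self.objects if key.startswith(prefix))


store = FlatObjectStore()

# Step 1: the publisher proactively creates the base publish directory.
store.mkdir("test")

# Step 2: the output files are published under that prefix.
for sample in ("SAMP1", "SAMP2"):
    store.put(f"test/{sample}.fastq.gz")

print(store.list("test/"))
```

In this model the listing contains the extra `test/` key next to the two files, which matches the shape of the listing in the report.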

pditommaso commented 4 months ago

Yep, realised the same. Something similar is done on Amazon, creating a dot (hidden) file.

ewels commented 4 months ago

"I think it happens because Nextflow proactively creates the base publish directory before publishing files."

Is it possible / advisable to simply skip this step for AWS S3?

bentsherman commented 4 months ago

We could do that, or we could make createDirectory() a no-op for the S3 filesystem: https://github.com/nextflow-io/nextflow/blob/12b027ee7e70d65bdee912856478894af4602170/plugins/nf-amazon/src/main/nextflow/cloud/aws/nio/S3FileSystemProvider.java#L468-L489

I'm not sure why you would ever need it... but something tells me that someone's pipeline will break if we remove it 😅
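
The two options raised in this thread (skipping the directory step for S3 targets, or making directory creation a no-op in the S3 filesystem) can be compared in a small sketch. Everything here is hypothetical: `ToyStore`, `publish`, and the flags `mkdir_is_noop` and `skip_mkdir_for_s3` are illustrative names, not part of Nextflow or the nf-amazon plugin.

```python
from urllib.parse import urlparse


class ToyStore:
    """Toy flat key store; `mkdir_is_noop` models the no-op createDirectory idea."""

    def __init__(self, mkdir_is_noop=False):
        self.objects = {}
        self.mkdir_is_noop = mkdir_is_noop

    def create_directory(self, prefix):
        if self.mkdir_is_noop:
            return  # store-level option: directories are implicit, write nothing
        self.objects[prefix.rstrip("/") + "/"] = b""

    def put(self, key, data=b""):
        self.objects[key] = data


def publish(store, target, filenames, skip_mkdir_for_s3=False):
    """Publish files under `target`, optionally skipping mkdir for s3:// targets."""
    parsed = urlparse(target)
    prefix = parsed.path.strip("/")
    # Caller-level option: do not create the base directory on S3 at all.
    if not (skip_mkdir_for_s3 and parsed.scheme == "s3"):
        store.create_directory(prefix)
    for name in filenames:
        store.put(f"{prefix}/{name}")
    return sorted(store.objects)


names = ["SAMP1.fastq.gz", "SAMP2.fastq.gz"]
target = "s3://my-bucket/test"

current = publish(ToyStore(), target, names)
skipped = publish(ToyStore(), target, names, skip_mkdir_for_s3=True)
noop = publish(ToyStore(mkdir_is_noop=True), target, names)

print(current)  # includes the 'test/' placeholder key
print(skipped)  # files only
print(noop)     # files only
```

Both variants give the same listing in this toy model. The difference is scope: the caller-level flag only changes the publishing path, while the store-level no-op changes behaviour for every caller of directory creation, which is where the concern about breaking someone's pipeline would apply.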