nextflow-io / nextflow

A DSL for data-driven computational pipelines
http://nextflow.io
Apache License 2.0
2.78k stars 634 forks

Loop in Nextflow logic caused thousands of empty folders in S3 staging folder inside workdir #4654

Open kpmos opened 10 months ago

kpmos commented 10 months ago

Bug report

Unclear on exactly how this happened, but our code explicitly set some paths to "/" or an empty string. This worked locally; however, when that code ran in AWS Batch with an S3 workdir, it caused an endless loop. The error we got was:

WARN: Path integrity check failed because the following file has been deleted: s3://bucket-name/pipeline-workdirs/job-id/stage-b0b335df-e31b-4612-a89e-9d2d3d56e41f/2c/fa25ce3231be63235902b85b01769e/bin -- make sure to not run more than one nextflow instance using the same work directory

Within the root of the workdir on S3 was the folder: stage-b0b335df-e31b-4612-a89e-9d2d3d56e41f

Within that stage folder were many thousands of empty folders, each named with a two-character prefix followed by a random hex string, like: /2c/fa25ce3231be63235902b85b01769e

None of the empty folders contained a bin subfolder (the one mentioned in the error) or any files at all.
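For context, Nextflow names each task directory by splitting the task's hash into a two-character prefix and the remainder, which matches the folder names observed above. A small Python sketch of that layout (the hash input here is made up for illustration, not an actual Nextflow task identity):

```python
import hashlib

# Illustrative only: derive a work-dir-style relative path from a hash,
# producing the "2 hex chars / remaining 30 hex chars" shape seen above.
digest = hashlib.md5(b"example-task-identity").hexdigest()  # 32 hex chars
task_dir = f"{digest[:2]}/{digest[2:]}"
print(task_dir)
```

Each retry of the staging appears to create a fresh directory of this shape under the stage-* folder.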

Expected behavior and actual behavior

Expected - If a file is deleted or missing in S3 (or, in our case, never existed), the loop should be cancelled, an error thrown, and the job ended.

Actual - Nextflow detected that the file in S3 might have been deleted, then kept retrying the staging over and over, creating a new empty folder on each attempt.

Steps to reproduce the problem

I'm not sure how to reproduce without providing our code which I cannot do in its current form.

Program output

The error looks to be from here: https://github.com/nextflow-io/nextflow/blob/d67aef5fecf09132a02276cc6735d431b7c84c42/modules/nextflow/src/main/groovy/nextflow/file/FilePorter.groovy#L405

And the loop calling it here: https://github.com/nextflow-io/nextflow/blob/d67aef5fecf09132a02276cc6735d431b7c84c42/modules/nextflow/src/main/groovy/nextflow/file/FilePorter.groovy#L375
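To make the question concrete, here is a sketch (in Python, emphatically not the actual Groovy FilePorter code) of a staging loop that gives up after a bounded number of integrity-check failures instead of retrying indefinitely; stage_fn and check_fn are hypothetical placeholders for the copy and the integrity check:

```python
# Hypothetical sketch, not Nextflow's implementation: stage a file, verify
# it afterwards, and abort after max_attempts failed integrity checks.
def stage_with_retries(stage_fn, check_fn, max_attempts=5):
    for attempt in range(1, max_attempts + 1):
        stage_fn()          # copy the file into the work dir
        if check_fn():      # integrity check (e.g. the file still exists)
            return attempt  # success: report how many tries it took
    raise RuntimeError(
        f"Path integrity check failed {max_attempts} times; giving up")
```

The behavior reported here suggests the real loop has no such upper bound when the integrity check keeps failing.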

Environment

Additional context

Within 10 minutes this loop generated over 1 million log entries and many thousands of empty folders within a stage directory inside the S3 workdir. With that many S3 and CloudWatch API calls, it could become an expensive problem if left running.

In our setup, we do not get any .nextflow.log files from a Batch job that is manually terminated, so there is no further log output to provide.

The workdir location in S3 was a new work directory, despite the error suggesting it may not be.

I guess the question I'm asking is: should Nextflow keep trying to stage files to S3 in a scenario where it can't find the file, or should the loop exit and the job fail with an error?

Or is this just a client-side issue caused by our poorly written code? If so, feel free to close this issue, but it felt worth raising given the potential for an endless loop when working with S3.

bentsherman commented 10 months ago

There is a loop there so that the file porter can skip over any conflicting directories, but I don't see how it could get into an infinite loop of nested directories like that. It looks like the staging was for the bin directory, which Nextflow uploads to S3 before running any tasks.

our code set some paths somewhere explicitly to "/" or an empty string

Do you think this caused the infinite loop? If so could you share an example?

kpmos commented 9 months ago

The example below is similar to a process that was called from a workflow where we saw this issue. At the time, we were using an S3 location as the workdir, passed in on the command line. The problem_file input is the one we passed in as "/".

When we set this to "/not_a_real_file.txt", Nextflow errored out because it could not find the file in S3. So I guess it does some kind of check against S3 before staging, whereas locally it seems to just create a symlink.

When we set this to "/", Nextflow did not error; instead it went into the loop of creating stage folders in our S3 workdir. Locally it created a symlink and did not end up in a loop.

process testing {
    storeDir params.out_dir + '/folder'

    input:
        path(problem_file)       // This gets passed in as "/"
        path(metadata)
        path("my_file.tsv")

    output:
        path("report.json"), emit: logs

    exec:
        problem_file_string = problem_file ? problem_file : ""

    script:
        """
        python do_something.py
        """
}
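One plausible contributing factor (an assumption, not confirmed anywhere in this thread): a bare "/" has no file-name component, so any staging logic that derives a target name from the source path gets an empty string to work with. Python's pathlib illustrates the same property:

```python
from pathlib import PurePosixPath

# The root path has an empty name, unlike a real file path.
print(repr(PurePosixPath("/").name))                     # ''
print(repr(PurePosixPath("/not_a_real_file.txt").name))  # 'not_a_real_file.txt'
```

That would be consistent with "/" behaving differently from "/not_a_real_file.txt" in the staging code paths.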
bentsherman commented 9 months ago

What I'm not understanding is why would you provide the path / as an input to a process? What is your intent with that?

Also not sure if it's related, but why is there both an exec block and a script block in the process? That shouldn't be allowed, but to be fair I think our parser just isn't strict enough.

kpmos commented 9 months ago

Fair, it was a poor code decision and has since been refactored. The intent was: if the input file was /, the code would treat it as an empty file to ignore; if it was /real_file.txt, it would do something else with the actual file. It was basically being used as a flag within the script that runs.

Honestly, I'm unsure why there is both an exec and a script block.

The S3 part was what worried me. It seemed that when using S3 as a workdir, passing / into a process as an input caused a folder-staging loop. We've now altered the code so it's no longer a problem for us, but figured you might want to investigate/test whether there is a looping issue.

If I'm not providing enough info (which I appreciate I might not be), then feel free to close the issue and it can always be reopened if others hit the same problem.

drernie commented 7 months ago

I am running into what may be the same issue. And my code is open source :-)

https://github.com/quiltdata/nf-quilt

export WRITE_BUCKET=my-s3-bucket
make pkg-test

Which gives many versions of:

WARN: Path integrity check failed because the following file has been deleted: /Users/ernest/GitHub/nf-quilt/work/stage-2687d2a2-f7da-4664-a501-c7c3032599b4/36/7a54b57c091871a11472e9a1db99fa/data/atlantic-storms.csv -- make sure to not run more than one nextflow instance using the same work directory

This is running against a locally-built Nextflow instance: nextflow version 24.03.0-edge.5908

jhidalgo-lopez commented 1 week ago

I encountered this issue while debugging some S3-MinIO issues we were having. A minimal example (test.nf):

workflow {
    // Define workflow parameters
    params.input  = null
    params.output = null

    // Check required parameters
    if (!params.input || !params.output) {
        error "Both --input and --output arguments must be provided."
    }

    // Define and execute the pipeline process
    copy_file(params.input, params.output)
}

process copy_file {
    input:
    path input_file // Input file path
    path output_dir // Output directory

    output:
    path "${output_dir}/copied_file" // Output file path

    script:
    """
    cp $input_file $output_dir/copied_file
    """
}

nextflow run test.nf --input s3://BUCKET/file --output s3://BUCKET/PATH

S3 -> Local: all good
S3 (or local) -> S3: "WARN: Path integrity check failed because the following file has been deleted" loop

This is running Nextflow 24.10.1 with S3 MinIO with this .config:

aws {
    accessKey = 'KEY'
    secretKey = 'SECRET'
    region = ''
    client {
        endpoint = 'https://URL:9000'
        protocol = 'https'
        s3PathStyleAccess = true
    }
}
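Until the root cause is resolved, one workaround implied by this thread is to validate inputs client-side before anything is staged. A minimal Python sketch of such a guard (validate_input_path is a hypothetical helper for illustration, not a Nextflow API):

```python
# Hypothetical pre-flight check: refuse to stage empty or root-only paths,
# the degenerate inputs that triggered the staging loop in this issue.
def validate_input_path(p):
    if p is None or p.strip() in ("", "/"):
        raise ValueError(f"refusing to stage degenerate path: {p!r}")
    return p
```

Rejecting such values up front turns the silent staging loop into an immediate, explicit failure.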