kpmos opened this issue 10 months ago
There is a loop there so that the file porter can skip over any conflicting directories, but I don't see how it could get into an infinite loop of nested directories like that. It looks like the staging was for the bin directory, which Nextflow uploads to S3 before running any tasks.
our code set some paths somewhere explicitly to "/" or an empty string
Do you think this caused the infinite loop? If so, could you share an example?
The example below is similar to a process that was called from a workflow where we saw this issue. At the time we were using an S3 location as a workdir, passed in on the command line. The problem_file input is the one which we passed in as "/".

When we set this to "/not_a_real_file.txt", Nextflow errored out as it could not find the file in S3. So I guess it does some type of check against S3 before staging it, rather than locally where it seems to create a symlink.

When we set this to "/", Nextflow did not error; instead it went into the loop of creating stage folders in our S3 workdir. Locally it created a symlink and didn't end up in a loop.
process testing {
    storeDir params.out_dir + '/folder'

    input:
    path(problem_file)   // This gets passed in as "/"
    path(metadata)
    path("my_file.tsv")

    output:
    path "report.json", emit: logs

    exec:
    problem_file_string = problem_file ? problem_file : ""

    script:
    """
    python do_something.py
    """
}
What I'm not understanding is why you would provide the path / as an input to a process? What is your intent with that?

Also, not sure if it's related, but why is there both an exec block and a script block in the process? That shouldn't be allowed, but to be fair I think our parser just isn't strict enough.
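For context, the two block types do different things, which is why having both is suspicious: an exec: block runs native Groovy code inside the Nextflow runtime, while a script: block builds a shell command that runs as the task. A minimal sketch of each (process names are made up for illustration):

// exec: block runs native Groovy code, no shell command is generated
process sayHelloNative {
    exec:
    println "Hello from a native (exec) task"
}

// script: block builds a shell command that runs as the task
process sayHelloShell {
    script:
    """
    echo "Hello from a shell (script) task"
    """
}

workflow {
    sayHelloNative()
    sayHelloShell()
}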
Fair, it was a poor code decision which has now been refactored. The intent was: if the input file was /, then do something in the code (which treated it as an empty file to ignore); if it was /real_file.txt, then do something else with the actual file. It was basically being used as a flag within the script that runs.

Honestly, unsure why there is both an exec and a script block.
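For reference, that kind of "optional file" flag is usually expressed in Nextflow with a placeholder file rather than /, branching on the file name inside the process. A minimal sketch along the lines of the documented optional-input pattern (the NO_FILE placeholder, process name, and report redirect are illustrative, not our actual code):

// Optional-input pattern: pass a placeholder file instead of "/" and
// branch on its name inside the process.
params.problem_file = 'NO_FILE'

process testing_refactored {
    input:
    path problem_file

    output:
    path 'report.json', emit: logs

    script:
    // Only pass the file to the script when a real file was provided
    def opt = problem_file.name != 'NO_FILE' ? "--problem-file ${problem_file}" : ''
    """
    python do_something.py ${opt} > report.json
    """
}

workflow {
    testing_refactored( file(params.problem_file) )
}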
The S3 part was what worried me. It seemed that using S3 as a workdir and passing / into a process as an input would cause a folder staging loop. We've now altered the code so it's not a problem for us any longer, but figured you might want to investigate/test whether there is a looping issue.
If I'm not providing enough info (which I appreciate I might not be), then feel free to close the issue and it can always be reopened if others hit the same problem.
I am running into what may be the same issue. And my code is open source :-)
https://github.com/quiltdata/nf-quilt
export WRITE_BUCKET=my-s3-bucket
make pkg-test
Which gives many versions of:
WARN: Path integrity check failed because the following file has been deleted: /Users/ernest/GitHub/nf-quilt/work/stage-2687d2a2-f7da-4664-a501-c7c3032599b4/36/7a54b57c091871a11472e9a1db99fa/data/atlantic-storms.csv -- make sure to not run more than one nextflow instance using the same work directory
This is running against a locally built Nextflow instance: nextflow version 24.03.0-edge.5908
I encountered this issue while debugging some S3-MinIO issues we were having. A minimal example (test.nf):
workflow {
    // Define workflow parameters
    params.input  = null
    params.output = null

    // Check required parameters
    if (!params.input || !params.output) {
        error "Both --input and --output arguments must be provided."
    }

    // Define and execute the pipeline process
    copy_file(params.input, params.output)
}

process copy_file {
    input:
    path input_file    // Input file path
    path output_dir    // Output directory

    output:
    path "${output_dir}/copied_file"    // Output file path

    script:
    """
    cp $input_file $output_dir/copied_file
    """
}
nextflow run test.nf --input s3://BUCKET/file --output s3://BUCKET/PATH
S3 -> local: all good.
S3 (or local) -> S3: "WARN: Path integrity check failed because the following file has been deleted" loop.
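Side note: this minimal example also passes the destination directory in as a path input and cp's into it. The more conventional way to get an output onto S3 is to write into the task directory and let publishDir copy the result out. A sketch of that variant (same --input/--output parameters assumed; not a claim about the original bug, just the usual pattern):

params.input  = null
params.output = null

process copy_file {
    // Publish the task output to the requested destination (local or s3://)
    publishDir params.output, mode: 'copy'

    input:
    path input_file

    output:
    path 'copied_file'

    script:
    """
    cp ${input_file} copied_file
    """
}

workflow {
    if (!params.input || !params.output)
        error "Both --input and --output arguments must be provided."
    copy_file( file(params.input) )
}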
This is running Nextflow 24.10.1 against S3 (MinIO) with this config:
aws {
    accessKey = 'KEY'
    secretKey = 'SECRET'
    region = ''
    client {
        endpoint = 'https://URL:9000'
        protocol = 'https'
        s3PathStyleAccess = true
    }
}
Bug report
Unclear on exactly how this happened, but our code set some paths somewhere explicitly to "/" or an empty string. This worked locally; however, when that code ran in AWS Batch with an S3 workdir it caused an endless loop. The error we got was:
Within the root of the workdir on S3 was the folder:

stage-b0b335df-e31b-4612-a89e-9d2d3d56e41f

Within that stage folder were many thousands of empty folders, each named with a two-character directory followed by a random-string path, like:

/2c/fa25ce3231be63235902b85b01769e

None of the empty folders had a bin subfolder (mentioned in the error) or any files in them.

Expected behavior and actual behavior
Expected - If a file has been deleted or removed in S3 (or, in our case, simply doesn't exist), then cancel the loop, throw an error, and end the job.
Actual - An error was detected that the file in S3 might have been deleted. The job kept retrying over and over to stage something, which resulted in an empty folder each time.
Steps to reproduce the problem
I'm not sure how to reproduce without providing our code which I cannot do in its current form.
Program output
The error looks to be from here: https://github.com/nextflow-io/nextflow/blob/d67aef5fecf09132a02276cc6735d431b7c84c42/modules/nextflow/src/main/groovy/nextflow/file/FilePorter.groovy#L405
And the loop calling it here: https://github.com/nextflow-io/nextflow/blob/d67aef5fecf09132a02276cc6735d431b7c84c42/modules/nextflow/src/main/groovy/nextflow/file/FilePorter.groovy#L375
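For illustration only (this is not the actual FilePorter code, just a simplified sketch of the retry pattern those lines describe): if each failed integrity check creates a fresh stage directory and tries again, a file that can never be verified loops forever, leaving one empty folder per attempt, whereas capping the attempts would turn it into a hard error.

// Simplified Groovy illustration, not the real FilePorter implementation.
int attempts = 0
final int MAX_ATTEMPTS = 5                       // made-up limit for the sketch
boolean staged = false

// Stand-ins for the real staging and verification steps
def newStageDir   = { File.createTempDir('stage-', '') }
def copyAndVerify = { File dir -> false }        // always fails, like a vanished source file

while (!staged) {
    if (++attempts > MAX_ATTEMPTS)
        throw new IllegalStateException("Unable to stage file after ${MAX_ATTEMPTS} attempts")
    def dir = newStageDir()                      // without the cap above, this would keep
    staged  = copyAndVerify(dir)                 // creating empty stage folders forever
}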
Environment
Additional context
Within 10 minutes this loop caused over 1 million log entries and many thousands of empty folders within a stage directory inside the workdir on S3. With this many S3 and CloudWatch API calls, it could have become an expensive problem if left running.
In our setup, we do not get any .nextflow.log files from a failed Batch job when it is manually terminated, so there is no further info to provide. The workdir location in S3 was a new work directory, despite the error suggesting it may not be.
I guess the question I'm asking is: should Nextflow continue to try staging files to S3 in a scenario where it can't find the file, or should the loop exit and the job error out?

Or is it just a client-side issue because our code was poorly written? If so, then feel free to close this issue, but it feels worth raising due to the potential for an endless loop when working with S3.