tomas-pluskal closed this issue 5 years ago
Indeed, it does.
Nov-14 18:41:52.392 [Task monitor] TRACE n.executor.AbstractGridExecutor - [LSF] queue status > map does not contain jobId: `5173092`
Nov-14 18:41:52.392 [Task monitor] TRACE nextflow.executor.GridTaskHandler - Exit file does not exist for and the job is not running for task: TaskHandler[jobId: 5173092; id: 337; name: kallisto; status: RUNNING; exit: -; error: -; workDir: /lab/weng_scratch/Tomas/nextflow_assemble/work/9a/b0fc905d6f100d31af1788f26b1d6a started: 1542238727399; exited: -; ] -- Try to wait before kill it
Then, after 5 secs, it finds it (exit status 0) and flags the task as successfully completed:
Nov-14 18:41:57.502 [Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[jobId: 5173092; id: 337; name: kallisto; status: COMPLETED; exit: 0; error: -; workDir: /lab/weng_scratch/Tomas/nextflow_assemble/work/9a/b0fc905d6f100d31af1788f26b1d6a started: 1542238727399; exited: 2018-11-14T23:41:43Z; ]
Then the problem is that it cannot find the expected output file, likely for the same reason (latency).
I've seen this happen on some NFS storage; I'm not sure how NF can help here. Try to discuss it with your sysadmins.
So NF waits up to two minutes for the exit code file, but it does not wait for the output files? Don’t you think it should?
Waiting for each output file may introduce a huge overhead.
Waiting is necessary, as clearly demonstrated by this issue.
Why would it introduce overhead?
Because it may require waiting several minutes for each file declared as output. I think this is a sign of a misconfigured environment.
You already mentioned you had seen this issue happen before on NFS. It has happened to me several times over the last ~2 weeks that I have been playing with NF. So whether or not it is a misconfigured environment (debatable), it is a real-life issue that NF has to deal with.
In practice, the filesystem delay will only be on the order of seconds. It will only hit the timeout when the output file is not produced at all, but that indicates a failed run, where a bit of delay is acceptable.
Another possible solution is to check for the output files on the same node that executed the job. I.e., the script running on that node could produce a .outputfiles file with a list of the files that were created. The main control script would then rely on this file instead of checking the filesystem.
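The idea above could be sketched roughly as follows (a hypothetical shell fragment, not actual Nextflow code; the `.outputfiles` name and the simulated file names are my own illustration):

```shell
# Hypothetical sketch: run on the compute node right after the task script
# finishes, recording every produced file in a ".outputfiles" manifest so the
# head node can read one file instead of probing the shared filesystem for
# each declared output.
workdir=$(mktemp -d)               # stands in for the task work directory
cd "$workdir"
touch abundance.tsv run_info.json  # simulate files the task produced
: > .outputfiles
for f in *; do                     # dotfiles (.exitcode, .command.*) are skipped by the glob
    printf '%s\n' "$f" >> .outputfiles
done
cat .outputfiles                   # the head node would parse this manifest
```

One catch is that the manifest file itself is still subject to the same NFS visibility delay, so the head node would still need a (single) wait on it.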
If you think you have viable solution, feel free to submit a pull-request.
I'll give it a try. Might need some help on the way.
Just FYI, Snakemake seems to be aware of this issue:
--latency-wait SECONDS, --output-wait SECONDS, -w SECONDS
Wait given seconds if an output file of a job is not
present after the job finished. This helps if your
filesystem suffers from latency (default 5).
I also have this "latency problem" with my PBS system. Adding
errorStrategy 'retry'
seems to work for me. But let me know, tomas-pluskal, if you find a better solution.
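For reference, this is roughly what that workaround looks like in nextflow.config (a sketch; the maxRetries value is my own choice, not from the thread):

```groovy
// Sketch of a nextflow.config fragment: resubmit failed tasks, which also
// papers over outputs that had not yet appeared when Nextflow checked.
process {
    errorStrategy = 'retry'   // re-execute the task on failure
    maxRetries = 3            // illustrative: give up after three retries
}
```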
Hi Paolo,
I'm also running into this issue. Would a reasonable solution be to wrap this for loop: https://github.com/nextflow-io/nextflow/blob/21ef7f51f7bde0633c3adbc98b5fd2a8a9e285a1/modules/nextflow/src/main/groovy/nextflow/processor/TaskProcessor.groovy#L1358 in a small outer loop, so the files can be checked several times before throwing the exception? I.e., the line below would only be reached after the files had been checked and found missing repeatedly (say, over 10 seconds of polling): https://github.com/nextflow-io/nextflow/blob/21ef7f51f7bde0633c3adbc98b5fd2a8a9e285a1/modules/nextflow/src/main/groovy/nextflow/processor/TaskProcessor.groovy#L1383
Here is a mockup of what the code could look like (I am not a Groovy expert):
int pollIntervalMillis = 250   // poll for missing files every 250 ms
int maxWaitTimeMillis = 10000  // wait up to 10 seconds in total
int i = 0
boolean waitedLongEnough = false
while( !waitedLongEnough && entries.size() > 0 ) {
    if( i * pollIntervalMillis > maxWaitTimeMillis )
        waitedLongEnough = true
    List<String> entriesToRemove = []
    // for each of them collect the produced files
    for( String filePattern : entries ) {
        List<Path> result = null
        def splitter = param.glob ? FilePatternSplitter.glob().parse(filePattern) : null
        if( splitter?.isPattern() ) {
            result = fetchResultFiles(param, filePattern, workDir)
            // filter the inputs
            if( !param.includeInputs ) {
                result = filterByRemovingStagedInputs(task, result, workDir)
                log.trace "Process ${task.name} > after removing staged inputs: ${result}"
            }
        }
        else {
            def path = param.glob ? splitter.strip(filePattern) : filePattern
            def file = workDir.resolve(path)
            def exists = param.followLinks ? file.exists() : file.exists(LinkOption.NOFOLLOW_LINKS)
            if( exists )
                result = [file]
            else
                log.debug "Process `${task.name}` is unable to find [${file.class.simpleName}]: `$file` (pattern: `$filePattern`)"
        }
        if( result ) {
            allFiles.addAll(result)
            entriesToRemove.add(filePattern)
        }
        else if( !param.optional && waitedLongEnough )  // only throw once the wait time has passed, otherwise loop again
            throw new MissingFileException("Missing output file(s) `$filePattern` expected by process `${task.name}`")
    }
    // remove all the satisfied filePatterns, then loop again
    entries.removeAll(entriesToRemove)
    sleep(pollIntervalMillis)
    i += 1
}
@photocyte actually a problem that could be related to this has been spotted, see this comment https://github.com/nextflow-io/nextflow/issues/1045#issuecomment-467789063
Thank you Paolo, I'll try to test if this issue still exists with the new snapshot. But maybe the issues are different? For example you say:
Looking at the log it looks NF cannot retrieve the job .exitcode file.
In this case, the exit code file is found (after the delay from executor.exitReadTimeout), but the output files have not yet appeared when Nextflow checks, although they show up later (probably < 5 seconds).
It seems that the issue is not filesystem 'latency', exactly, as Nextflow can wait indefinitely for the exit code file via the executor.exitReadTimeout parameter. Rather, it seems Nextflow currently requires the filesystem to guarantee that files appear in the same order they were written (output files, having been written before the .exitcode file, must appear before the .exitcode file does).
Based on some reading, I think the ext2 and btrfs local filesystems don't provide this guarantee (https://danluu.com/file-consistency/), and NFS on top of any filesystem doesn't provide such a guarantee either: https://www.kernel.org/doc/ols/2006/ols2006v2-pages-59-72.pdf . It seems a workaround to make NFS guarantee the file-appearance order to Nextflow would be to disable caching, but this seems to have a very high performance cost:
Some documents recommend turning off caching entirely, by mounting the file system with the noac option, but this is really a desperate measure, because it kills performance completely.
Does the 19.03.0-SNAPSHOT change the expectation of a guaranteed order of appearance of files from the filesystem? If not, I think the issue will still exist.
Rather, it seems Nextflow currently requires the filesystem to guarantee that files appear in the same order they were written (output files, as they were written before the exitcode file, must appear before the exitcode file does).
I'm not sure what you are concluding this from?
This is how I understand how Nextflow is working, and where the conclusion comes from:
1) Nextflow polls the LSF scheduler periodically (using bjobs) for the completion of the process executing on LSF, and once the process is finished, checks for the presence of the exit code file.
2) If the .exitcode file is not found after LSF indicates the job is complete, Nextflow will keep waiting for the .exitcode file (user configurable via executor.exitReadTimeout, > 2 minutes possible).
3) Once the exit code file is detected, the main Nextflow program immediately (< 1 ms?) checks for the presence of the output files. If the output files have not appeared yet, Nextflow does not wait for them, and the process fails.
Since the .exitcode file is written after the .command.sh script has executed and produced the output files, the .exitcode file will always be written after the output files. But the order in which files appear on the filesystem does not necessarily match the order in which they were written, e.g. the stated non-guarantee of the ordering of directory operations in certain filesystems, as cited in: https://danluu.com/file-consistency/
The execution logic in steps 1-3 assumes that the filesystem guarantees files appear in the same order they were written (.exitcode file after output files), since it treats any case where the .exitcode file is found but the output files are not as an error. Usually this is a good assumption: probably 99.99% of the time, missing output files just mean the process output directives are set to the wrong value(s), and the output files would never be produced by the script.
But I think that, rarely, on certain filesystems, including NFS on top of local filesystems that presumably do provide this order-of-operations guarantee, performance caching allows the filesystem to break the guarantee that the order in which files appear ('directory operations') matches the order in which they were written. This "out-of-order" situation then leads to a failure in step 3, as the output files have not yet appeared. With the change I propose above, Nextflow would wait for a short time in step 3 (~10 seconds), the files would presumably appear, and step 3 could return successfully.
Does it sound like I have understood things correctly? I've been discussing with our sysadmins whether they could open a ticket with the company that develops our network filesystem hardware and software (Dell EMC Isilon), so I want to be sure I'm on the right track here. According to our sysadmins, the Isilon hardware was updated recently (< 2 years ago) and the software is running the latest version, so I don't think it is a misconfiguration issue; or at least, the current default configuration shipped by Isilon would have this issue.
OK, got your point. The only thing is that I don't like the idea of a fixed 10-second delay. Ideally it should be a fail-retry mechanism with an exponential delay. You may want to open a separate enhancement request for this.
Thanks Paolo, I submitted a new issue on this: #1078. I agree a fixed 10-second delay is not ideal. In the suggestion I posted above (https://github.com/nextflow-io/nextflow/issues/931#issuecomment-469692428), I polled every 250 ms, but indeed an exponential delay might be better.
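A fail-retry check with exponential backoff could be sketched like this (a standalone shell sketch with my own function and parameter names, not Nextflow code):

```shell
# Hypothetical sketch: poll for an expected output file, doubling the delay
# between attempts, before declaring the output missing.
wait_for_file() {
    file=$1
    delay=0.25        # initial delay, seconds
    attempts=6        # total wait ≈ 0.25 + 0.5 + 1 + 2 + 4 + 8 s
    i=0
    while [ "$i" -lt "$attempts" ]; do
        [ -e "$file" ] && return 0                  # file appeared: success
        sleep "$delay"
        delay=$(awk "BEGIN { print $delay * 2 }")   # exponential backoff
        i=$((i + 1))
    done
    [ -e "$file" ]    # last chance after the final sleep; else report missing
}

# Usage: returns immediately for a file that already exists
tmp=$(mktemp)
wait_for_file "$tmp" && echo "found: $tmp"
```

An existing file is found on the first check with no sleep at all, so the backoff only costs time in the failure (or out-of-order visibility) case.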
I was doing testing on a small dataset. I never saw this problem before with small or large datasets, but today I hit the same problem. The files are clearly there (checked after the pipeline stopped, and marked as success). There must have been a delay on my NFS today. I'm not sure whether this issue is resolved in the latest version. My version: 0.30.2, build 4868.
@kemin711 There's an important patch in the latest version (19.04.0), see https://github.com/nextflow-io/nextflow/issues/1045. If the problem persists, open a new issue.
My nextflow runs on LSF cluster occasionally fail with this error message:
However, the file does exist:
I used this command to run the workflow:
And this is my nextflow.config:
I have attached the .nextflow.log file. nextflow_log.txt.zip
In the log file I can see that the kallisto process was submitted at 18:38:15.276 under LSF job id 5173092, and at 18:41:57.502 nextflow found that 5173092 was no longer in the LSF queue and therefore concluded it had finished. At 18:41:57.556 nextflow reported the error that the output file was missing. I suspect this error occurred due to network filesystem latency, but nextflow should have waited for 2 minutes according to my executor.exitReadTimeout setting, or am I missing something?