tomas-pluskal closed this issue 5 years ago
Indeed, it does.
Nov-14 18:41:52.392 [Task monitor] TRACE n.executor.AbstractGridExecutor - [LSF] queue status > map does not contain jobId: `5173092`
Nov-14 18:41:52.392 [Task monitor] TRACE nextflow.executor.GridTaskHandler - Exit file does not exist for and the job is not running for task: TaskHandler[jobId: 5173092; id: 337; name: kallisto; status: RUNNING; exit: -; error: -; workDir: /lab/weng_scratch/Tomas/nextflow_assemble/work/9a/b0fc905d6f100d31af1788f26b1d6a started: 1542238727399; exited: -; ] -- Try to wait before kill it
Then, after 5 secs, it finds it (exit status 0) and flags the task as successfully completed:
Nov-14 18:41:57.502 [Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[jobId: 5173092; id: 337; name: kallisto; status: COMPLETED; exit: 0; error: -; workDir: /lab/weng_scratch/Tomas/nextflow_assemble/work/9a/b0fc905d6f100d31af1788f26b1d6a started: 1542238727399; exited: 2018-11-14T23:41:43Z; ]
Then the problem is that it cannot find the expected output file, likely for the same reason (latency).
I've seen this happen on some NFS storage; I'm not sure how NF can help here. Try to discuss it with your sysadmins.
So NF waits up to two minutes for the exit code file, but it does not wait for the output files? Don’t you think it should?
Waiting for each output file may introduce a huge overhead.
Waiting is necessary, as clearly demonstrated by this issue.
Why would it introduce overhead?
Because it may require waiting several minutes for each file declared as output. I think this is a sign of a misconfigured environment.
You already mentioned you had seen this issue happen before on NFS. It has happened to me several times over the last ~2 weeks that I have been playing with NF. So whether or not it is a misconfigured environment (debatable), it is a real-life issue that NF has to deal with.
In practice, the filesystem delay will only be on the order of seconds. It will only hit the timeout when the output file is not produced at all, but that indicates a failed run, where a bit of delay is acceptable.
Another possible solution is to check for the output files on the same node that executed the job. I.e., the script running on that node could produce a .outputfiles file with a list of the files that were created. The main control script would then rely on this file instead of checking the filesystem.
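The idea above could be sketched roughly as follows (a hypothetical shell fragment, not actual Nextflow code; the `.outputfiles` name and the simulated file names are my own illustration):

```shell
# Hypothetical sketch: run on the compute node right after the task script
# finishes, recording every produced file in a ".outputfiles" manifest so the
# head node can read one file instead of probing the shared filesystem for
# each declared output.
workdir=$(mktemp -d)               # stands in for the task work directory
cd "$workdir"
touch abundance.tsv run_info.json  # simulate files the task produced
: > .outputfiles
for f in *; do                     # dotfiles (.exitcode, .command.*) are skipped by the glob
    printf '%s\n' "$f" >> .outputfiles
done
cat .outputfiles                   # the head node would parse this manifest
```

One catch is that the manifest file itself is still subject to the same NFS visibility delay, so the head node would still need a (single) wait on it.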
If you think you have viable solution, feel free to submit a pull-request.
I'll give it a try. Might need some help on the way.
Just FYI, Snakemake seems to be aware of this issue:
--latency-wait SECONDS, --output-wait SECONDS, -w SECONDS
Wait given seconds if an output file of a job is not
present after the job finished. This helps if your
filesystem suffers from latency (default 5).
I also have this "latency problem" with my PBS system. Adding
errorStrategy 'retry'
seems to work for me. But let me know, tomas-pluskal, if you find a better solution.
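For reference, this is roughly what that workaround looks like in nextflow.config (a sketch; the maxRetries value is my own choice, not from the thread):

```groovy
// Sketch of a nextflow.config fragment: resubmit failed tasks, which also
// papers over outputs that had not yet appeared when Nextflow checked.
process {
    errorStrategy = 'retry'   // re-execute the task on failure
    maxRetries = 3            // illustrative: give up after three retries
}
```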
Hi Paolo,
I'm also running into this issue. Would a reasonable solution be to wrap this for loop: https://github.com/nextflow-io/nextflow/blob/21ef7f51f7bde0633c3adbc98b5fd2a8a9e285a1/modules/nextflow/src/main/groovy/nextflow/processor/TaskProcessor.groovy#L1358 in a small outer loop, so the files can be checked several times before throwing the exception? I.e., the line below would only be reached after the files had been checked and found missing repeatedly (say, over 10 seconds of polling): https://github.com/nextflow-io/nextflow/blob/21ef7f51f7bde0633c3adbc98b5fd2a8a9e285a1/modules/nextflow/src/main/groovy/nextflow/processor/TaskProcessor.groovy#L1383
Here is a mockup of what the code could look like (I am not a Groovy expert):
int pollIntervalMillis = 250   // poll for missing files every 250 ms
int maxWaitTimeMillis = 10000  // wait up to 10 seconds in total
int i = 0
boolean waitedLongEnough = false
while( !waitedLongEnough && entries.size() > 0 ) {
    if( i * pollIntervalMillis > maxWaitTimeMillis )
        waitedLongEnough = true
    List<String> entriesToRemove = []
    // for each of them collect the produced files
    for( String filePattern : entries ) {
        List<Path> result = null
        def splitter = param.glob ? FilePatternSplitter.glob().parse(filePattern) : null
        if( splitter?.isPattern() ) {
            result = fetchResultFiles(param, filePattern, workDir)
            // filter the inputs
            if( !param.includeInputs ) {
                result = filterByRemovingStagedInputs(task, result, workDir)
                log.trace "Process ${task.name} > after removing staged inputs: ${result}"
            }
        }
        else {
            def path = param.glob ? splitter.strip(filePattern) : filePattern
            def file = workDir.resolve(path)
            def exists = param.followLinks ? file.exists() : file.exists(LinkOption.NOFOLLOW_LINKS)
            if( exists )
                result = [file]
            else
                log.debug "Process `${task.name}` is unable to find [${file.class.simpleName}]: `$file` (pattern: `$filePattern`)"
        }
        if( result ) {
            allFiles.addAll(result)
            entriesToRemove.add(filePattern)
        }
        else if( !param.optional && waitedLongEnough )  // only throw once the wait time has passed, otherwise loop again
            throw new MissingFileException("Missing output file(s) `$filePattern` expected by process `${task.name}`")
    }
    // remove all the satisfied filePatterns, then loop again
    entries.removeAll(entriesToRemove)
    sleep(pollIntervalMillis)
    i += 1
}
@photocyte actually a problem that could be related to this has been spotted, see this comment https://github.com/nextflow-io/nextflow/issues/1045#issuecomment-467789063
Thank you Paolo, I'll try to test if this issue still exists with the new snapshot. But maybe the issues are different? For example you say:
Looking at the log it looks NF cannot retrieve the job .exitcode file.
In this case, the exit code file is found (after the delay from executor.exitReadTimeout), but the output files have not yet appeared when Nextflow checks, although they show up later (probably < 5 seconds).
It seems that the issue is not filesystem 'latency', exactly, as Nextflow can wait indefinitely for the exit code file via the executor.exitReadTimeout parameter. Rather, it seems Nextflow currently requires the filesystem to guarantee that files appear in the same order they were written (output files, having been written before the .exitcode file, must appear before the .exitcode file does).
Based on some reading, I think the ext2 and btrfs local filesystems don't provide this guarantee (https://danluu.com/file-consistency/), and NFS on top of any filesystem doesn't provide such a guarantee either: https://www.kernel.org/doc/ols/2006/ols2006v2-pages-59-72.pdf . It seems a workaround to make NFS guarantee the file-appearance order to Nextflow would be to disable caching, but this seems to have a very high performance cost:
Some documents recommend turning off caching entirely, by mounting the file system with the noac option, but this is really a desperate measure, because it kills performance completely.
Does the 19.03.0-SNAPSHOT change the expectation of a guaranteed order of appearance of files from the filesystem? If not, I think the issue will still exist.
Rather, it seems Nextflow currently requires the filesystem to guarantee that files appear in the same order they were written (output files, as they were written before the exitcode file, must appear before the exitcode file does).
I'm not sure what you are concluding this from?
This is how I understand how Nextflow is working, and where the conclusion comes from:
1) Nextflow polls the LSF scheduler periodically (using bjobs) for the completion of the process executing on LSF, and once the process is finished, checks for the presence of the exit code file.
2) If the .exitcode file is not found after LSF indicates the job is complete, Nextflow will keep waiting for the .exitcode file (user configurable via executor.exitReadTimeout, > 2 minutes possible).
3) Once the exit code file is detected, the main Nextflow program immediately (< 1 ms?) checks for the presence of the output files. If the output files have not appeared yet, Nextflow does not wait for them, and the process fails.
Since the .exitcode file is written after the .command.sh script has executed and produced the output files, the .exitcode file will always be written after the output files. But the order in which files appear on the filesystem does not necessarily match the order in which they were written, e.g. the stated non-guarantee of the ordering of directory operations in certain filesystems, as cited in: https://danluu.com/file-consistency/
The execution logic in steps 1-3 assumes that the filesystem guarantees files appear in the same order they were written (.exitcode file after output files), since it treats any case where the .exitcode file is found but the output files are not as an error. Usually this is a good assumption: probably 99.99% of the time, missing output files just mean the process output directives are set to the wrong value(s), and the output files would never be produced by the script.
But I think that, rarely, on certain filesystems, including NFS on top of local filesystems that presumably do provide this order-of-operations guarantee, performance caching allows the filesystem to break the guarantee that the order in which files appear ('directory operations') matches the order in which they were written. This "out-of-order" situation then leads to a failure in step 3, as the output files have not yet appeared. With the change I propose above, Nextflow would wait for a short time in step 3 (~10 seconds), the files would presumably appear, and step 3 could return successfully.
Does it sound like I have understood things correctly? I've been discussing with our sysadmins whether they could open a ticket with the company that develops our network filesystem hardware and software (Dell EMC Isilon), so I want to be sure I'm on the right track here. According to our sysadmins, the Isilon hardware was updated recently (< 2 years ago) and the software is running the latest version, so I don't think it is a misconfiguration issue; or at least, the current default configuration shipped by Isilon would have this issue.
OK, got your point. The only thing is that I don't like the idea of a fixed 10-second delay. Ideally it should be a fail-retry mechanism with an exponential delay. You may want to open a separate enhancement request for this.
Thanks Paolo, I submitted a new issue on this: #1078. I agree a fixed 10-second delay is not ideal. In the suggestion I posted above (https://github.com/nextflow-io/nextflow/issues/931#issuecomment-469692428), I polled every 250 ms, but indeed an exponential delay might be better.
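A fail-retry check with exponential backoff could be sketched like this (a standalone shell sketch with my own function and parameter names, not Nextflow code):

```shell
# Hypothetical sketch: poll for an expected output file, doubling the delay
# between attempts, before declaring the output missing.
wait_for_file() {
    file=$1
    delay=0.25        # initial delay, seconds
    attempts=6        # total wait ≈ 0.25 + 0.5 + 1 + 2 + 4 + 8 s
    i=0
    while [ "$i" -lt "$attempts" ]; do
        [ -e "$file" ] && return 0                  # file appeared: success
        sleep "$delay"
        delay=$(awk "BEGIN { print $delay * 2 }")   # exponential backoff
        i=$((i + 1))
    done
    [ -e "$file" ]    # last chance after the final sleep; else report missing
}

# Usage: returns immediately for a file that already exists
tmp=$(mktemp)
wait_for_file "$tmp" && echo "found: $tmp"
```

An existing file is found on the first check with no sleep at all, so the backoff only costs time in the failure (or out-of-order visibility) case.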
I was doing testing on a small dataset. I never saw this problem before with small or large datasets, but today I hit the same problem. The files are clearly there (checked after the pipeline stopped, and marked as success). There must have been a delay on my NFS today. I'm not sure whether this issue is resolved in the latest version. My version: 0.30.2, build 4868.
@kemin711 There's an important patch in the latest version (19.04.0), see https://github.com/nextflow-io/nextflow/issues/1045. If the problem persists, open a new issue.
My nextflow runs on LSF cluster occasionally fail with this error message:
However, the file does exist:
I used this command to run the workflow:
And this is my nextflow.config:
I have attached the .nextflow.log file. nextflow_log.txt.zip
In the log file I can see that the kallisto process was submitted at 18:38:15.276 under LSF job id 5173092, and at 18:41:57.502 nextflow found that 5173092 was no longer in the LSF queue and therefore concluded it had finished. At 18:41:57.556 nextflow reported the error that the output file was missing. I suspect this error occurred due to network filesystem latency, but nextflow should have waited for 2 minutes according to my executor.exitReadTimeout setting, or am I missing something?