nextflow-io / nextflow

A DSL for data-driven computational pipelines
http://nextflow.io
Apache License 2.0

Nextflow with Azure batch appears to fail when reading from multiple containers #5448


zjupNN commented 3 weeks ago

Bug report

Nextflow with Azure Batch appears to fail when reading from multiple containers. Processes report the input files as not existing, and .fusion.log contains 403 authentication errors. This is apparently similar to a previously fixed issue, but it persists in 24.10.0, so it may have a different cause.

See also discussion on Slack.

Expected behavior and actual behavior

We expected that it would be possible to read from multiple Azure containers in the same workflow; it seems it is not.

Steps to reproduce the problem

Here is a small workflow to illustrate the problem:

process multi {
  conda "conda-forge::gawk"

  input:
  path(p1)
  path(p2)

  output:
  path("both.txt")

  script:
  """
  cat ${p1} ${p2} > both.txt
  """
}

workflow {
  p1 = Channel.fromPath(params.p1)
  p2 = Channel.fromPath(params.p2)
  multi(p1, p2)
}

Running

nextflow run main.nf \
  -profile azure_batch \
  -w az://output/multi \
  --p1 az://input1/foo.txt \
  --p2 az://input2/bar.txt

fails, whereas

nextflow run main.nf \
  -profile azure_batch \
  -w az://output/multi \
  --p1 az://output/foo.txt \
  --p2 az://output/bar.txt

works fine.
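One workaround we have considered, but not yet verified, is to authenticate storage with an account-level SAS token instead of the managed identity, in case the credential that Fusion exports only covers a single container. A minimal sketch of the relevant config change (the token value is a placeholder and is assumed to grant read access to all containers involved):

azure {
  storage {
    accountName = '[...]'
    // assumption: an account-level SAS token with read access to input1, input2 and output
    sasToken = '[...]'
  }
}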

The config in question, containing the azure_batch profile (with some redacted info):

nextflow.enable.moduleBinaries = true

process {
    resourceLimits = [ cpus: 128, memory: 200.GB, time: 24.h ]

    errorStrategy = { task.exitStatus in [143, 137, 104, 134, 139] ? 'retry' : 'finish' }
    maxRetries = 1
    maxErrors = '-1'

    cpus = { 1 * task.attempt }
    memory = { 10.GB * task.attempt }
    time = { 12.h * task.attempt }
}

profiles {
  azure_batch {
    process {
      executor = 'azurebatch'
      machineType = "Standard_D2_v3,Standard_D4_v3,Standard_D8_v3,Standard_D16_v3,Standard_D32_v3"
    }

    managedIdentity {
      system = true
    }

    wave {
      enabled = true
      strategy = ['conda']
    }

    fusion {
      enabled = true
      exportStorageCredentials = true
    }

    azure {
      managedIdentity {
        system = true
      }

      storage {
        accountName = '[...]'
      }

      batch {
        location = '[...]'
        accountName = '[...]'

        autoPoolMode = true
        deletePoolsOnCompletion = true

        pools {
          auto {
            autoScale = true
            vmCount = 1
            maxVmCount = 100
            virtualNetwork = '[...]'
          }
        }
      }
    }
  }
}

Program output

Running nextflow prints:

executor >  azurebatch (fusion enabled) (1)
[22/bfb52c] multi (1) [100%] 1 of 1, failed: 1
Execution cancelled -- Finishing pending tasks before exit
ERROR ~ Error executing process > 'multi (1)'

Caused by:
  The task exited with an exit code representing a failure

Command executed:

  cat foo.txt bar.txt > both.txt

Command exit status:
  1

Command output:
  (empty)

Command error:
  + cat foo.txt bar.txt
  cat: foo.txt: No such file or directory
  cat: bar.txt: No such file or directory

Work dir:
  [...]

Container:
  [...]

Tip: you can try to figure out what's wrong by changing to the process work dir and showing the script file named `.command.sh`

-- Check '.nextflow.log' file for details

The .nextflow.log does not contain anything that stands out, whereas the .fusion.log contains:

RESPONSE 403: 403 Server failed to authenticate the request. Make sure the value of Authorization header is formed correctly including the signature.
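To narrow down where the 403 comes from, one diagnostic sketch (assuming azcopy is available on the node, and with <account> standing in for the redacted storage account) is to log in with the node's managed identity and list both input containers directly:

azcopy login --identity
azcopy list 'https://<account>.blob.core.windows.net/input1'
azcopy list 'https://<account>.blob.core.windows.net/input2'

If listing the second container also fails here, the problem would point to the identity's role assignments rather than to Fusion itself.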

Environment

- Nextflow version: 24.10.0
- Executor: azurebatch (Azure Batch)
- Fusion: enabled (via Wave)

Additional context

We have not been able to verify whether the problem is Fusion-related. The pipeline still fails (with a similar but different error message) when running with fusion.enabled = false, but it has been difficult to tell whether that is the same issue or an unrelated problem with getting azcopy to where it needs to be during execution.
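For reference, the non-Fusion attempt mentioned above used the same azure_batch profile with only the fusion scope flipped:

fusion {
  enabled = false
}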

pditommaso commented 3 weeks ago

Duplicate of #5444 (?)

zjupNN commented 3 weeks ago

Yes, issue #5444 was created based on the Slack discussion related to this issue; here I've just collected the input from our data scientists to give an overview of how it was found.