nextflow-io / nf-nomad

Hashicorp Nomad executor plugin for Nextflow
https://nextflow-io.github.io/nf-nomad/
Apache License 2.0

Error running nf-nomad with acl enabled #56

Open jhaezebr opened 2 days ago

jhaezebr commented 2 days ago

Nextflow seems to be unable to submit jobs when ACL is enabled, but using the same token I can submit a job using the nomad CLI.

Nextflow log
```
Jul-03 12:13:27.492 [main] DEBUG nextflow.cli.Launcher - $> nextflow run hello -c nomad.config -w ./work
Jul-03 12:13:27.870 [main] DEBUG nextflow.cli.CmdRun - N E X T F L O W ~ version 24.04.2
Jul-03 12:13:27.930 [main] DEBUG nextflow.plugin.PluginsFacade - Setting up plugin manager > mode=prod; embedded=false; plugins-dir=/home/research/.nextflow/plugins; core-plugins: nf-amazon@2.5.2,nf-azure@1.6.0,nf-cloudcache@0.4.1,nf-codecommit@0.2.0,nf-console@1.1.3,nf-ga4gh@1.3.0,nf-google@1.13.2,nf-tower@1.9.1,nf-wave@1.4.2
Jul-03 12:13:28.014 [main] INFO o.pf4j.DefaultPluginStatusProvider - Enabled plugins: []
Jul-03 12:13:28.016 [main] INFO o.pf4j.DefaultPluginStatusProvider - Disabled plugins: []
Jul-03 12:13:28.025 [main] INFO org.pf4j.DefaultPluginManager - PF4J version 3.10.0 in 'deployment' mode
Jul-03 12:13:28.189 [main] INFO org.pf4j.AbstractPluginManager - No plugins
Jul-03 12:13:28.232 [main] DEBUG nextflow.scm.ProviderConfig - Using SCM config path: /home/research/.nextflow/scm
Jul-03 12:13:28.253 [main] DEBUG nextflow.scm.AssetManager - Listing projects in folder: /home/research/.nextflow/assets
Jul-03 12:13:30.130 [main] DEBUG nextflow.scm.AssetManager - Git config: /home/research/.nextflow/assets/nextflow-io/hello/.git/config; branch: master; remote: origin; url: https://github.com/nextflow-io/hello.git
Jul-03 12:13:30.344 [main] DEBUG nextflow.scm.RepositoryFactory - Found Git repository result: [RepositoryFactory]
Jul-03 12:13:30.389 [main] DEBUG nextflow.scm.AssetManager - Git config: /home/research/.nextflow/assets/nextflow-io/hello/.git/config; branch: master; remote: origin; url: https://github.com/nextflow-io/hello.git
Jul-03 12:13:32.835 [main] DEBUG nextflow.config.ConfigBuilder - Found config home: /home/research/.nextflow/config
Jul-03 12:13:32.837 [main] DEBUG nextflow.config.ConfigBuilder - Found config base: /home/research/.nextflow/assets/nextflow-io/hello/nextflow.config
Jul-03 12:13:32.849 [main] DEBUG nextflow.config.ConfigBuilder - User config file: /scratch/nf-nomad/nomad.config
Jul-03 12:13:32.852 [main] DEBUG nextflow.config.ConfigBuilder - Parsing config file: /home/research/.nextflow/config
Jul-03 12:13:32.853 [main] DEBUG nextflow.config.ConfigBuilder - Parsing config file: /home/research/.nextflow/assets/nextflow-io/hello/nextflow.config
Jul-03 12:13:32.854 [main] DEBUG nextflow.config.ConfigBuilder - Parsing config file: /scratch/nf-nomad/nomad.config
Jul-03 12:13:32.892 [main] DEBUG n.secret.LocalSecretsProvider - Secrets store: /home/research/.nextflow/secrets/store.json
Jul-03 12:13:32.900 [main] DEBUG nextflow.secret.SecretsLoader - Discovered secrets providers: [nextflow.secret.LocalSecretsProvider@2b736fee] - activable => nextflow.secret.LocalSecretsProvider@2b736fee
Jul-03 12:13:32.912 [main] DEBUG nextflow.config.ConfigBuilder - Applying config profile: `standard`
Jul-03 12:13:33.202 [main] DEBUG nextflow.config.ConfigBuilder - Applying config profile: `standard`
Jul-03 12:13:33.274 [main] DEBUG nextflow.config.ConfigBuilder - Applying config profile: `standard`
Jul-03 12:13:33.744 [main] DEBUG nextflow.cli.CmdRun - Applied DSL=2 by global default
Jul-03 12:13:33.751 [main] DEBUG nextflow.cli.CmdRun - Launching `https://github.com/nextflow-io/hello` [disturbed_shannon] DSL2 - revision: 7588c46ffe [master]
Jul-03 12:13:33.756 [main] DEBUG nextflow.plugin.PluginsFacade - Plugins declared=[nf-nomad@0.1.1]
Jul-03 12:13:33.758 [main] DEBUG nextflow.plugin.PluginsFacade - Plugins default=[]
Jul-03 12:13:33.760 [main] DEBUG nextflow.plugin.PluginsFacade - Plugins resolved requirement=[nf-nomad@0.1.1]
Jul-03 12:13:33.761 [main] DEBUG nextflow.plugin.PluginUpdater - Installing plugin nf-nomad version: 0.1.1
Jul-03 12:13:33.798 [main] INFO org.pf4j.AbstractPluginManager - Plugin 'nf-nomad@0.1.1' resolved
Jul-03 12:13:33.798 [main] INFO org.pf4j.AbstractPluginManager - Start plugin 'nf-nomad@0.1.1'
Jul-03 12:13:33.862 [main] DEBUG nextflow.plugin.BasePlugin - Plugin started nf-nomad@0.1.1
Jul-03 12:13:34.025 [main] DEBUG nextflow.Session - Session UUID: 52aae5fc-1036-4f86-af10-e5633ac019f5
Jul-03 12:13:34.026 [main] DEBUG nextflow.Session - Run name: disturbed_shannon
Jul-03 12:13:34.026 [main] DEBUG nextflow.Session - Executor pool size: 80
Jul-03 12:13:34.047 [main] DEBUG nextflow.file.FilePorter - File porter settings maxRetries=3; maxTransfers=50; pollTimeout=null
Jul-03 12:13:34.063 [main] DEBUG nextflow.util.ThreadPoolBuilder - Creating thread pool 'FileTransfer' minSize=10; maxSize=240; workQueue=LinkedBlockingQueue[10000]; allowCoreThreadTimeout=false
Jul-03 12:13:34.134 [main] DEBUG nextflow.cli.CmdRun -
  Version: 24.04.2 build 5914
  Created: 29-05-2024 06:19 UTC
  System: Linux 5.4.0-150-generic
  Runtime: Groovy 4.0.21 on OpenJDK 64-Bit Server VM 11.0.23-internal+0-adhoc..src
  Encoding: UTF-8 (UTF-8)
  Process: 59747@compute-87hs7j2 [127.0.1.1]
  CPUs: 80 - Mem: 629.8 GB (13.6 GB) - Swap: 4 GB (3.6 GB)
Jul-03 12:13:34.273 [main] DEBUG nextflow.Session - Work-dir: /scratch/nf-nomad/work [ceph]
Jul-03 12:13:34.274 [main] DEBUG nextflow.Session - Script base path does not exist or is not a directory: /home/research/.nextflow/assets/nextflow-io/hello/bin
Jul-03 12:13:34.331 [main] DEBUG nextflow.executor.ExecutorFactory - Extension executors providers=[NomadExecutor]
Jul-03 12:13:34.369 [main] DEBUG nextflow.Session - Observer factory: DefaultObserverFactory
Jul-03 12:13:34.506 [main] DEBUG nextflow.cache.CacheFactory - Using Nextflow cache factory: nextflow.cache.DefaultCacheFactory
Jul-03 12:13:34.545 [main] DEBUG nextflow.util.CustomThreadPool - Creating default thread pool > poolSize: 81; maxThreads: 1000
Jul-03 12:13:34.749 [main] DEBUG nextflow.Session - Session start
Jul-03 12:13:35.455 [main] DEBUG nextflow.script.ScriptRunner - > Launching execution
Jul-03 12:13:35.736 [main] DEBUG nextflow.executor.ExecutorFactory - << taskConfig executor: nomad
Jul-03 12:13:35.736 [main] DEBUG nextflow.executor.ExecutorFactory - >> processorType: 'nomad'
Jul-03 12:13:35.744 [main] DEBUG nextflow.executor.Executor - [warm up] executor > nomad
Jul-03 12:13:35.765 [main] DEBUG n.processor.TaskPollingMonitor - Creating task monitor for executor 'nomad' > capacity: 100; pollInterval: 5s; dumpInterval: 5m
Jul-03 12:13:35.771 [main] DEBUG n.processor.TaskPollingMonitor - >>> barrier register (monitor: nomad)
Jul-03 12:13:36.185 [main] DEBUG n.nomad.executor.NomadService - [NOMAD] Client Address: http://nomad.ops.cmgg.be/v1
Jul-03 12:13:36.186 [main] DEBUG n.nomad.executor.NomadService - [NOMAD] Client Token: 4465a..
Jul-03 12:13:36.549 [main] DEBUG nextflow.Session - Workflow process names [dsl2]: sayHello
Jul-03 12:13:36.550 [main] DEBUG nextflow.Session - Igniting dataflow network (2)
Jul-03 12:13:36.552 [main] DEBUG nextflow.processor.TaskProcessor - Starting process > sayHello
Jul-03 12:13:36.564 [main] DEBUG nextflow.script.ScriptRunner - Parsed script files:
  Script_45e06ae60646ee81: /home/research/.nextflow/assets/nextflow-io/hello/main.nf
Jul-03 12:13:36.565 [main] DEBUG nextflow.script.ScriptRunner - > Awaiting termination
Jul-03 12:13:36.565 [main] DEBUG nextflow.Session - Session await
Jul-03 12:13:38.298 [Actor Thread 8] INFO nextflow.processor.TaskProcessor - [sayHello (4)] cache hash: 233d257343efe6e16bd7c6104c229955; mode: STANDARD; entries:
  264bf2d524d18f4ce02bfcc59170f616 [java.util.UUID] 52aae5fc-1036-4f86-af10-e5633ac019f5
  3a5266cb2487ca6ddc8c22a42478f272 [java.lang.String] sayHello
  ee0a1d23a8c26fdf4d1575310833774f [java.lang.String] """
    echo '$x world!'
    """
  20edf49cb4b22a20a5e05a9d1144bf0f [java.lang.String] quay.io/nextflow/bash
  769f897d21d56476ad01edc930becff0 [java.lang.String] x
  f5e76d4e64af0c5d859ff08ab3b720b7 [java.lang.String] Hola
  4f9d4b0d22865056c37fb6d9c2a04a67 [java.lang.String] $
  16fe7483905cce7a85670e43e4678877 [java.lang.Boolean] true
Jul-03 12:13:38.275 [Actor Thread 7] INFO nextflow.processor.TaskProcessor - [sayHello (3)] cache hash: 7121055b03c0817999f33638f4237c5d; mode: STANDARD; entries:
  264bf2d524d18f4ce02bfcc59170f616 [java.util.UUID] 52aae5fc-1036-4f86-af10-e5633ac019f5
  3a5266cb2487ca6ddc8c22a42478f272 [java.lang.String] sayHello
  ee0a1d23a8c26fdf4d1575310833774f [java.lang.String] """
    echo '$x world!'
    """
  20edf49cb4b22a20a5e05a9d1144bf0f [java.lang.String] quay.io/nextflow/bash
  769f897d21d56476ad01edc930becff0 [java.lang.String] x
  0ab6632d52e811e9ef7c044666ac496a [java.lang.String] Hello
  4f9d4b0d22865056c37fb6d9c2a04a67 [java.lang.String] $
  16fe7483905cce7a85670e43e4678877 [java.lang.Boolean] true
Jul-03 12:13:38.357 [Actor Thread 4] INFO nextflow.processor.TaskProcessor - [sayHello (1)] cache hash: 5c5ceeed61a78867efbf73384c00380e; mode: STANDARD; entries:
  264bf2d524d18f4ce02bfcc59170f616 [java.util.UUID] 52aae5fc-1036-4f86-af10-e5633ac019f5
  3a5266cb2487ca6ddc8c22a42478f272 [java.lang.String] sayHello
  ee0a1d23a8c26fdf4d1575310833774f [java.lang.String] """
    echo '$x world!'
    """
  769f897d21d56476ad01edc930becff0 [java.lang.String] x
  c9273e5a7ac3508ef910437c4bb35a90 [java.lang.String] Bonjour
  4f9d4b0d22865056c37fb6d9c2a04a67 [java.lang.String] $
  16fe7483905cce7a85670e43e4678877 [java.lang.Boolean] true
Jul-03 12:13:38.298 [Actor Thread 6] INFO nextflow.processor.TaskProcessor - [sayHello (2)] cache hash: c607458338b72c0746d6fcac6772aa62; mode: STANDARD; entries:
  264bf2d524d18f4ce02bfcc59170f616 [java.util.UUID] 52aae5fc-1036-4f86-af10-e5633ac019f5
  3a5266cb2487ca6ddc8c22a42478f272 [java.lang.String] sayHello
  ee0a1d23a8c26fdf4d1575310833774f [java.lang.String] """
    echo '$x world!'
    """
  20edf49cb4b22a20a5e05a9d1144bf0f [java.lang.String] quay.io/nextflow/bash
  769f897d21d56476ad01edc930becff0 [java.lang.String] x
  442e002ddd8b0a2b10ed51352f8c0488 [java.lang.String] Ciao
  4f9d4b0d22865056c37fb6d9c2a04a67 [java.lang.String] $
  16fe7483905cce7a85670e43e4678877 [java.lang.Boolean] true
Jul-03 12:13:38.649 [Task submitter] DEBUG n.nomad.executor.NomadTaskHandler - [NOMAD] Submitting task sayHello (2) - work-dir=/scratch/nf-nomad/work/70/ecf3dfb7e0c167b38d4183e81c87fa
Jul-03 12:13:39.197 [Task submitter] DEBUG nextflow.processor.TaskProcessor - Handling unexpected condition for
  task: name=sayHello (2); work-dir=/scratch/nf-nomad/work/70/ecf3dfb7e0c167b38d4183e81c87fa
  error [nextflow.exception.ProcessSubmitException]: [NOMAD] Failed to submit sayHello (2) -- Cause: Forbidden
Jul-03 12:13:39.256 [Task submitter] DEBUG nextflow.processor.TaskRun - Unable to dump error of process 'null' -- Cause: java.nio.file.NoSuchFileException: /scratch/nf-nomad/work/70/ecf3dfb7e0c167b38d4183e81c87fa/.command.log
Jul-03 12:13:39.269 [Task submitter] ERROR nextflow.processor.TaskProcessor - Error executing process > 'sayHello (2)'

Caused by:
  Forbidden

Command executed:
  echo 'Ciao world!'

Command exit status:
  -

Command output:
  (empty)

Work dir:
  /scratch/nf-nomad/work/70/ecf3dfb7e0c167b38d4183e81c87fa

Tip: view the complete command output by changing to the process work dir and entering the command `cat .command.out`
Jul-03 12:13:39.274 [Task submitter] DEBUG nextflow.Session - Session aborted -- Cause: [NOMAD] Failed to submit sayHello (2) -- Cause: Forbidden
Jul-03 12:13:39.360 [Task submitter] DEBUG nextflow.Session - The following nodes are still active:
  [operator] view
Jul-03 12:13:39.409 [Task monitor] DEBUG n.processor.TaskPollingMonitor - <<< barrier arrives (monitor: nomad) - terminating tasks monitor poll loop
Jul-03 12:13:39.428 [main] DEBUG nextflow.Session - Session await > all processes finished
Jul-03 12:13:39.428 [main] DEBUG nextflow.Session - Session await > all barriers passed
Jul-03 12:13:39.446 [main] DEBUG n.trace.WorkflowStatsObserver - Workflow completed > WorkflowStats[succeededCount=0; failedCount=0; ignoredCount=0; cachedCount=0; pendingCount=4; submittedCount=0; runningCount=0; retriesCount=0; abortedCount=0; succeedDuration=0ms; failedDuration=0ms; cachedDuration=0ms;loadCpus=0; loadMemory=0; peakRunning=0; peakCpus=0; peakMemory=0; ]
Jul-03 12:13:39.697 [main] DEBUG nextflow.cache.CacheDB - Closing CacheDB done
Jul-03 12:13:39.745 [main] INFO org.pf4j.AbstractPluginManager - Stop plugin 'nf-nomad@0.1.1'
Jul-03 12:13:39.745 [main] DEBUG nextflow.plugin.BasePlugin - Plugin stopped nf-nomad
Jul-03 12:13:39.753 [main] DEBUG nextflow.script.ScriptRunner - > Execution complete -- Goodbye
```
Nextflow config
```
dumpHashes = true

plugins {
    id 'nf-nomad@0.1.1'
}

process {
    executor = "nomad"
    docker.enabled = true
}

nomad {
    client {
        address = "http://nomad.example.com"
        token = "XXXXXXXXXXXXXXXXXXX"
    }
    jobs {
        deleteOnCompletion = false
        namespace = "nextflow"
        datacenters = ['dc']
        volumes = [
            {
                type "csi"
                name "nf_scratch_volume"
                path "/scratch"
            },
            {
                type "csi"
                name "nf_reference_volume"
                path "/references"
            }
        ]
    }
}
```
Nomad log
```
2024-07-03T12:13:39.167Z [TRACE] nomad.job: job mutate results: mutator=canonicalize warnings=[] error=
2024-07-03T12:13:39.167Z [TRACE] nomad.job: job mutate results: mutator=connect warnings=[] error=
2024-07-03T12:13:39.167Z [TRACE] nomad.job: job mutate results: mutator=expose-check warnings=[] error=
2024-07-03T12:13:39.167Z [TRACE] nomad.job: job mutate results: mutator=constraints warnings=[] error=
2024-07-03T12:13:39.167Z [TRACE] nomad.job: job mutate results: mutator=node-pool-mutation warnings=[] error=
2024-07-03T12:13:39.167Z [TRACE] nomad.job: job validate results: validator=connect warnings=[] error=
2024-07-03T12:13:39.167Z [TRACE] nomad.job: job validate results: validator=expose-check warnings=[] error=
2024-07-03T12:13:39.167Z [TRACE] nomad.job: job validate results: validator=vault warnings=[] error=
2024-07-03T12:13:39.167Z [TRACE] nomad.job: job validate results: validator=namespace-constraint-check warnings=[] error=
2024-07-03T12:13:39.167Z [TRACE] nomad.job: job validate results: validator=node-pool-validation warnings=[] error=
2024-07-03T12:13:39.167Z [TRACE] nomad.job: job validate results: validator=validate warnings=[] error=
2024-07-03T12:13:39.167Z [TRACE] nomad.job: job validate results: validator=memory_oversubscription warnings=[] error=
2024-07-03T12:13:39.168Z [DEBUG] http: request failed: method=POST path=/v1/jobs?namespace=nextflow error="Permission denied" code=403
2024-07-03T12:13:39.168Z [DEBUG] http: request complete: method=POST path=/v1/jobs?namespace=nextflow duration=1.19574ms
```
Manual run
```
$ export NOMAD_TOKEN='XXXXXXXXXXXX'
$ export NOMAD_ADDR="http://nomad.example.com"
$ export NOMAD_NAMESPACE=nextflow
$ export NOMAD_DC=s10
$ nomad job run test.hcl
==> Monitoring evaluation "02b6eef0"
    Evaluation triggered by job "example"
    Evaluation within deployment: "4d4d3f64"
    Allocation "984a1dcb" created: node "57dfcfcd", group "example"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "02b6eef0" finished with status "complete"
$ nomad job status
ID       Type     Priority  Status   Submit Date
example  service  50        running  2024-07-03T12:12:42Z
$ cat test.hcl
job "example" {
  group "example" {
    task "sleep" {
      driver = "docker"

      config {
        image      = "busybox:latest"
        entrypoint = ["/bin/sleep", "300"]
      }

      resources {
        cpu    = 500
        memory = 256
      }
    }
  }
}
```
Nomad nextflow ACL
```
namespace "nextflow" {
  policy = "write"
}

agent {
  policy = "deny"
}

operator {
  policy = "deny"
}

quota {
  policy = "deny"
}

node {
  policy = "deny"
}

host_volume "*" {
  policy = "deny"
}

plugin {
  policy = "deny"
}
```

jagedn commented 2 days ago

Can you try mounting a volume in test.hcl, please?

I'm not sure (yet) how ACLs work, but host_volume in your example is set to "deny", and the Nextflow task needs to mount the volume.
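
For anyone reproducing this, a minimal sketch of what test.hcl could look like with one of the CSI volumes from the Nextflow config mounted. The volume label "scratch" and the access_mode value are assumptions for illustration; the source name and mount path come from the config posted above.

```hcl
# Hypothetical variant of test.hcl that also mounts a CSI volume,
# to check whether the token is allowed to claim volumes.
job "example" {
  group "example" {
    # "scratch" is an arbitrary label; the source must match a registered CSI volume.
    volume "scratch" {
      type            = "csi"
      source          = "nf_scratch_volume"
      attachment_mode = "file-system"
      access_mode     = "multi-node-multi-writer" # assumption: adjust to the volume's registered capability
    }

    task "sleep" {
      driver = "docker"

      config {
        image      = "busybox:latest"
        entrypoint = ["/bin/sleep", "300"]
      }

      # Mount the group-level volume inside the container.
      volume_mount {
        volume      = "scratch"
        destination = "/scratch"
      }

      resources {
        cpu    = 500
        memory = 256
      }
    }
  }
}
```

If this job is rejected with the same 403 while the volume-less job runs, that would point at the volume-related ACL rules rather than at job submission itself.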

jagedn commented 2 days ago

I've tested against the local cluster created in the validation folder

( see https://github.com/nextflow-io/nf-nomad/pull/57 )

When the --secure flag is provided, the cluster is bootstrapped with ACLs enabled and the NOMAD_TOKEN is required to run the pipelines.

jhaezebr commented 2 days ago

So to use CSI volumes you need at least read permission on plugins and the csi-list-volume capability on the namespace.

Updated policy
```
namespace "nextflow" {
  policy = "write"
  capabilities = [
    "csi-write-volume",
    "csi-read-volume",
    "csi-list-volume",
    "csi-mount-volume"
  ]
}

agent {
  policy = "deny"
}

operator {
  policy = "deny"
}

quota {
  policy = "deny"
}

node {
  policy = "deny"
}

host_volume "*" {
  policy = "deny"
}

plugin {
  policy = "read"
}
```

Other than that, there is still a problem with volumes that are registered as read-only:

  capability {
    access_mode     = "multi-node-reader-only"
    attachment_mode = "file-system"
  }

  mount_options {
    mount_flags = [ "ro" ]
  }

jagedn commented 2 days ago

we're mounting (all) the volumes as writable

taskDef.config.mount = [ type : "volume", target : destinationDir, source : config.jobOpts().dockerVolume, readonly : false ]

so probably we need to extend our dsl spec with more features

abhi18av commented 2 days ago

@jhaezebr what's the overall use-case for read-only file systems in your setup?

matthdsm commented 1 day ago

@jagedn we use a read-only mount for our reference store. This isn't strictly needed, but we want this mount to be read-only so a rogue process can't delete or change any of the references.
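
For context, a minimal sketch of how a read-only reference mount is expressed in a plain Nomad job spec, independent of nf-nomad. The volume label "references" and the job/task names are illustrative; the source name, mount path, and access/attachment modes are taken from the config and volume definition quoted earlier in this thread.

```hcl
# Hypothetical job showing a CSI volume claimed and mounted read-only.
job "example" {
  group "example" {
    volume "references" {
      type            = "csi"
      source          = "nf_reference_volume"
      read_only       = true
      access_mode     = "multi-node-reader-only"
      attachment_mode = "file-system"
    }

    task "sleep" {
      driver = "docker"

      config {
        image      = "busybox:latest"
        entrypoint = ["/bin/sleep", "300"]
      }

      # read_only here keeps the task from writing to the mount.
      volume_mount {
        volume      = "references"
        destination = "/references"
        read_only   = true
      }
    }
  }
}
```

This is only what the target job would need to contain; how the plugin's volume spec should expose such a flag is a separate design question.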

jhaezebr commented 1 day ago

I've made a separate issue for the read-only use-case: https://github.com/nextflow-io/nf-nomad/issues/60. I'll focus on the ACL part here :)

jhaezebr commented 1 day ago

For the moment this ACL seems to work for nextflow:

namespace "nextflow" {
  policy = "write"
  capabilities = [
    "csi-write-volume",
    "csi-read-volume",
    "csi-list-volume",
    "csi-mount-volume"
  ]
}

agent {
  policy = "deny"
}

operator {
  policy = "deny"
}

quota {
  policy = "deny"
}

node {
  policy = "deny"
}

host_volume "*" {
  policy = "deny"
}

plugin {
  policy = "read"
}

abhi18av commented 1 day ago

Gotcha - thanks @jhaezebr !

Quick question, did you test with fusionfs setup or just CSI?

Judging from the following rule, I think this could be a blocker, since fusionfs requires the use of tmp.

host_volume "*" {
  policy = "deny"
}

Ideally, we want to keep feature parity with both 🤝
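
If a host volume did turn out to be necessary for that setup (untested here), a narrower rule than a blanket deny might look like the sketch below. The volume name "nf-tmp" is purely hypothetical; host_volume rules and the mount-readwrite capability are part of Nomad's ACL policy syntax.

```hcl
# Hypothetical: allow mounting one named host volume, deny all others.
host_volume "nf-tmp" {
  capabilities = ["mount-readwrite"]
}

host_volume "*" {
  policy = "deny"
}
```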

jhaezebr commented 21 hours ago

No, I didn't test fusionfs, just csi. We don't use fusionfs in our cluster and I'm not familiar with it.