nextflow-io / nextflow

A DSL for data-driven computational pipelines
http://nextflow.io
Apache License 2.0

Issue using non-root user with google-batch executor #4880

Open JohnWalshTempus opened 6 months ago

JohnWalshTempus commented 6 months ago

Bug report

Expected behavior and actual behavior

The GCP Batch executor (google-batch) should allow tasks to run as a non-root user, for improved security. Currently, only the root user can access files under /mnt/disks/**.

Steps to reproduce the problem

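I have pushed two public Docker images: one with root as the default user and another with worker as the default user.

Both are on Docker Hub and are built from the following Dockerfile; the lines marked below are commented out for the root-default image: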

FROM debian:buster-slim

RUN apt-get update
RUN apt-get upgrade -y
RUN apt-get install -y curl make wget ca-certificates && rm -rf /var/lib/apt/lists/*

RUN apt-get update \
    && apt-get -yqq install \
    libhdf5-dev jq procps -y

# commented out for root-default user image
RUN adduser --disabled-login --gecos "" worker

RUN mkdir /app

# Set ownership and permissions for directories
# (commented out for root-default user image)
RUN chown -R worker:worker /bin/ /app /mnt/ /tmp/ && \
    chmod -R +x /mnt/ /bin/ /app/ /tmp/

# commented out for root-default user image
USER worker

WORKDIR /app/

ENTRYPOINT []
CMD []

The workflow I am running is as follows:

main.nf:

#!/usr/bin/env nextflow

process HELLO {
  input: 
    val x

  script:
    """
    echo '$x world!'
    """
}

workflow {
  input_channel = Channel.of('Hello')
  input_channel | HELLO
}

nextflow.config:

workDir = 'gs://<my-bucket>/workshop'

process {
    executor = 'google-batch'
    //  container = 'jvwalsh/nextflow-non-root-user:latest'
    container = 'jvwalsh/nextflow-non-root-user:latest'
}

google {
    <my google conf>
}

Program output

The run succeeds with the root image, while the non-root image produces the following errors in the GCP Batch logs:

/bin/bash: /mnt/disks/<my-bucket>/workshop/c8/3e1a8fe217e72f82128617f99061d3/.command.run: Permission denied
cp: failed to access '/mnt/disks/<my-bucket>/workshop/c8/3e1a8fe217e72f82128617f99061d3/.command.log': Permission denied
nextflow.log:

```
Apr-03 13:08:58.219 [main] DEBUG nextflow.cli.Launcher - $> nextflow run .
Apr-03 13:08:58.274 [main] INFO nextflow.cli.CmdRun - N E X T F L O W ~ version 23.10.1
Apr-03 13:08:58.288 [main] DEBUG nextflow.plugin.PluginsFacade - Setting up plugin manager > mode=prod; embedded=false; plugins-dir=/Users/John.Walsh/.nextflow/plugins; core-plugins: nf-amazon@2.1.4,nf-azure@1.3.3,nf-cloudcache@0.3.0,nf-codecommit@0.1.5,nf-console@1.0.6,nf-ga4gh@1.1.0,nf-google@1.8.3,nf-tower@1.6.3,nf-wave@1.0.1
Apr-03 13:08:58.299 [main] INFO o.pf4j.DefaultPluginStatusProvider - Enabled plugins: []
Apr-03 13:08:58.300 [main] INFO o.pf4j.DefaultPluginStatusProvider - Disabled plugins: []
Apr-03 13:08:58.302 [main] INFO org.pf4j.DefaultPluginManager - PF4J version 3.4.1 in 'deployment' mode
Apr-03 13:08:58.309 [main] INFO org.pf4j.AbstractPluginManager - No plugins
Apr-03 13:08:58.636 [main] DEBUG nextflow.config.ConfigBuilder - Found config local: /Users/John.Walsh/Learn/nf-minimal-issue/nextflow.config
Apr-03 13:08:58.637 [main] DEBUG nextflow.config.ConfigBuilder - Parsing config file: /Users/John.Walsh/Learn/nf-minimal-issue/nextflow.config
Apr-03 13:08:58.641 [main] DEBUG nextflow.config.ConfigBuilder - Applying config profile: `standard`
Apr-03 13:08:58.681 [main] DEBUG nextflow.cli.CmdRun - Applied DSL=2 by global default
Apr-03 13:08:58.706 [main] INFO nextflow.cli.CmdRun - Launching `./main.nf` [furious_plateau] DSL2 - revision: 0f6cc6a24a
Apr-03 13:08:58.707 [main] DEBUG nextflow.plugin.PluginsFacade - Plugins default=[nf-google@1.8.3]
Apr-03 13:08:58.707 [main] DEBUG nextflow.plugin.PluginsFacade - Plugins resolved requirement=[nf-google@1.8.3]
Apr-03 13:08:58.707 [main] DEBUG nextflow.plugin.PluginUpdater - Installing plugin nf-google version: 1.8.3
Apr-03 13:08:58.738 [main] INFO org.pf4j.AbstractPluginManager - Plugin 'nf-google@1.8.3' resolved
Apr-03 13:08:58.739 [main] INFO org.pf4j.AbstractPluginManager - Start plugin 'nf-google@1.8.3'
Apr-03 13:08:58.812 [main] DEBUG nextflow.plugin.BasePlugin - Plugin started nf-google@1.8.3
Apr-03 13:08:58.823 [main] DEBUG n.secret.LocalSecretsProvider - Secrets store: /Users/John.Walsh/.nextflow/secrets/store.json
Apr-03 13:08:58.825 [main] DEBUG nextflow.secret.SecretsLoader - Discovered secrets providers: [nextflow.secret.LocalSecretsProvider@12b5736c] - activable => nextflow.secret.LocalSecretsProvider@12b5736c
Apr-03 13:08:58.859 [main] DEBUG nextflow.Session - Session UUID: 11a85f03-9fcc-4ee8-940b-a606ff267b8d
Apr-03 13:08:58.860 [main] DEBUG nextflow.Session - Run name: furious_plateau
Apr-03 13:08:58.860 [main] DEBUG nextflow.Session - Executor pool size: 10
Apr-03 13:08:59.048 [main] DEBUG nextflow.file.FilePorter - File porter settings maxRetries=3; maxTransfers=50; pollTimeout=null
Apr-03 13:08:59.051 [main] DEBUG nextflow.util.ThreadPoolBuilder - Creating thread pool 'FileTransfer' minSize=10; maxSize=30; workQueue=LinkedBlockingQueue[10000]; allowCoreThreadTimeout=false
Apr-03 13:08:59.268 [main] DEBUG nextflow.cli.CmdRun -
  Version: 23.10.1 build 5891
  Created: 12-01-2024 22:01 UTC (16:01 CDT)
  System: Mac OS X 13.4.1
  Runtime: Groovy 3.0.19 on OpenJDK 64-Bit Server VM 11.0.21+0
  Encoding: UTF-8 (UTF-8)
  Process: 73834@TMREM00010367.local [192.168.1.66]
  CPUs: 10 - Mem: 16 GB (64.9 MB) - Swap: 9 GB (327.2 MB)
Apr-03 13:08:59.286 [main] DEBUG nextflow.Session - Work-dir: gs:///workshop [Mac OS X]
Apr-03 13:08:59.286 [main] DEBUG nextflow.Session - Script base path does not exist or is not a directory: /Users/John.Walsh/Learn/nf-minimal-issue/bin
Apr-03 13:08:59.305 [main] DEBUG nextflow.executor.ExecutorFactory - Extension executors providers=[GoogleLifeSciencesExecutor, GoogleBatchExecutor]
Apr-03 13:08:59.311 [main] DEBUG nextflow.Session - Observer factory: DefaultObserverFactory
Apr-03 13:08:59.320 [main] DEBUG nextflow.cache.CacheFactory - Using Nextflow cache factory: nextflow.cache.DefaultCacheFactory
Apr-03 13:08:59.326 [main] DEBUG nextflow.util.CustomThreadPool - Creating default thread pool > poolSize: 11; maxThreads: 1000
Apr-03 13:08:59.392 [main] DEBUG nextflow.Session - Session start
Apr-03 13:08:59.489 [main] DEBUG nextflow.script.ScriptRunner - > Launching execution
Apr-03 13:08:59.537 [main] DEBUG nextflow.executor.ExecutorFactory - << taskConfig executor: google-batch
Apr-03 13:08:59.538 [main] DEBUG nextflow.executor.ExecutorFactory - >> processorType: 'google-batch'
Apr-03 13:08:59.539 [main] DEBUG nextflow.executor.Executor - [warm up] executor > google-batch
Apr-03 13:08:59.542 [main] DEBUG n.processor.TaskPollingMonitor - Creating task monitor for executor 'google-batch' > capacity: 1000; pollInterval: 10s; dumpInterval: 5m
Apr-03 13:08:59.544 [main] DEBUG n.processor.TaskPollingMonitor - >>> barrier register (monitor: google-batch)
Apr-03 13:08:59.547 [main] DEBUG nextflow.cloud.google.GoogleOpts - Google auth via application DEFAULT
Apr-03 13:08:59.549 [main] DEBUG n.c.google.batch.GoogleBatchExecutor - [GOOGLE BATCH] Executor config=BatchConfig[googleOpts=GoogleOpts(projectId:, credsFile:null, location:us-west1, enableRequesterPaysBuckets:false, httpConnectTimeout:1m, httpReadTimeout:1m, credentials:UserCredentials{requestMetadata=null, temporaryAccess=null, clientId=, refreshToken=, tokenServerUri=https://oauth2.googleapis.com/token, transportFactoryClassName=com.google.auth.oauth2.OAuth2Utils$DefaultHttpTransportFactory, quotaProjectId=})
Apr-03 13:08:59.558 [main] DEBUG n.c.google.batch.client.BatchClient - [GOOGLE BATCH] Creating service client with config credentials
Apr-03 13:09:00.161 [main] DEBUG nextflow.Session - Workflow process names [dsl2]: HELLO
Apr-03 13:09:00.161 [main] DEBUG nextflow.Session - Igniting dataflow network (2)
Apr-03 13:09:00.161 [main] DEBUG nextflow.processor.TaskProcessor - Starting process > HELLO
Apr-03 13:09:00.166 [main] DEBUG nextflow.script.ScriptRunner - Parsed script files:
  Script_48b0a9acaea0588b: /Users/John.Walsh/Learn/nf-minimal-issue/main.nf
Apr-03 13:09:00.167 [main] DEBUG nextflow.script.ScriptRunner - > Awaiting termination
Apr-03 13:09:00.167 [main] DEBUG nextflow.Session - Session await
Apr-03 13:09:03.662 [Task submitter] DEBUG n.c.g.batch.GoogleBatchTaskHandler - [GOOGLE BATCH] Process `HELLO (1)` submitted > job=nf-c83e1a8f-1712167740918; uid=nf-c83e1a8f-171216-d929d349-3ff3-4efa0; work-dir=gs:///workshop/c8/3e1a8fe217e72f82128617f99061d3
Apr-03 13:09:03.662 [Task submitter] INFO nextflow.Session - [c8/3e1a8f] Submitted process > HELLO (1)
Apr-03 13:10:39.970 [Task monitor] DEBUG n.c.g.batch.GoogleBatchTaskHandler - [GOOGLE BATCH] Process `HELLO (1)` - terminated job=nf-c83e1a8f-1712167740918; state=FAILED
Apr-03 13:10:40.273 [Task monitor] DEBUG n.c.g.batch.GoogleBatchTaskHandler - [GOOGLE BATCH] Cannot read exit status for task: `HELLO (1)` - gs:///workshop/c8/3e1a8fe217e72f82128617f99061d3/.exitcode
Apr-03 13:10:41.011 [Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[id: 1; name: HELLO (1); status: COMPLETED; exit: -; error: -; workDir: gs:///workshop/c8/3e1a8fe217e72f82128617f99061d3]
Apr-03 13:10:41.017 [Task monitor] DEBUG nextflow.processor.TaskProcessor - Handling unexpected condition for task: name=HELLO (1); work-dir=gs:///workshop/c8/3e1a8fe217e72f82128617f99061d3 error [nextflow.exception.ProcessFailedException]: Process `HELLO (1)` terminated for an unknown reason -- Likely it has been terminated by the external system
Apr-03 13:10:41.108 [Task monitor] DEBUG nextflow.processor.TaskRun - Unable to dump output of process 'null' -- Cause: java.nio.file.NoSuchFileException: gs:///workshop/c8/3e1a8fe217e72f82128617f99061d3/.command.out
Apr-03 13:10:41.178 [Task monitor] DEBUG nextflow.processor.TaskRun - Unable to dump error of process 'null' -- Cause: java.nio.file.NoSuchFileException: gs:///workshop/c8/3e1a8fe217e72f82128617f99061d3/.command.err
Apr-03 13:10:41.179 [Task monitor] ERROR nextflow.processor.TaskProcessor - Error executing process > 'HELLO (1)'

Caused by:
  Process `HELLO (1)` terminated for an unknown reason -- Likely it has been terminated by the external system

Command executed:

  echo 'Hello world!'

Command exit status:
  -

Command output:
  (empty)

Work dir:
  gs:///workshop/c8/3e1a8fe217e72f82128617f99061d3

Tip: you can replicate the issue by changing to the process work dir and entering the command `bash .command.run`
Apr-03 13:10:41.184 [main] DEBUG nextflow.Session - Session await > all processes finished
Apr-03 13:10:41.273 [Task monitor] DEBUG nextflow.Session - Session aborted -- Cause: Process `HELLO (1)` terminated for an unknown reason -- Likely it has been terminated by the external system
Apr-03 13:10:41.373 [Task monitor] DEBUG nextflow.processor.TaskRun - Unable to dump error of process 'null' -- Cause: java.nio.file.NoSuchFileException: gs:///workshop/c8/3e1a8fe217e72f82128617f99061d3/.command.err
Apr-03 13:10:41.450 [Task monitor] DEBUG nextflow.processor.TaskRun - Unable to dump output of process 'null' -- Cause: java.nio.file.NoSuchFileException: gs:///workshop/c8/3e1a8fe217e72f82128617f99061d3/.command.out
Apr-03 13:10:41.451 [main] DEBUG nextflow.Session - Session await > all barriers passed
Apr-03 13:10:41.451 [Task monitor] DEBUG n.processor.TaskPollingMonitor - <<< barrier arrives (monitor: google-batch) - terminating tasks monitor poll loop
Apr-03 13:10:41.535 [main] DEBUG nextflow.processor.TaskRun - Unable to dump error of process 'null' -- Cause: java.nio.file.NoSuchFileException: gs:///workshop/c8/3e1a8fe217e72f82128617f99061d3/.command.err
Apr-03 13:10:41.614 [main] DEBUG nextflow.processor.TaskRun - Unable to dump output of process 'null' -- Cause: java.nio.file.NoSuchFileException: gs:///workshop/c8/3e1a8fe217e72f82128617f99061d3/.command.out
Apr-03 13:10:41.620 [main] DEBUG n.trace.WorkflowStatsObserver - Workflow completed > WorkflowStats[succeededCount=0; failedCount=1; ignoredCount=0; cachedCount=0; pendingCount=0; submittedCount=0; runningCount=0; retriesCount=0; abortedCount=0; succeedDuration=0ms; failedDuration=1.1s; cachedDuration=0ms;loadCpus=0; loadMemory=0; peakRunning=1; peakCpus=1; peakMemory=0; ]
Apr-03 13:10:41.661 [main] DEBUG nextflow.cache.CacheDB - Closing CacheDB done
Apr-03 13:10:41.662 [main] INFO org.pf4j.AbstractPluginManager - Stop plugin 'nf-google@1.8.3'
Apr-03 13:10:41.662 [main] DEBUG nextflow.plugin.BasePlugin - Plugin stopped nf-google
Apr-03 13:10:41.680 [main] DEBUG nextflow.script.ScriptRunner - > Execution complete -- Goodbye
```
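To confirm from inside the container which user a task runs as and what access it actually gets on the mounted work directory, a small diagnostic like the one below could be run in an image that has Python 3 available (the Dockerfile above does not install it), passing paths such as the task work directory as arguments. This is a minimal sketch using only the standard library; describe_access is a hypothetical helper, not part of Nextflow or Batch.

```python
import os
import stat
import sys


def describe_access(path: str) -> dict:
    """Report the caller's identity and the access the kernel grants on `path`."""
    info = {"euid": os.geteuid(), "egid": os.getegid()}
    try:
        st = os.stat(path)
    except PermissionError as exc:
        # A FUSE mount made without allow_other may reject even stat() from other users
        info["error"] = str(exc)
        return info
    info.update(
        owner_uid=st.st_uid,
        owner_gid=st.st_gid,
        mode=stat.filemode(st.st_mode),
        # os.access checks the real uid/gid; with docker -u, real and effective ids match
        readable=os.access(path, os.R_OK),
        writable=os.access(path, os.W_OK),
        executable=os.access(path, os.X_OK),
    )
    return info


if __name__ == "__main__":
    for p in sys.argv[1:]:
        print(p, describe_access(p))
```

For the failing image, a PermissionError on stat() for paths under /mnt/disks would point at the mount itself rather than at the file mode bits.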

Additionally, running gcloud beta batch jobs describe projects/<project-id>/locations/us-west1/jobs/<my-nf-job> --format json gives consistent output, for example:

[screenshot: job description output]

Environment

Additional context

Other attempts to address the issue

I am wondering whether an explicit option is needed here, similar to the fixOwnership option in the docker config scope.

bentsherman commented 6 months ago

Minor typo, not sure if it's what you actually tried:

process { containerOptions = "-u 1000:1000" }
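For reference, Docker's -u/--user flag accepts uid[:gid] (or a user or group name), so -u 1000:1000 should run the task as UID/GID 1000, which is typically what adduser assigns to the first regular user (here worker) on Debian. If it helps to check what a containerOptions string actually requests, a minimal parser could look like this; user_from_container_options is a hypothetical helper, not a Nextflow function, and it only recognizes the "-u VALUE", "--user VALUE" and "--user=VALUE" forms.

```python
import shlex
from typing import Optional, Tuple


def user_from_container_options(opts: str) -> Optional[Tuple[str, Optional[str]]]:
    """Return (user, group) from a -u/--user flag in a docker options string, if present."""
    args = shlex.split(opts)
    for i, arg in enumerate(args):
        value = None
        if arg in ("-u", "--user") and i + 1 < len(args):
            value = args[i + 1]
        elif arg.startswith("--user="):
            value = arg.split("=", 1)[1]
        if value is not None:
            user, _, group = value.partition(":")
            return user, (group or None)  # group is optional in uid[:gid]
    return None


print(user_from_container_options("-u 1000:1000"))  # ('1000', '1000')
```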
JohnWalshTempus commented 6 months ago

Thanks, I corrected that, but the /mnt/disks/** access issue persists.

The containerOptions value is propagated to my Batch runnable definition, as shown when I describe the job with gcloud:

[screenshot: runnable definition including the container options]
bentsherman commented 6 months ago

We likely need to look at the GCS mount options to see if there is anything related to permissions: https://github.com/nextflow-io/nextflow/blob/fd27fbc16a4a503c7292f3a22a35692c812141f3/plugins/nf-google/src/main/nextflow/cloud/google/batch/GoogleBatchScriptLauncher.groovy#L126-L142

If you can submit a job through gcloud, experiment with these options, and find something that works, it should be trivial to update in Nextflow.
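One way to run that experiment outside Nextflow is to write a Batch job config by hand and submit it with gcloud batch jobs submit JOB_NAME --location=us-west1 --config=job.json. The sketch below prints such a config; it assumes the Batch v1 JSON field names (taskGroups, taskSpec, volumes[].gcs.remotePath, mountPath, mountOptions, runnables[].container with imageUri, commands, options, volumes), and the bucket name, test command, and mount options are placeholders to vary. It is not how Nextflow builds its jobs.

```python
import json
from typing import List


def batch_job_config(bucket: str, image: str, mount_options: List[str],
                     container_options: str = "-u 1000:1000") -> dict:
    """Build a minimal Batch job config that mounts `bucket` and lists it as a non-root user."""
    mount_path = f"/mnt/disks/{bucket}"
    return {
        "taskGroups": [{
            "taskSpec": {
                "volumes": [{
                    "gcs": {"remotePath": bucket},
                    "mountPath": mount_path,
                    # Passed through to gcsfuse: these are the options under test
                    "mountOptions": mount_options,
                }],
                "runnables": [{
                    "container": {
                        "imageUri": image,
                        # Print the task identity, then try to read the mounted bucket
                        "commands": ["/bin/bash", "-c", f"id && ls -la {mount_path}"],
                        "options": container_options,
                        # Bind-mount the host-side gcsfuse mount into the container
                        "volumes": [f"{mount_path}:{mount_path}:rw"],
                    }
                }],
            },
        }],
        "logsPolicy": {"destination": "CLOUD_LOGGING"},
    }


if __name__ == "__main__":
    cfg = batch_job_config(
        bucket="my-bucket",  # placeholder bucket name
        image="jvwalsh/nextflow-non-root-user:latest",
        mount_options=["-o rw,allow_other", "--uid=1000", "--gid=1000", "-implicit-dirs"],
    )
    print(json.dumps(cfg, indent=2))  # redirect to job.json
```

Swapping the mount_options list and comparing the Cloud Logging output of the id/ls command is a quick way to see which combination lets UID 1000 read the bucket.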

JohnWalshTempus commented 6 months ago

So far I've had success with either of the options below, but the job fails when allow_other is removed. allow_other was removed, for reasons I'm not sure of, in https://github.com/nextflow-io/nextflow/pull/4332. I've found these docs on its security implications: https://github.com/torvalds/linux/blob/a33f32244d8550da8b4a26e277ce07d5c6d158b5/Documentation/filesystems/fuse.txt#L218-L310

.addAllMountOptions( ['-o rw,allow_other', '--file-mode=777', '--dir-mode=777', '-implicit-dirs'] ) // working option 1
.addAllMountOptions( ['-o rw,allow_other', '--uid=1000', '--gid=1000', '-implicit-dirs'] ) // working option 2
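A simplified model of why both working variants succeed and why dropping allow_other fails: without allow_other, FUSE rejects access from every user other than the one who mounted the filesystem (root on the Batch VM), regardless of mode bits; with it, the usual owner/group/other bits apply, and gcsfuse reports every file with the owner given by --uid/--gid and the bits given by --file-mode. The code below is a hypothetical illustration of that logic, not gcsfuse or kernel code; the defaults (owner = the mounting user, file mode 644) are assumptions based on gcsfuse's documented defaults, and the last line is a prediction of the model, not something tested in this thread.

```python
from dataclasses import dataclass

# Permission bits a caller may request
R, W, X = 4, 2, 1


@dataclass
class FuseMount:
    """Hypothetical summary of a gcsfuse mount, as seen by the kernel."""
    mounter_uid: int = 0       # user that started gcsfuse (assumed root on the Batch VM)
    allow_other: bool = False  # the "-o allow_other" FUSE option
    uid: int = 0               # --uid: owner reported for every file (assumed default: the mounter)
    gid: int = 0               # --gid: group reported for every file
    file_mode: int = 0o644     # --file-mode (assumed gcsfuse default)


def can_access(m: FuseMount, uid: int, gid: int, want: int) -> bool:
    """Simplified model: may a process running as uid:gid get `want` bits on a file?"""
    if uid != m.mounter_uid and not m.allow_other:
        # Without allow_other, FUSE refuses every user except the mounter
        return False
    if uid == 0:
        return True  # root bypasses mode bits (execute-bit subtleties ignored)
    if uid == m.uid:
        bits = (m.file_mode >> 6) & 0o7  # owner bits
    elif gid == m.gid:
        bits = (m.file_mode >> 3) & 0o7  # group bits
    else:
        bits = m.file_mode & 0o7         # other bits
    return (bits & want) == want


worker = (1000, 1000)  # uid:gid requested via containerOptions "-u 1000:1000"

print(can_access(FuseMount(), *worker, R))  # False: no allow_other
print(can_access(FuseMount(), 0, 0, R))     # True: root is the mounter
print(can_access(FuseMount(allow_other=True, file_mode=0o777), *worker, R | W))     # True: option 1
print(can_access(FuseMount(allow_other=True, uid=1000, gid=1000), *worker, R | W))  # True: option 2
print(can_access(FuseMount(allow_other=True), *worker, W))  # False: root-owned 644 files
```

In this model, option 2 is the narrower grant: only UID/GID 1000 gains access, whereas option 1 opens every file to all users on the VM.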