nextflow-io / nextflow

A DSL for data-driven computational pipelines
http://nextflow.io
Apache License 2.0

Support scratchless execution with Google Cloud Batch #5125

Closed siddharthab closed 2 months ago

siddharthab commented 2 months ago

Bug report

Expected behavior and actual behavior

Because Nextflow uses Cloud Storage volumes by default with Cloud Batch, one could assume that scratch is no longer needed, since the Cloud Storage volume takes care of staging data in scratch space and then moving it to Cloud Storage. However, when I set process.scratch = false, all my processes fail with messages like: /bin/bash: /mnt/disks/[workdir-bucket]/[workdir-prefix]/[task-id]/.command.sh: Too many levels of symbolic links

Steps to reproduce the problem

nextflow.config:

process.executor = 'google-batch'
process.scratch = false
process.container = 'ubuntu'
docker.enabled = true

google {
  ... project name and location ...
}

main.nf (same as in tutorial):

params.str = 'Hello world!'

process splitLetters {
    output:
    path 'chunk_*'

    """
    printf '${params.str}' | split -b 6 - chunk_
    """
}

process convertToUpper {
    input:
    path x

    output:
    stdout

    """
    cat $x | tr '[a-z]' '[A-Z]'
    """
}

workflow {
    splitLetters | flatten | convertToUpper | view { it.trim() }
}

Run with:

nextflow run main.nf -w gs://[workdir-bucket]/[workdir-prefix]

Program output

executor >  google-batch (1)
[30/7bdc02] process > splitLetters   [100%] 1 of 1, failed: 1 ✘
[-        ] process > convertToUpper -
ERROR ~ Error executing process > 'splitLetters'

Caused by:
  Process `splitLetters` terminated with an error exit status (1)

Command executed:

  printf 'Hello world!' | split -b 6 - chunk_

Command exit status:
  1

Command output:
  /bin/bash: /mnt/disks/[workdir-bucket]/[workdir-prefix]/[task-id]/.command.sh: Too many levels of symbolic links

Command error:
  /bin/bash: /mnt/disks/[workdir-bucket]/[workdir-prefix]/[task-id]/.command.sh: Too many levels of symbolic links

Work dir:
  gs://[workdir-bucket]/[workdir-prefix]/[task-id]

Tip: when you have fixed the problem you can continue the execution adding the option `-resume` to the run command line

 -- Check '.nextflow.log' file for details

Environment

pditommaso commented 2 months ago

The storage volume is needed precisely so it can serve as temporary space when using scratch = true. Setting scratch = false causes the task to work directly in the bucket via gcsfuse, which results in the error you are experiencing.

scratch = false is only supported when using the Fusion file system; see here for details.

siddharthab commented 2 months ago

By storage volume, I meant GCSFuse. The mounting solution is called "Cloud Storage Volume" in Google Cloud Batch.

The work directory bucket is mounted through GCSFuse already, so I assumed that it is OK for Nextflow to work directly in the mounted directory. And was surprised that it did not work.

I don't see why Fusion and GCSFuse need to be different here. They are both FUSE file systems. The documentation for Fusion also says that it enables the work directory to be the mounted cloud directory, forgoing the need for scratch space.

pditommaso commented 2 months ago

They are both Fuse file systems

That's the same as saying all cars are equal because they have four wheels.

bentsherman commented 2 months ago

I thought gcsfuse always worked without scratch storage, but now I see that the google batch executor sets scratch to true by default:

https://github.com/nextflow-io/nextflow/blob/12b027ee7e70d65bdee912856478894af4602170/plugins/nf-google/src/main/nextflow/cloud/google/batch/GoogleBatchScriptLauncher.groovy#L92-L94

I wonder if this error is the same as #4845

siddharthab commented 2 months ago

Potentially it's related. The problem I am seeing is that the symlink is pointing to itself. I don't know if this bug is coming from Nextflow or from GCSFuse.

% gcloud storage ls --full gs://[REDACTED]-scratch/nextflow-work/sidb-scratch-test/c1/e8b37f991cc3ab2a636e6af8e663e0/.command.sh
gs://[REDACTED]-scratch/nextflow-work/sidb-scratch-test/c1/e8b37f991cc3ab2a636e6af8e663e0/.command.sh:
  Creation Time:               2024-07-17T18:39:45Z
  Update Time:                 2024-07-17T18:39:45Z
  Storage Class Update Time:   2024-07-17T18:39:45Z
  Storage Class:               STANDARD
  Content-Length:              0
  Content-Type:                text/plain; charset=utf-8
  Additional Properties:
  {
    "gcsfuse_symlink_target": "/mnt/disks/[REDACTED]-scratch/nextflow-work/sidb-scratch-test/c1/e8b37f991cc3ab2a636e6af8e663e0/.command.sh"
  }
  Hash (CRC32C):               AAAAAA==
  Hash (MD5):                  1B2M2Y8AsgTpgAmY7PhCfg==
  ETag:                        CMaAmsrcrocDEAE=
  Generation:                  1721241585483846
  Metageneration:              1
  ACL:                         []
TOTAL: 1 objects, 0 bytes (0B)
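For illustration, this class of error can be reproduced locally with a self-referential symlink, independent of gcsfuse (the temp-dir path here is hypothetical and unrelated to the mount):

```shell
# Create a symlink whose target is its own name; resolving it loops forever,
# so any attempt to read or execute it fails with ELOOP, which bash reports
# as "Too many levels of symbolic links".
cd "$(mktemp -d)"
ln -s self.sh self.sh
bash self.sh 2>&1 | grep -o 'Too many levels of symbolic links'
```

This matches the metadata above: the object's `gcsfuse_symlink_target` resolves back to the same `.command.sh` path, so opening the script through the mount loops in the same way.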
siddharthab commented 1 month ago

I tried to look into what scratch means in the context of google-batch. It seems the stage step simply symlinks files from the gcsfuse mount, so it is effectively equivalent to scratchless behavior. On exit, the unstage step copies files from the current directory back to the gcsfuse paths. I suppose the main difference is that with scratch enabled, all output files are written out at the end of the whole process, whereas with scratch disabled, output files start being written out as soon as they are closed.

A major difference with Fusion would also be the automatic use of local SSDs for /tmp. And of course, Fusion could be more optimized than gcsfuse.
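A minimal sketch of the stage/unstage behavior described above, using plain temp directories in place of the gcsfuse mount and the local scratch dir (all paths hypothetical):

```shell
# scratch=true in miniature: inputs are staged into the task dir as symlinks
# (no copy), the task works locally, and outputs are copied back to the
# "mount" only once the task finishes.
MOUNT=$(mktemp -d)   # stands in for the gcsfuse-mounted work dir
WORK=$(mktemp -d)    # stands in for the local scratch dir
printf 'hello' > "$MOUNT/input.txt"
ln -s "$MOUNT/input.txt" "$WORK/input.txt"                     # stage: symlink only
( cd "$WORK" && tr '[a-z]' '[A-Z]' < input.txt > output.txt )  # task runs locally
cp "$WORK/output.txt" "$MOUNT/output.txt"                      # unstage: copy at the end
cat "$MOUNT/output.txt"
```

With scratch disabled there is no `$WORK` at all: the task would `cd` straight into `$MOUNT`, so writes hit the mount as files are closed rather than in one copy step at the end.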

siddharthab commented 1 month ago

I thought gcsfuse always worked without scratch storage, but now I see that the google batch executor sets scratch to true by default

@bentsherman Sent #5256 for the error I encountered. I included some commentary as to what it means to have a scratch dir vs not when using Google Batch.