Closed Puumanamana closed 1 year ago
The disk size is indeed defined in the code.
@jordeu, you mentioned some constraints regarding this; however, I think it should be parametrised somehow.
@Puumanamana, I believe your task is failing because it uses the boot disk as its local temporary work directory.
Try adding `process.scratch = false` to your `nextflow.config` file.
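For completeness, a minimal `nextflow.config` sketch of this suggestion (the executor, Fusion, and Wave lines are assumptions about a typical Fusion-on-Google-Batch setup, not something stated above):

```groovy
// nextflow.config -- minimal sketch, assuming a typical Fusion setup
process.scratch  = false          // the setting suggested above
process.executor = 'google-batch' // assumed executor for this thread
fusion.enabled   = true           // assumed: Fusion is already in use here
wave.enabled     = true           // assumed: Fusion requires Wave
```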
Fusion uses local SSDs, which are different from the network-storage boot disk. You cannot choose the size of a local SSD (each one is 375 GB), only how many you want: https://cloud.google.com/compute/docs/disks/local-ssd
The intended use of Fusion is to always attach a single local SSD that is used as a cache, and to run Nextflow with `process.scratch = false`. This way you do not need to worry about each process's local disk size; the process runs on top of Fusion with "infinite" capacity.
The default GCP quota is 6 TB, so by default this allows 16 concurrent processes. But I agree that, like all other GCP quotas, the defaults are really low.
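The arithmetic behind those numbers, as a quick sketch (the 375 GB per local SSD is from the GCP page linked above; the quota values are the 6 TB default and the 30000 GB limit that appears in the error message further down):

```python
# Each GCP local SSD has a fixed size of 375 GB; the regional
# LOCAL_SSD_TOTAL_GB quota caps the total attached at any one time.
LOCAL_SSD_GB = 375

def max_concurrent_tasks(quota_gb: int, ssds_per_task: int = 1) -> int:
    """Upper bound on tasks that can each hold `ssds_per_task` local SSDs."""
    return quota_gb // (LOCAL_SSD_GB * ssds_per_task)

print(max_concurrent_tasks(6_000))    # 6 TB default quota  -> 16
print(max_concurrent_tasks(30_000))   # quota from the error -> 80
```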
Maybe we could make it optional whether Fusion uses a local SSD as cache or the default network boot disk. The problem with using the network boot disk is that it doubles the network usage, so Fusion's performance should decrease a bit, but it may still be good enough for many use cases; we should do some testing first.
@pditommaso, when trying to build a reproducible test I did indeed forget to set `process.scratch = false`. But my original issue came from running many concurrent processes in parallel with Fusion and getting the following error (actually a warning):
```
Mar-20 00:22:45.773 [Task monitor] WARN n.c.g.batch.GoogleBatchTaskHandler - Batch job cannot be run: VM in Managed Instance Group meets error: Batch Error: code - CODE_GCE_QUOTA_EXCEEDED, description - error count is 2, latest message example: Instance 'nf-8908fe23-167927-e3b48cca-de94-46750-group0-0-pn66' creation failed: Quota 'LOCAL_SSD_TOTAL_GB' exceeded. Limit: 30000.0 in region us-central1..
```
I wanted to reduce the disk size requirement for my processes so I could run more of them, but I see now that's not possible. 16 is very restrictive (probably a bit more in my case, but I'd like to run thousands in parallel), so maybe Fusion is not the right thing for my use case. @jordeu, I'm not familiar with network boot disks; would that still count as a Fusion filesystem (all processes sharing the same disk space)? Alternatively, is there any way to mount multiple SSDs to increase the limit?
Yes, Fusion can also work using a network disk as a temporary cache (we've tested this on AWS with good results). I'll also test this setup on GCP, and then we can evaluate making the use of local SSDs optional.
I was thinking about how to make this configurable; a nice way would be to let you define, at the process level, whether you want to use a local SSD disk or not.
What do you think of allowing the disk to be defined like this:
```nextflow
process using_ssd {
    disk 'local-ssd: 1'
    ...
}

process using_network_disk {
    disk '10 GB'
    ...
}
```
I need to do some benchmarks, but most likely the results with a network disk are good enough. This way, Fusion could use an SSD disk only for some processes, while others use a normal network disk.
The nice thing about this solution is that it also makes sense without Fusion. If you set `disk 'local-ssd: 2'`, Nextflow will mount two local SSD disks as `/tmp` in the container and use them as scratch (as it does with network disks).
The recommended way to run Fusion would be with `process.disk = "local-ssd: 1"`, but any combination of them would also make sense.
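If this proposal were adopted, a pipeline mixing both disk types might look like the sketch below (hypothetical: the `local-ssd: N` syntax is only a proposal in this thread, not a released feature, and the process names are made up):

```nextflow
// Hypothetical syntax -- 'local-ssd: N' is a proposal, not a released feature
process heavy_io {
    disk 'local-ssd: 1'   // one fixed-size 375 GB local SSD mounted as /tmp
    // ...
}

process light_io {
    disk '10 GB'          // network-attached disk used as /tmp
    // ...
}
```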
I like that this way we decouple the definition of the kind and size of disk used as `/tmp` from the use of Fusion.
That makes sense. Just a clarification though: at the moment with google-batch, the boot disk (set with `google.batch.bootDiskSize`) is an SSD disk, right? And, still with google-batch, setting `disk` on top of that just makes the boot disk larger. So with your proposal the user would have more flexibility in the type of disk to use, which seems great to me.
I'm also a bit confused by this whole `process.scratch = true/false` business and how it plays into the new layout (does it always have to be false, and if not, what happens?).
Really, my main issue at the moment is running hundreds of concurrent jobs; whether I use Fusion or not, I'm limited by the SSD quota (but more so with Fusion).
> Just a clarification though: at the moment with google-batch, the boot disk (set with `google.batch.bootDiskSize`) is an SSD disk, right?
Yes, but it's a network-attached SSD disk. So the disk is an SSD, but it lives on another machine, while a "local-ssd" is an SSD in the same machine where your instance is running.
> I'm also a bit confused by this whole `process.scratch = true/false` business and how it plays into the new layout (does it always have to be false, and if not, what happens?)
By default `process.scratch` is set to `true`, which means that Nextflow uses the Google Storage utilities to download/upload input/output files to/from a temporary folder in the container.
You can only set `process.scratch = false` when the working directory is a shared filesystem (so there is no need to download/upload files). This is not the case when you use an object storage like GS. What Fusion does is provide this POSIX shared filesystem directly on top of object storage. So currently there is no other way to set up a shared filesystem with Google Batch and Nextflow (it will become possible when #3630 is added).
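To illustrate the `process.scratch = true` data flow described above, here is a much-simplified pseudocode sketch of what a task wrapper effectively does (the real `.command.run` script generated by Nextflow is far more involved, and the paths and bucket here are made up):

```
# Pseudocode sketch -- NOT the actual Nextflow wrapper script
mkdir -p /tmp/scratch && cd /tmp/scratch               # local scratch dir
gsutil -q cp gs://my-bucket/work/ab/cd1234/inputs/* .  # stage inputs in
bash .command.sh                                       # run the task
gsutil -q cp outputs/* gs://my-bucket/work/ab/cd1234/  # stage outputs out
```

With Fusion and `process.scratch = false`, the copy steps disappear: the task runs directly against the Fusion-mounted object storage path.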
Thank you, good to know!
Bug report

When using Fusion filesystems on google-batch, setting `disk` doesn't set the size of the SSD used by Fusion; a ~350 GB SSD is used instead. Aside from cost, this is a problem for me since my GCP project has quotas set (I can't use more than a certain amount of SSD storage), which greatly limits the number of tasks I can run in parallel. My pipelines end up failing if too many tasks are submitted at the same time.

Expected behavior and actual behavior

Actual: there's no way of setting the size of the SSD. Expected: either setting `disk` with the Fusion filesystem enabled sets the size of the SSD, or another directive is available to set it.

Steps to reproduce the problem
The following script fails as expected when Fusion is not enabled, but not with the Fusion filesystem enabled (`-profile fusion`).

main.nf

nextflow.config
Program output

Without Fusion, this returns an I/O error for the 30 GB input, as expected (side question: is a null exit status expected here?)

Output:

With Fusion, no error:

In addition, here's the stdout for the Fusion run:

We can see there's an NVMe disk attached with a ~350 GB SSD.

.nextflow.log (with fusion)
Environment

Shell (`$SHELL --version`): zsh 5.8 (x86_64-ubuntu-linux-gnu)