nextflow-io / nextflow

A DSL for data-driven computational pipelines
http://nextflow.io
Apache License 2.0

Specify Google Cloud Compute Engine disk type #1444

Closed twbattaglia closed 1 year ago

twbattaglia commented 4 years ago

New feature

Ability to specify the Compute Engine disk type (pd-standard or local-SSD) found in the new Cloud Life Sciences API (https://cloud.google.com/life-sciences/docs/reference/rpc/google.cloud.lifesciences.v2beta#disk).

Usage scenario

Jobs that require high input/output operations per second (IOPS) and lower latency (https://cloud.google.com/compute/docs/disks/local-ssd).

Suggested implementation

The API documentation states that the disk type can be set using setType() (https://developers.google.com/resources/api-libraries/documentation/genomics/v1alpha2/java/latest/com/google/api/services/genomics/model/Disk.html#setType-java.lang.String-).

Add the disk type when the VM is configured in GoogleLifeSciencesHelper.groovy:

protected Resources createResources(GoogleLifeSciencesSubmitRequest req) {
        def disk = new Disk()
        disk.setName(req.diskName)
        disk.setSizeGb(req.diskSizeGb)
        disk.setType(req.diskType)   // new: apply the requested disk type
        // ...
    }

where req.diskType is set in GoogleLifeSciencesTaskHandler.groovy:

    req.bootDiskSizeGb = executor.config.bootDiskSize?.toGiga() as Integer
    req.diskType = task.config.getDiskType() as String   // new: propagate the disk type
    return req

getDiskType() can be defined in TaskConfig.groovy, defaulting to pd-standard:

    String getDiskType() {
        final value = get('diskType')?.toString()
        // fall back to pd-standard when the value is unset or unrecognized
        return value in ['pd-standard', 'local-ssd'] ? value : 'pd-standard'
    }
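Under this proposal, a pipeline could then request a local SSD per task. A hypothetical usage sketch (the `diskType` directive name is this proposal's suggestion, not an existing Nextflow option; the process and tool are illustrative):

```groovy
process fastAlign {
    // hypothetical directive from this proposal: request a local SSD
    diskType 'local-ssd'
    disk 375.GB

    input:
    path reads

    script:
    """
    bwa mem ref.fa ${reads} > out.sam
    """
}
```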

Preliminary tests showed that a Compute Engine instance with a local SSD attached was created successfully.

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

lescai commented 4 years ago

I would definitely support this. The key logic of Nextflow is a little challenged on the cloud: unless one has a shared disk that can be mounted by all task VMs, each task copies files back and forth to/from the bucket instead of using symlinks as on-prem. This behaviour hugely multiplies costs by increasing both I/O and runtime. The ability to specify the disk type could change the IOPS of the VMs and improve performance on worker VMs. This feature would help optimize Nextflow pipelines on the cloud. Quite important :)

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

pditommaso commented 3 years ago

Bump

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

bentsherman commented 2 years ago

We should be able to support this feature for both google life sciences and google batch. I think the best way to support it in Nextflow would be to add a DiskResource class so that the disk type can be specified in the disk directive, like with accelerator. I have laid the groundwork for this in #3027, so when we merge that PR then I can implement the disk type.

Puumanamana commented 1 year ago

+1, would be very useful for tasks like fasterq-dump

pditommaso commented 1 year ago

Google Batch does support SSD disks when using the Fusion file system. See here:

https://www.nextflow.io/docs/edge/google.html#fusion-file-system
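For reference, a minimal configuration sketch for enabling Fusion with the Google Batch executor (option names follow the Nextflow docs linked above and are version-dependent; the project id and location are placeholders):

```groovy
// nextflow.config -- sketch, assuming a recent Nextflow release with Fusion support
process.executor = 'google-batch'
google.project   = 'my-project'      // hypothetical project id
google.location  = 'us-central1'
wave.enabled     = true              // Fusion containers are provisioned via Wave
fusion.enabled   = true
```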

bentsherman commented 1 year ago

Support for disk type was added to Google Batch in #3861 . We aren't really adding new features to the google-lifesciences executor because we encourage users to migrate to Google Batch, so I'm going to close this issue.
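With that change, the disk type can be passed as an option to the existing `disk` directive when running on Google Batch, along these lines (a sketch; see the Nextflow docs for the exact set of supported disk types and size constraints):

```groovy
process sraDownload {
    // request a local SSD scratch disk on Google Batch
    disk 375.GB, type: 'local-ssd'

    script:
    """
    fasterq-dump SRR000001
    """
}
```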