nextflow-io / nextflow

A DSL for data-driven computational pipelines
http://nextflow.io
Apache License 2.0

Support for ipcMode in AWS batch #4979

Open nhammond opened 1 month ago

nhammond commented 1 month ago

New feature

I would like to see the ipcMode=host setting supported in AWS Batch to enable shared memory between processes.

Usage scenario

This came up when trying to run a workflow with many parallel STAR align processes. STAR has a shared memory option that is controlled with the "--genomeLoad" flag. For example, with "--genomeLoad LoadAndKeep", a process loads the genome into memory only if another process has not already loaded it, and on completion it removes the genome from memory only if no other process is using it. This works fine with Docker containers on the same host by using "docker run --ipc host ...", and AWS Batch supports an 'ipcMode: "host"' setting in the Batch job definition. It would be nice to enable this option via Nextflow. Using a pre-loaded genome reduces startup time by about 1 minute and reduces the RAM needed for each process from about 32 GB to about 4 GB, so the impact is significant when there are many STAR align processes.
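For reference, here is a minimal sketch of the Docker invocation described above, built as an argument list in Python. The image name and paths are hypothetical placeholders, and the read-only genome mount is my own assumption about how the index would be made available to the container.

```python
import shlex


def star_docker_command(image, genome_dir, reads, threads=4):
    """Build a `docker run` argv that shares the host IPC namespace with STAR.

    `image`, `genome_dir` and `reads` are caller-supplied; nothing here is
    specific to Nextflow or AWS Batch.
    """
    star = [
        "STAR",
        "--runThreadN", str(threads),
        "--genomeDir", genome_dir,
        # Reuse a genome already in shared memory, or load it and keep it there.
        "--genomeLoad", "LoadAndKeep",
        "--readFilesIn", *reads,
    ]
    return [
        "docker", "run", "--rm",
        # Share the host IPC namespace so containers see the same shared memory.
        "--ipc", "host",
        # Assumption: the genome index lives on the host and is mounted read-only.
        "-v", f"{genome_dir}:{genome_dir}:ro",
        image,
        *star,
    ]


if __name__ == "__main__":
    cmd = star_docker_command(
        "example/star:latest",  # hypothetical image name
        "/data/genome",         # hypothetical genome index path
        ["/data/sample_R1.fastq", "/data/sample_R2.fastq"],
    )
    print(shlex.join(cmd))
```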

Suggest implementation

It appears there are two versions of Batch job definitions, using either ContainerProperties or EcsProperties, as described here: Creating job definitions using EcsProperties - AWS Batch. Nextflow is using the legacy "ContainerProperties" job definition, which does not support ipcMode or the other options described in the link above (not sure if any of the other impacted options are significant for Nextflow users: dependsOn, essential, name, and pidMode).

I believe enabling ipcMode would require adding support for the new EcsProperties job definition. This could continue to support the current containerOptions directive, and ipcMode could be added to the supported containerOptions.
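To make the shape concrete, here is a rough sketch of what an EcsProperties-style job definition payload with ipcMode might look like, written as a plain Python dict. The field names follow my reading of the AWS Batch RegisterJobDefinition API; the helper name, the container name "main", and the placeholder command are hypothetical, and this is not how Nextflow builds its job definitions today.

```python
VALID_IPC_MODES = {"host", "task", "none"}


def ecs_job_definition(name, image, vcpus, memory_mib, ipc_mode="host"):
    """Return a RegisterJobDefinition-style payload that uses ecsProperties."""
    if ipc_mode not in VALID_IPC_MODES:
        raise ValueError(f"unsupported ipcMode: {ipc_mode!r}")
    return {
        "jobDefinitionName": name,
        "type": "container",
        "ecsProperties": {
            "taskProperties": [
                {
                    # ipcMode is set on the task, not on the individual container.
                    "ipcMode": ipc_mode,
                    "containers": [
                        {
                            "name": "main",       # hypothetical container name
                            "image": image,
                            "command": ["true"],  # placeholder, overridden at submit time
                            "essential": True,
                            "resourceRequirements": [
                                {"type": "VCPU", "value": str(vcpus)},
                                {"type": "MEMORY", "value": str(memory_mib)},
                            ],
                        }
                    ],
                }
            ]
        },
    }
```

Assuming a recent boto3, a dict like this could presumably be passed as `batch.register_job_definition(**definition)`, but I have not tested that against Nextflow's compute environments.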

In addition to updating the job definition format, it appears there are overrides applied to the job definition at runtime whose structure would also need to change, from ContainerOverrides to TaskContainerOverrides. (This last point prevents the workaround of manually updating the job definition to EcsProperties format with "ipcMode=host" and continuing to run Nextflow, as this raises the error "Container overrides should not be set for ecsProperties jobs.")
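As an illustration of the override change, the sketch below maps a ContainerOverrides-style dict onto the nested structure that SubmitJob appears to expect under `ecsPropertiesOverride`. The helper name is hypothetical, and the set of fields carried over (command, environment, resourceRequirements) is my assumption about what TaskContainerOverrides accepts.

```python
def to_ecs_overrides(container_overrides, container_name):
    """Convert a containerOverrides-style dict to an ecsPropertiesOverride-style dict.

    Only fields assumed to exist on TaskContainerOverrides are copied; anything
    else (for example the legacy top-level vcpus/memory) is dropped.
    """
    allowed = ("command", "environment", "resourceRequirements")
    # Task container overrides are matched to the job definition by container name.
    task_container = {"name": container_name}
    for key in allowed:
        if key in container_overrides:
            task_container[key] = container_overrides[key]
    return {"taskProperties": [{"containers": [task_container]}]}


if __name__ == "__main__":
    legacy = {
        "command": ["bash", "-c", "echo hello"],
        "environment": [{"name": "NXF_DEBUG", "value": "1"}],
        "resourceRequirements": [{"type": "MEMORY", "value": "4096"}],
    }
    print(to_ecs_overrides(legacy, "main"))
```

The result would go in the `ecsPropertiesOverride` field of the submit request in place of `containerOverrides`.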

bentsherman commented 1 month ago

AWS Batch can be pretty opaque about how it packs tasks onto the same VM. How can you be sure that the STAR align tasks will be packed together if you have other processes running at the same time?

nhammond commented 1 month ago

You're right, there is no control over that. In practice, when running one workflow at a time or kicking off a batch of workflows together, these STAR align jobs tend to flood all available Batch instances for a window of time, so setting aside some memory for the shared genome could work well. If there are many different jobs running at different stages of the workflow, it would be hard to get the memory allocations right. It's something I would like to try, but I have the same concern as you.
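A back-of-the-envelope sketch of the memory trade-off, using the roughly 32 GB and 4 GB per-process figures from the usage scenario. The 128 GB instance size and the 28 GB shared genome footprint (taken as the difference between the two figures) are assumptions for illustration only.

```python
def max_star_jobs(host_mem_gb, per_job_gb, shared_genome_gb=0):
    """How many STAR jobs fit on one host after reserving memory for a shared genome."""
    usable = host_mem_gb - shared_genome_gb
    return max(0, int(usable // per_job_gb))


if __name__ == "__main__":
    host = 128  # hypothetical instance memory in GB
    # Each process loads its own copy of the genome (about 32 GB each).
    print(max_star_jobs(host, per_job_gb=32))
    # One shared copy (assumed 28 GB) plus about 4 GB per process.
    print(max_star_jobs(host, per_job_gb=4, shared_genome_gb=28))
```

Under these assumptions the host fits 4 jobs without sharing and 25 with it, but only if the instance is running nothing except STAR align tasks, which is exactly the packing concern raised above.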

Is there anything else pushing us toward updating from the legacy Batch job definitions? ipcMode support would become trivial if we were already using the EcsProperties job definitions, but I know that change is a heavy lift just to enable this edge use case.

bentsherman commented 1 month ago

We are planning to migrate to the AWS SDK v2 (#4741). Not sure if that's required to support EcsProperties, but it's certainly something we could fold into that effort.

nhammond commented 1 month ago

It looks like both versions of the job definition are supported by both versions of the SDK, but I would still be happy to see the switch to EcsProperties job definitions and support for ipcMode included in that effort.

AWS Java SDK v1.x:

AWS Java SDK v2.x: