pditommaso opened 7 years ago
The key benefit of using AWS is that it provides a simple mechanism to run workflows in an environment that autoscales. At the extreme you can have zero permanently running instances (and almost zero cost), starting instances on demand when there are jobs to execute, while still having highly elastic compute capacity for large jobs. It also gives good control over limiting costs (including the use of spot instances).
The obvious downside is that you are tied to using AWS. The autoscaling capabilities can in principle be handled in other systems, but it is relatively hard work to do so.
We went over the overall process for executing Nextflow jobs on AWS Batch. The code is currently on the aws-batch branch. In doing so we identified a significant impediment: it was necessary to have the AWS command line tools present in every Docker container being used. Managing alternative versions of lots of Docker images is pretty much a non-starter for most organisations (especially as these tools have complex dependencies such as Python), so we looked for an alternative solution.
The reason the AWS CLI is needed is that each task stages its input and output files to and from S3 using commands of the form:
aws s3 ...
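For illustration, the staging a task wrapper performs looks along these lines (the bucket and file names here are hypothetical, not taken from the actual implementation):

```shell
# Download the task's inputs from the S3 work directory (paths are examples)
aws s3 cp s3://my-bucket/work/input.fa .

# ... run the actual task command ...

# Upload the task's outputs back to S3
aws s3 cp output.txt s3://my-bucket/work/
```

Since these commands run inside the task's container, every Docker image used by a pipeline would need `aws` on its PATH, which is the impediment described above.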
The approach identified involves installing the AWS command line tools on the AWS image that is used as the Docker host machine, and mounting the directory that contains the necessary items into each of the Docker containers as a volume. This is a bit hacky, and needs checking for how portable it is, but initial testing shows that it works.
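A minimal sketch of the volume-mount idea (all paths are hypothetical and depend on where the CLI is installed on the host image):

```shell
# The host machine has a self-contained AWS CLI install (e.g. under a
# Miniconda prefix) so that its Python dependencies travel with it:
#   /home/ec2-user/miniconda/bin/aws
# Mount that directory read-only into the task container:
docker run --rm \
  -v /home/ec2-user/miniconda:/home/ec2-user/miniconda:ro \
  my-task-image \
  /home/ec2-user/miniconda/bin/aws --version
```

The install on the host must be self-contained: a plain system install of the CLI would depend on the host's system Python, which is not present inside the container, so a bundled or conda-based install is assumed here.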
If using this approach you will need to specify the `aws_cli` parameter as part of the batch executor definition to point to the location at which these CLI goodies are located. If this is not specified then it is assumed that `aws` is on the PATH of the container.
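A configuration sketch, assuming the `aws_cli` parameter described above sits in the executor definition (the queue, bucket, and path values are placeholders, not real settings):

```groovy
// nextflow.config sketch -- everything except aws_cli is illustrative
process.executor = 'awsbatch'
process.queue    = 'my-batch-queue'
workDir          = 's3://my-bucket/work'

executor {
    aws_cli = '/home/ec2-user/miniconda/bin/aws'  // where the mounted CLI lives
}
```

If `aws_cli` is omitted, the executor falls back to invoking plain `aws`, i.e. it assumes the CLI is already on the container's PATH.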
Current Status
Code and docs are on the aws-batch branch and have had some preliminary testing to prove the approach works. Further testing is required.
Alternative approaches
Other approaches can be considered if there are suggestions.
One suggestion was to use a dedicated Docker image that contains the AWS CLI as a 'sidecar' image to do the copying, but it wasn't clear how this could work.
Another is to avoid use of S3 and provide the necessary files through mounted volumes (presumably this only works if NF is being executed from within AWS). In principle this seems possible but it needs to be tested in practice.
AWS Batch integration
Nextflow has experimental support for AWS Batch. The goal of this project is to stabilise the current implementation, to add missing features, and to make it able to process real-world pipelines.
Data:
(to be provided)
Computing resources:
(to be provided)
Project Lead:
Francesco Strozzi (@fstrozzi)