outerbounds / terraform-aws-metaflow

Deploy production-grade Metaflow cloud infrastructure on AWS
https://registry.terraform.io/modules/outerbounds/metaflow/aws/latest
Apache License 2.0

Allowing multiple batch queues #50

Open byronmamamoney opened 1 year ago

byronmamamoney commented 1 year ago

Hi team, would it be possible for the compute environment ARN, created in batch.tf of your "computation" sub-module, to be returned as another output of the "metaflow" module? It could then be used to provision another batch queue and link it to the existing compute environment. Alternatively, the Terraform module could accept a list input to generate multiple queues. That way, when running a state machine, one could run python flow.py --with batch:queue=whereyouwantittorun, which would allow queues to run concurrently against the same compute environment.

The policy attached to the role that Step Functions assumes will also have to allow job submission to the additional batch queues. Currently this policy only allows Step Functions to submit jobs to the one default queue. The policy sits in the "step-functions" module, in the file iam-step-functions.tf; that is where it would need to allow a list of batch queue ARNs.

Cheers, Byron
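
For illustration, the request above could look roughly like the following on the consumer side. This is only a sketch under stated assumptions: the variables `compute_environment_arn`, `batch_job_queue_arns` and `batch_job_definition_arns`, the queue name, and the resource labels are all hypothetical and are not part of the module today. The `compute_environment_order` block assumes a recent AWS provider; older provider versions take a `compute_environments` list instead.

```hcl
# Hypothetical input: would come from the module output requested in this issue.
variable "compute_environment_arn" {
  type = string
}

# Hypothetical inputs for the Step Functions policy.
variable "batch_job_queue_arns" {
  type = list(string)
}

variable "batch_job_definition_arns" {
  type = list(string)
}

# An additional queue attached to the existing compute environment.
resource "aws_batch_job_queue" "extra" {
  name     = "metaflow-extra-queue" # hypothetical name
  state    = "ENABLED"
  priority = 1

  # Older AWS provider versions use: compute_environments = [var.compute_environment_arn]
  compute_environment_order {
    order               = 1
    compute_environment = var.compute_environment_arn
  }
}

# Policy statement letting Step Functions submit jobs to every listed queue,
# not only the default one. batch:SubmitJob is scoped to both job queues and
# job definitions, so both kinds of ARN appear in resources.
data "aws_iam_policy_document" "sfn_submit_job" {
  statement {
    effect  = "Allow"
    actions = ["batch:SubmitJob"]
    resources = concat(
      var.batch_job_queue_arns,
      [aws_batch_job_queue.extra.arn],
      var.batch_job_definition_arns,
    )
  }
}
```

A flow could then target the new queue with python flow.py --with batch:queue=metaflow-extra-queue, assuming that queue name.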

freespace commented 1 year ago

Exposing the compute environment ARN would also be useful for creating additional compute environments that differ in instance types, and using them with job queues for flows whose resource requirements differ significantly.

@byronmamamoney as an interim workaround, the compute environment ARN can be retrieved as follows:

  1. Use string manipulation to extract the job queue name from module.metaflow-computation.METAFLOW_BATCH_JOB_QUEUE.
  2. Use that job queue name to instantiate data "aws_batch_job_queue" "metaflow_queue".
  3. The ARN of the compute environment is then data.aws_batch_job_queue.metaflow_queue.compute_environment_order.0.compute_environment
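
The three steps above might be written roughly as below. This is a sketch, not a tested configuration: it assumes METAFLOW_BATCH_JOB_QUEUE holds a job queue ARN of the form arn:aws:batch:<region>:<account>:job-queue/<name>, and the local names are made up for illustration.

```hcl
locals {
  # Step 1: take the last "/"-separated segment of the queue ARN as the queue name.
  # If the output is already a bare name, this returns it unchanged.
  metaflow_queue_name = reverse(split("/", module.metaflow-computation.METAFLOW_BATCH_JOB_QUEUE))[0]
}

# Step 2: look up the existing job queue by name.
data "aws_batch_job_queue" "metaflow_queue" {
  name = local.metaflow_queue_name
}

locals {
  # Step 3: ARN of the first compute environment attached to the queue.
  metaflow_compute_environment_arn = data.aws_batch_job_queue.metaflow_queue.compute_environment_order[0].compute_environment
}
```

The index syntax `[0]` is equivalent to the `.0.` form used in step 3.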