outerbounds / terraform-aws-metaflow

Deploy production-grade Metaflow cloud infrastructure on AWS
https://registry.terraform.io/modules/outerbounds/metaflow/aws/latest
Apache License 2.0

Task is stuck on RUNNABLE #91

Open EdIzaguirre opened 5 months ago

EdIzaguirre commented 5 months ago

Hi all,

As the title suggests, I am getting the infamous "Task stuck on RUNNABLE" line when I try to run this simple flow:

from metaflow import FlowSpec, step
import os
global_value = 5

class ProcessDemoFlow(FlowSpec):
    @step
    def start(self):
        global global_value
        global_value = 9
        print('process ID is', os.getpid())
        print('global_value is', global_value)
        self.next(self.end)

    @step
    def end(self):
        print('process ID is', os.getpid())
        print('global_value is', global_value)

if __name__ == '__main__':
    ProcessDemoFlow()

To provision the infrastructure, I used the minimal Terraform AWS template from the README. However, I had to make a few adjustments to resolve some errors (I can't be the only one who hit these...):

module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "5.5.3"
...

In the vpc module, I had to update the version number to the latest release, 5.5.3.

module "metaflow" {
  source  = "outerbounds/metaflow/aws"
  version = "0.12.0"

  resource_prefix = local.resource_prefix
  resource_suffix = local.resource_suffix

  enable_step_functions = false
  subnet1_id            = module.vpc.public_subnets[0]
  subnet2_id            = module.vpc.public_subnets[1]
  vpc_cidr_blocks       = [module.vpc.vpc_cidr_block]
  vpc_id                = module.vpc.vpc_id
  with_public_ip        = true
  db_engine_version     = 16
  db_instance_type      = "db.t3.small"

  tags = {
    "managedBy" = "terraform"
  }
}

In the metaflow module, I had to change module.vpc.vpc_cidr_blocks to [module.vpc.vpc_cidr_block] because I was getting an error saying that module.vpc.vpc_cidr_blocks didn't exist. I confirmed that the vpc module has no output by that name (no idea why it is in the template...). I also had to update the version number to 0.12.0. Finally, I got an error stating that the combination of Postgres, a db_engine_version of "11", and a db_instance_type of "db.t2.small" (the default values) is not allowed by AWS, so I set db_engine_version to 16 and db_instance_type to "db.t3.small".

I have run basic checks on Batch, ECS, and EC2, and everything appears to be connected and valid. What could be going wrong? I keep reading that the compute environment might be too limited for the task, but given that this is a very basic task, I don't think that is the issue. Is one of the modifications I made to the minimal template wrong?
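
For anyone repeating the Batch checks above, here is a minimal sketch of how the "is the compute environment too limited?" question could be checked in Python rather than by eye in the console. It assumes the input dict has the shape returned by boto3's `describe_compute_environments` call for AWS Batch. The function name `find_compute_env_problems`, the sample data, and the `requested_vcpus` parameter are hypothetical and only for illustration. A clean result does not rule out other causes of a job staying in RUNNABLE, such as networking or instance availability.

```python
from typing import Any


def find_compute_env_problems(response: dict[str, Any], requested_vcpus: int) -> list[str]:
    """Return one warning string per problem found in the compute environments.

    `response` is assumed to look like the result of
    boto3.client("batch").describe_compute_environments().
    """
    problems = []
    for env in response.get("computeEnvironments", []):
        name = env.get("computeEnvironmentName", "<unnamed>")

        # A disabled environment will not pick up jobs from its queue.
        if env.get("state") != "ENABLED":
            problems.append(f"{name}: state is {env.get('state')}, expected ENABLED")

        # An invalid environment usually carries an explanation in statusReason.
        if env.get("status") != "VALID":
            reason = env.get("statusReason", "no reason given")
            problems.append(f"{name}: status is {env.get('status')} ({reason})")

        # If the vCPU ceiling is below what the job asks for, it can never be placed.
        max_vcpus = env.get("computeResources", {}).get("maxvCpus")
        if max_vcpus is not None and max_vcpus < requested_vcpus:
            problems.append(
                f"{name}: maxvCpus is {max_vcpus}, but the job requests {requested_vcpus}"
            )
    return problems


# Hypothetical sample data; with real credentials this could instead come from:
#   import boto3
#   response = boto3.client("batch").describe_compute_environments()
sample_response = {
    "computeEnvironments": [
        {
            "computeEnvironmentName": "example-env",  # illustrative name
            "state": "ENABLED",
            "status": "VALID",
            "computeResources": {"minvCpus": 0, "maxvCpus": 16},
        }
    ]
}

problems = find_compute_env_problems(sample_response, requested_vcpus=1)
print(problems or "no obvious problems found")
```

If this reports nothing, as in my case where everything looks valid, the limits on the compute environment are probably not the cause and the problem is more likely elsewhere in the setup.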