outerbounds / terraform-aws-metaflow

Deploy production-grade Metaflow cloud infrastructure on AWS
https://registry.terraform.io/modules/outerbounds/metaflow/aws/latest
Apache License 2.0
55 stars Β· 45 forks

Doesn't work with provided examples (examples/minimal) #82

Open dmumpuu opened 9 months ago

dmumpuu commented 9 months ago

Steps to reproduce:

  1. Clone the repository and `cd terraform-aws-metaflow/examples/minimal`
  2. Set `locals.resource_prefix = "test-metaflow"` in `minimal_example.tf`
  3. Run `terraform apply` and wait until it finishes
  4. Run `aws apigateway get-api-key --api-key <api-key> --include-value | grep value` and paste the result into `metaflow_profile.json`
  5. Import the Metaflow configuration: `metaflow configure import metaflow_profile.json`
  6. Run `python mftest.py run`
`mftest.py`:

```python
from metaflow import FlowSpec, step, batch, resources

class MfTest(FlowSpec):
    @step
    def start(self):
        print("Started")
        self.next(self.run_batch)

    @batch
    @resources(cpu=1, memory=1_000)
    @step
    def run_batch(self):
        print("Hello from @batch")
        self.next(self.end)

    @step
    def end(self):
        print("Finished")

if __name__ == '__main__':
    MfTest()
```

The run never finishes: the AWS Batch job created in the job queue stays in status RUNNABLE forever.
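A job stuck in RUNNABLE usually means AWS Batch cannot place it on any compute: typical causes are a compute environment that is DISABLED or INVALID, a `maxvCpus` of 0, or launched instances that never join the underlying ECS cluster (subnet, security group, or IAM instance profile problems). A first triage step is to inspect `aws batch describe-compute-environments` output; the helper below is a hypothetical sketch that scans that documented response shape for environments that cannot run jobs.

```python
def unhealthy_environments(response: dict) -> list[str]:
    """Scan parsed `aws batch describe-compute-environments` output and
    return human-readable reasons why each environment cannot run jobs."""
    problems = []
    for env in response.get("computeEnvironments", []):
        name = env.get("computeEnvironmentName", "<unnamed>")
        if env.get("state") != "ENABLED":
            # Disabled environments never pick up jobs from their queues.
            problems.append(f"{name}: state={env.get('state')}")
        elif env.get("status") != "VALID":
            # INVALID usually points at IAM/VPC misconfiguration; the
            # statusReason field carries the service's explanation.
            problems.append(
                f"{name}: status={env.get('status')} ({env.get('statusReason', '')})"
            )
        elif env.get("computeResources", {}).get("maxvCpus", 0) == 0:
            # No capacity can ever be launched under this cap.
            problems.append(f"{name}: maxvCpus=0")
    return problems
```

For example, pipe `aws batch describe-compute-environments` to a file, `json.load` it, and pass the result to this helper. If every environment reports healthy, the next place to look is whether the launched EC2 instances actually register with the ECS cluster backing the environment.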

Also tried with `outerbounds/metaflow/aws` version `0.10.1` and `terraform-aws-modules/vpc/aws` version `5.1.2`.

Generated `metaflow_profile.json`:

```json
{
  "METAFLOW_BATCH_JOB_QUEUE": "arn:aws:batch:::job-queue/test-metaflow-",
  "METAFLOW_DATASTORE_SYSROOT_S3": "s3://test-metaflow-s3-/metaflow",
  "METAFLOW_DATATOOLS_S3ROOT": "s3://test-metaflow-s3-/data",
  "METAFLOW_DEFAULT_DATASTORE": "s3",
  "METAFLOW_DEFAULT_METADATA": "service",
  "METAFLOW_ECS_S3_ACCESS_IAM_ROLE": "arn:aws:iam:::role/test-metaflow-batch_s3_task_role-",
  "METAFLOW_EVENTS_SFN_ACCESS_IAM_ROLE": "",
  "METAFLOW_SERVICE_AUTH_KEY": ,
  "METAFLOW_SERVICE_INTERNAL_URL": "http://test-metaflow-nlb--.elb..amazonaws.com/",
  "METAFLOW_SERVICE_URL": "https://.execute-api..amazonaws.com/api/",
  "METAFLOW_SFN_DYNAMO_DB_TABLE": "",
  "METAFLOW_SFN_IAM_ROLE": "",
  "METAFLOW_SFN_STATE_MACHINE_PREFIX": "test-metaflow-"
}
```
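Assuming the empty fields above are not just redaction, the ARNs with empty region and account segments (`arn:aws:batch:::…`) and the URLs with empty hostname labels (`.execute-api..amazonaws.com`) suggest Terraform outputs that never interpolated, which would also explain Batch jobs that can never be scheduled. A quick sanity check before importing the profile can catch this; `find_suspicious` below is a hypothetical helper using simple heuristics (empty strings, ARNs with empty components, hostnames with empty labels).

```python
import re

def find_suspicious(profile: dict) -> list[str]:
    """Return profile keys whose values look unresolved or redacted."""
    bad = []
    for key, value in profile.items():
        if not isinstance(value, str) or not value.strip():
            bad.append(key)  # empty or non-string value
        elif value.startswith("arn:") and "::" in value.replace("arn:aws", "", 1):
            bad.append(key)  # ARN with an empty region or account field
        elif re.search(r"//\.|\.\.|--", value):
            bad.append(key)  # hostname with empty labels or doubled separators
    return bad
```

Note that the profile as pasted is not even valid JSON (the redacted `METAFLOW_SERVICE_AUTH_KEY` has no value), so `json.load` would fail on the raw file before any such check runs.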
vfilter commented 4 months ago

Edit: It's even worse, the whole thing now cannot be destroyed with the Terraform CLI, so I have to go in and delete the resources manually. πŸ‘Ž

The examples are outdated and don't work. I tried the eks_argo Terraform example, and it failed with:

```
β”‚ Warning: Argument is deprecated
β”‚
β”‚   with module.metaflow-datastore.aws_s3_bucket.this,
β”‚   on .terraform/modules/metaflow-datastore/modules/datastore/s3.tf line 1, in resource "aws_s3_bucket" "this":
β”‚    1: resource "aws_s3_bucket" "this" {
β”‚
β”‚ Use the aws_s3_bucket_server_side_encryption_configuration resource instead
β”‚
β”‚ (and one more similar warning elsewhere)
β•΅
β•·
β”‚ Error: creating Lambda Function (metaflowdb_migrateir9nhhph): operation error Lambda: CreateFunction, https response error StatusCode: 400, RequestID: XXX, InvalidParameterValueException: The runtime parameter of python3.7 is no longer supported for creating or updating AWS Lambda functions. We recommend you use the new runtime (python3.12) while creating or updating functions.
β”‚
β”‚   with module.metaflow-metadata-service.aws_lambda_function.db_migrate_lambda,
β”‚   on .terraform/modules/metaflow-metadata-service/modules/metadata-service/lambda.tf line 115, in resource "aws_lambda_function" "db_migrate_lambda":
β”‚  115: resource "aws_lambda_function" "db_migrate_lambda" {
```

Also, the EKS version in the examples is outdated and has been unsupported since March 2024. I understand that these are just first steps, but it's still a bit disappointing. We built our own Metaflow deployment with AWS CDK, but CDK has its own issues and AWS Step Functions is excruciatingly slow, so I was really hoping for some speed improvements from Argo + k8s + Terraform, both for deploying the infrastructure and for deploying workflows.