open-constructs / aws-cdk-library

Community-Driven CDK Construct Library
Apache License 2.0

CloudFormation Orchestration in the CDK #2

Closed michanto closed 1 week ago

michanto commented 2 months ago

Introduction

CloudFormation has all the makings of an orchestration engine for the creation and distribution of artifacts, such as ML models. An ML model relies on code built by a build system (such as CodeBuild), followed by a model training step that requires an execution environment, such as Step Functions, Fargate, or AWS Batch. Lastly, the ML model needs to be distributed to consumers in a manner that isolates the consumer from the ML model tool chain, for example by being published to a Git (or S3-backed Git LFS) repository.

CDK custom resources provide the necessary mechanism for enabling Orchestration scenarios in the CDK. They provide up to one (1) hour of execution time, which can be extended (see LongRunningStepFunctionTask below) to four (4) days. There is a storage solution, in the form of S3, which allows artifacts to be partitioned by build id and tagged with other relevant provenance data. There is orchestration: dependencies between custom resources based on inputs (attributes from other custom resources), attributes (inputs for other custom resources), and outright construct dependencies. These orchestration mechanisms allow a user to define the order of operations for Orchestration tasks.
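To make the ordering idea concrete, here is a minimal sketch of how dependency edges alone imply an order of operations. It is not CDK or CloudFormation code; the `OrchestrationStep` type, the `executionOrder` function, and the build/train/publish step names are all hypothetical, chosen to mirror the ML example in the introduction.

```typescript
// Hypothetical model of a step: a name plus the steps whose attributes it consumes.
interface OrchestrationStep {
  name: string;
  dependsOn: string[];
}

// Depth-first topological sort: a step is emitted only after all of its dependencies.
function executionOrder(steps: OrchestrationStep[]): string[] {
  const byName = new Map<string, OrchestrationStep>();
  for (const step of steps) byName.set(step.name, step);

  const state = new Map<string, "visiting" | "done">();
  const order: string[] = [];

  const visit = (name: string): void => {
    const current = state.get(name);
    if (current === "done") return;
    if (current === "visiting") throw new Error(`Dependency cycle at ${name}`);
    const step = byName.get(name);
    if (!step) throw new Error(`Unknown step ${name}`);
    state.set(name, "visiting");
    for (const dep of step.dependsOn) visit(dep);
    state.set(name, "done");
    order.push(name);
  };

  for (const step of steps) visit(step.name);
  return order;
}
```

Declaring the steps in any order, for example publish (depends on train), train (depends on build), then build, still yields build, train, publish. A cycle is reported as an error rather than silently ordered.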

What the CDK is missing is classes that support Orchestration directly. My proposal is to fill in these gaps and turn the CDK into a powerful GitOps framework. Much of this work has already been written with an early version of JSII, and used extensively by me and my teams. What it needs is ownership by a team willing to harden this already battle-tested framework.

The missing Orchestration Constructs

The CDK is missing a few constructs that would allow Orchestration to become a regular part of the CDK user's toolkit. These include:

LambdaCustomResource

AwsCustomResource is missing a few features that would allow it to be used in more situations. LambdaCustomResource rectifies these issues, including:

LambdaTask

Uses LambdaCustomResource to create a Task that allows ANY Lambda to be used as a custom resource, and surfaces any return value as a resource attribute. Supports runAlways, which adds the build time (Date.now()) as a resource property to ensure that each build results in a new task execution.
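A rough sketch of the runAlways idea, assuming it works the way it is described above: CloudFormation only re-invokes a custom resource when its properties change, so stamping the build time into the properties forces a change on every build. The `RunAlwaysOptions` type, the `buildTaskProperties` function, and the `BuildTime` property name are hypothetical, not the actual LambdaTask API.

```typescript
// Hypothetical options for a task; only runAlways comes from the description above.
interface RunAlwaysOptions {
  payload: Record<string, string>;
  runAlways?: boolean;
}

// Builds the custom resource properties. The clock is injectable so the
// behaviour can be checked without depending on the real time.
function buildTaskProperties(
  options: RunAlwaysOptions,
  now: () => number = Date.now,
): Record<string, string> {
  const properties: Record<string, string> = { ...options.payload };
  if (options.runAlways) {
    // A value that differs on every build makes the properties differ,
    // which is what triggers a fresh execution of the task.
    properties.BuildTime = String(now());
  }
  return properties;
}
```

Without runAlways, two builds with the same payload produce identical properties and the task is not re-run; with it, every build looks like an update.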

StepFunctionTask

A provider-based custom resource that executes and monitors any StepFunction that can run in an hour or less. StepFunction outputs are flattened and filtered in the same way as AwsCustomResource responses (while supporting default attributes similar to LambdaCustomResource). The physical resource ID of a StepFunctionTask is the execution ID, which means the physical ID changes every time this Task runs. Also supports runAlways.

LongRunningStepFunctionTask

Strings together multiple StepFunctionTasks to allow for StepFunction execution times of up to 4 days (CFN limitation). One StepFunctionTask custom resource is created for each hour of desired execution time. If a StepFunction exits before all the StepFunctionTasks execute, the remainder will fast-fail or fast-succeed, resulting in a good user experience.
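The chaining can be pictured with a small simulation. This is only a sketch under assumptions: the one-hour window and four-day cap come from the text above, while `chainLength`, `simulateChain`, and the three result labels are hypothetical names, and the real construct may hand off between tasks differently.

```typescript
const TASK_WINDOW_MINUTES = 60; // one StepFunctionTask per hour
const MAX_TOTAL_HOURS = 4 * 24; // four-day limit mentioned above

type LinkResult = "WAITED_FULL_WINDOW" | "OBSERVED_COMPLETION" | "FAST_EXIT";

// Number of chained tasks needed to cover the desired execution time.
function chainLength(desiredHours: number): number {
  if (!(desiredHours > 0) || desiredHours > MAX_TOTAL_HOURS) {
    throw new RangeError(`desiredHours must be in (0, ${MAX_TOTAL_HOURS}]`);
  }
  return Math.ceil(desiredHours);
}

// What each task in the chain would see if the execution ends at finishesAtMinute.
function simulateChain(links: number, finishesAtMinute: number): LinkResult[] {
  const results: LinkResult[] = [];
  for (let i = 0; i < links; i++) {
    const windowStart = i * TASK_WINDOW_MINUTES;
    const windowEnd = windowStart + TASK_WINDOW_MINUTES;
    if (finishesAtMinute <= windowStart) {
      results.push("FAST_EXIT"); // execution already over: fast-succeed or fast-fail
    } else if (finishesAtMinute <= windowEnd) {
      results.push("OBSERVED_COMPLETION"); // execution ended inside this window
    } else {
      results.push("WAITED_FULL_WINDOW"); // still running: next task takes over
    }
  }
  return results;
}
```

For a budget of 3.5 hours, `chainLength` gives four tasks. If the execution ends after 90 minutes, the first task waits its full hour, the second observes the completion, and the remaining two exit immediately.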

S3FileReader

Reads, flattens, and filters the contents of any S3 JSON file so they can be exposed as Custom Resource Attributes. This allows cross-account (or within-account) communication via shared S3 buckets. Allows LambdaTask and StepFunctionTask to communicate over S3 rather than having to return the attributes.
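As an illustration of "flattens and filters", here is a minimal sketch that turns nested JSON into dot-separated attribute names and then keeps only selected paths. The dotted-key convention follows what AwsCustomResource does with API responses; the `flattenJson` and `filterAttributes` functions and the sample document are hypothetical and not the actual S3FileReader implementation.

```typescript
type Json = string | number | boolean | null | Json[] | { [key: string]: Json };

// Flattens nested objects and arrays into dot-separated keys with string values.
function flattenJson(
  value: Json,
  prefix = "",
  out: Record<string, string> = {},
): Record<string, string> {
  if (value !== null && typeof value === "object") {
    const entries: Array<readonly [string, Json]> = Array.isArray(value)
      ? value.map((item, index): readonly [string, Json] => [String(index), item])
      : Object.entries(value);
    for (const [key, child] of entries) {
      flattenJson(child, prefix ? `${prefix}.${key}` : key, out);
    }
  } else {
    out[prefix] = String(value);
  }
  return out;
}

// Keeps only keys that equal one of the given paths or sit underneath it.
function filterAttributes(
  flat: Record<string, string>,
  paths: string[],
): Record<string, string> {
  const kept: Record<string, string> = {};
  for (const [key, val] of Object.entries(flat)) {
    if (paths.some((p) => key === p || key.startsWith(`${p}.`))) {
      kept[key] = val;
    }
  }
  return kept;
}
```

Given a file containing `{"model": {"uri": "s3://example-bucket/m", "metrics": {"auc": 0.9}}, "tags": ["a", "b"]}`, flattening yields keys such as `model.uri`, `model.metrics.auc`, `tags.0`, and `tags.1`. Filtering on `model` then exposes only the first two as attributes.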

S3MetadataReader

Reads, flattens, and filters the metadata for any S3 file so they can be exposed as Custom Resource Attributes. This allows cross-account (or within-account) communication via shared S3 buckets. Allows LambdaTask and StepFunctionTask to communicate over S3 rather than having to return the attributes.

GitCommitFromS3

Takes the contents of an S3 bucket and prefix and commits them to a Git repository. Optionally tags the commit. Allows for plug-ins that provide Git SSH credentials. Assuming the Git repo is part of a consumer's git-ops workflow, this completes the git-ops orchestration framework.

Support Classes

There is a set of support classes that enable the above. Details to come.

michanto commented 2 months ago

ConstructTreeSearch is one of the support classes. See issue 1 (https://github.com/open-constructs/aws-cdk-library/issues/1) for details.

mbonig commented 1 month ago

This sounds like a fascinating L3 to build. This: "My proposal is to fill in these gaps and turn the CDK into a powerful GitOps framework.", makes me squirm a little in my seat, but I can't really say why. Is there a place I can see what your team has built so I can better understand what you're proposing overall?

michanto commented 1 week ago

I am moving this over to my cdk-orchestration package. Maybe this will get moved into open-constructs at some point, but I think putting it out as a usable library is the best path forward.