ministryofjustice / analytical-platform

Analytical Platform • This repository is defined and managed in Terraform
https://docs.analytical-platform.service.justice.gov.uk
MIT License

🚚 Migrate Airflow workloads to APC #4490

Open jacobwoffenden opened 4 weeks ago

jacobwoffenden commented 4 weeks ago

User Story

As an Analytical Platform engineer
I want (current) Airflow jobs to schedule on APC
So that we can fully retire the Airflow EKS clusters

Value / Purpose

Airflow EKS clusters are only partially managed in Terraform, are pinned to IMDSv1, use kube2iam, and have no observability 😭

Migrating these workloads to APC will allow us to retire more clusters and make use of the newer capabilities in EKS and the supported tooling.

Useful Contacts

@jacobwoffenden

User Types

Platform Engineering

Hypothesis

If we... [do a thing] Then... [this will happen]

Proposal

Migrate Airflow workloads to APC

Additional Information

This was partly started in DPAT (https://github.com/ministryofjustice/analytical-platform/issues/2843) but was never completed.

Blocked by:

Definition of Done

jacobwoffenden commented 4 weeks ago

Blocked while Airflow component is being worked on

jacobwoffenden commented 3 weeks ago

Comms sent to ask-data-engineering with a sheet to fill in: https://docs.google.com/spreadsheets/d/1B8DOsSgnxGV1FjRv8dLv0wqDMo2RiiMqedFogLBpQEQ

jacobwoffenden commented 3 weeks ago

Moving back to blocked while IRSA is being worked on

jacobwoffenden commented 3 weeks ago

I've cut a new release of the cross-account-ecr action and published a new version of template-airflow-python, which uses the new v1 action and correctly adds the APC accounts to the repository policy.

I then updated the example DAG to use the new image version and the APC dev context (https://github.com/moj-analytical-services/airflow/pull/3613). Below is the output from running it. The task itself fails because it can't use IRSA yet, but the image still pulls.

vscode ➜ /workspaces/modernisation-platform-environments (main) [ aws: analytical-platform-compute-development:modernisation-platform-sandbox@eu-west-2 ] [ context: arn:aws:eks:eu-west-2:381491960855:cluster/analytical-platform-compute-development ] $ kubectl --namespace airflow get events
LAST SEEN   TYPE     REASON      OBJECT                                        MESSAGE
59s         Normal   Scheduled   pod/task-1-cecda48866f94f90a3357d96206822b6   Successfully assigned airflow/task-1-cecda48866f94f90a3357d96206822b6 to ip-10-200-33-237.eu-west-2.compute.internal
58s         Normal   Pulling     pod/task-1-cecda48866f94f90a3357d96206822b6   Pulling image "189157455002.dkr.ecr.eu-west-1.amazonaws.com/template-airflow-python:v0.4"
53s         Normal   Pulled      pod/task-1-cecda48866f94f90a3357d96206822b6   Successfully pulled image "189157455002.dkr.ecr.eu-west-1.amazonaws.com/template-airflow-python:v0.4" in 5.264s (5.264s including waiting). Image size: 76701464 bytes.
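
For context, a cross-account pull like the one above depends on the ECR repository policy allowing the pulling account. The following is a minimal sketch of what such a policy could look like, not the actual output of the cross-account-ecr action. The function name and `Sid` are hypothetical, and the account ID is simply the one from the cluster ARN in the prompt above.

```python
import json

# ECR actions needed to pull an image (these are real IAM action names).
PULL_ACTIONS = [
    "ecr:BatchCheckLayerAvailability",
    "ecr:BatchGetImage",
    "ecr:GetDownloadUrlForLayer",
]


def build_pull_policy(account_ids):
    """Build a repository policy document allowing the given accounts to pull.

    Hypothetical helper for illustration only.
    """
    # De-duplicate and sort so the generated policy is stable between runs.
    principals = [f"arn:aws:iam::{a}:root" for a in sorted(set(account_ids))]
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowCrossAccountPull",  # hypothetical statement ID
                "Effect": "Allow",
                "Principal": {"AWS": principals},
                "Action": PULL_ACTIONS,
            }
        ],
    }


if __name__ == "__main__":
    # Account ID taken from the cluster ARN shown in the output above.
    print(json.dumps(build_pull_policy(["381491960855"]), indent=2))
```

A policy document of this shape could then be applied to the repository with whatever tooling manages it (Terraform, a GitHub Action, or the AWS CLI).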

jacobwoffenden commented 2 weeks ago

APC OIDC added to APDP
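
As a rough illustration of what registering the APC OIDC provider in APDP enables: an IAM role in the APDP account can then trust a Kubernetes service account on the APC cluster via IRSA. Below is a minimal sketch of such a trust policy. The account ID, OIDC issuer ID, and service account name are hypothetical placeholders; only the `airflow` namespace comes from this thread.

```python
import json


def build_irsa_trust_policy(account_id, oidc_issuer, namespace, service_account):
    """Build an IAM trust policy for IRSA (hypothetical helper for illustration).

    account_id is the account holding the role and the registered OIDC provider;
    oidc_issuer is the cluster's issuer host and path without the https:// prefix.
    """
    provider_arn = f"arn:aws:iam::{account_id}:oidc-provider/{oidc_issuer}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Federated": provider_arn},
                "Action": "sts:AssumeRoleWithWebIdentity",
                "Condition": {
                    "StringEquals": {
                        # Restrict the role to tokens issued for STS...
                        f"{oidc_issuer}:aud": "sts.amazonaws.com",
                        # ...and to one specific service account.
                        f"{oidc_issuer}:sub": (
                            f"system:serviceaccount:{namespace}:{service_account}"
                        ),
                    }
                },
            }
        ],
    }


if __name__ == "__main__":
    policy = build_irsa_trust_policy(
        account_id="111111111111",  # placeholder, not a real APDP account ID
        oidc_issuer="oidc.eks.eu-west-2.amazonaws.com/id/EXAMPLE",  # placeholder
        namespace="airflow",
        service_account="example-dag",  # hypothetical service account name
    )
    print(json.dumps(policy, indent=2))
```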

jacobwoffenden commented 2 weeks ago

We've tested @AntFMoJ's toy DAG on APC with cross-account IRSA and it's working 🎉

Unfortunately we are now blocked in discussion with Modernisation Platform about reuse of network ranges.

jacobwoffenden commented 1 week ago

Updates:

jacobwoffenden commented 4 days ago

Moving to blocked while we figure out how to proceed with Direct Connect.