ministryofjustice / analytical-platform

Analytical Platform • This repository is defined and managed in Terraform
https://docs.analytical-platform.service.justice.gov.uk
MIT License

🚚 Migrate Airflow workloads to APC #4490

Open jacobwoffenden opened 5 months ago

jacobwoffenden commented 5 months ago

User Story

As an Analytical Platform engineer
I want (current) Airflow jobs to schedule on APC
So that we can fully retire the Airflow EKS clusters

Value / Purpose

Airflow EKS clusters are partially managed in Terraform, pinned to IMDSv1, use kube2iam, and have no observability 😭

Migrating these workloads to APC will allow us to retire more clusters and make use of the newer capabilities in EKS and the supported tooling.

Useful Contacts

@jacobwoffenden

User Types

Platform Engineering

Hypothesis

If we... [do a thing] Then... [this will happen]

Proposal

Migrate Airflow workloads to APC

Additional Information

This was partly started in DPAT (https://github.com/ministryofjustice/analytical-platform/issues/2843) but was never completed

Blocked by:

Definition of Done

jacobwoffenden commented 5 months ago

Blocked while Airflow component is being worked on

jacobwoffenden commented 5 months ago

Comms sent to ask-data-engineering with sheet to fill in https://docs.google.com/spreadsheets/d/1B8DOsSgnxGV1FjRv8dLv0wqDMo2RiiMqedFogLBpQEQ

jacobwoffenden commented 5 months ago

Moving back to blocked while IRSA is being worked on

jacobwoffenden commented 5 months ago

I've cut a new release of the cross-account-ecr action and published a new version of template-airflow-python, which uses the new v1 action and correctly adds the APC accounts to the repository policy.

I then updated the example DAG to use the new image version and the APC dev context (https://github.com/moj-analytical-services/airflow/pull/3613). Below is the output from running it: the task fails because it can't use IRSA yet, but the image still pulls.

vscode ➜ /workspaces/modernisation-platform-environments (main) [ aws: analytical-platform-compute-development:modernisation-platform-sandbox@eu-west-2 ] [ context: arn:aws:eks:eu-west-2:381491960855:cluster/analytical-platform-compute-development ] $ kubectl --namespace airflow get events                                     
LAST SEEN   TYPE     REASON      OBJECT                                        MESSAGE
59s         Normal   Scheduled   pod/task-1-cecda48866f94f90a3357d96206822b6   Successfully assigned airflow/task-1-cecda48866f94f90a3357d96206822b6 to ip-10-200-33-237.eu-west-2.compute.internal
58s         Normal   Pulling     pod/task-1-cecda48866f94f90a3357d96206822b6   Pulling image "189157455002.dkr.ecr.eu-west-1.amazonaws.com/template-airflow-python:v0.4"
53s         Normal   Pulled      pod/task-1-cecda48866f94f90a3357d96206822b6   Successfully pulled image "189157455002.dkr.ecr.eu-west-1.amazonaws.com/template-airflow-python:v0.4" in 5.264s (5.264s including waiting). Image size: 76701464 bytes.
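
For readers unfamiliar with cross-account pulls, here is a minimal sketch of the kind of ECR repository policy that lets another account pull an image. It is not the cross-account-ecr action's actual implementation: the function name and the statement ID are hypothetical, and the only account ID used is the one visible in the cluster ARN above.

```python
import json

# ECR actions a principal needs in order to pull an image.
PULL_ACTIONS = [
    "ecr:BatchCheckLayerAvailability",
    "ecr:BatchGetImage",
    "ecr:GetDownloadUrlForLayer",
]


def build_pull_policy(account_ids):
    """Return a repository policy document allowing the given accounts to pull.

    Hypothetical helper for illustration only.
    """
    # De-duplicate and sort so the generated policy is stable between runs.
    principals = [
        f"arn:aws:iam::{account_id}:root"
        for account_id in sorted(set(account_ids))
    ]
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowCrossAccountPull",  # hypothetical statement ID
                "Effect": "Allow",
                "Principal": {"AWS": principals},
                "Action": PULL_ACTIONS,
            }
        ],
    }


if __name__ == "__main__":
    # 381491960855 is the account in the cluster ARN shown in the output above.
    print(json.dumps(build_pull_policy(["381491960855"]), indent=2))
```

A policy of this shape on the repository is what allows the `Pulled` event above to succeed even though the task itself cannot yet assume a role.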

jacobwoffenden commented 5 months ago

APC OIDC added to APDP

jacobwoffenden commented 5 months ago

We've tested @AntFMoJ's toy DAG on APC with cross-account IRSA and it's working 🎉

Unfortunately we are now blocked in discussion with Modernisation Platform about reuse of network ranges.
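
As background on the cross-account IRSA setup tested above, the sketch below builds the general shape of an IAM role trust policy that lets a Kubernetes service account assume a role through a cluster's OIDC provider. This is an illustration under assumptions, not the configuration used here: the account ID, OIDC issuer URL, and service account name are placeholders, and only the `airflow` namespace comes from this thread.

```python
import json


def build_irsa_trust_policy(account_id, oidc_issuer, namespace, service_account):
    """Return a trust policy for a role assumed via IRSA.

    Hypothetical helper: account_id is the account where the cluster's OIDC
    provider is registered and where the role lives.
    """
    # The provider ARN and the condition keys use the issuer without its scheme.
    issuer_host = oidc_issuer.split("://", 1)[-1]
    provider_arn = f"arn:aws:iam::{account_id}:oidc-provider/{issuer_host}"
    subject = f"system:serviceaccount:{namespace}:{service_account}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Federated": provider_arn},
                "Action": "sts:AssumeRoleWithWebIdentity",
                "Condition": {
                    "StringEquals": {
                        f"{issuer_host}:aud": "sts.amazonaws.com",
                        f"{issuer_host}:sub": subject,
                    }
                },
            }
        ],
    }


if __name__ == "__main__":
    # All values below are placeholders except the "airflow" namespace.
    policy = build_irsa_trust_policy(
        account_id="111122223333",
        oidc_issuer="https://oidc.eks.eu-west-2.amazonaws.com/id/EXAMPLE",
        namespace="airflow",
        service_account="example-service-account",
    )
    print(json.dumps(policy, indent=2))
```

Restricting the `sub` claim to one namespace and service account means only pods running as that service account can assume the role.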

jacobwoffenden commented 5 months ago

Updates:

jacobwoffenden commented 4 months ago

Moving to blocked while we figure out how to proceed with Direct Connect.

jacobwoffenden commented 4 months ago

Meeting with HMCTS' network architect on 11/07/24 @ 11:30 BST

darren1988 commented 4 months ago

Escalated to HMCTS head of DTS people and profession on 24th July 2024. Our ask has now been raised with the lead PlatOps in HMCTS. Currently awaiting a response. If there is no movement by the end of the week, we will escalate to Martyn.

jacobwoffenden commented 3 months ago

Meeting held with DTS PlatOps on 5/8/24 and the issue has been escalated. Waiting for a meeting to be arranged with HMCTS stakeholders.

darren1988 commented 3 months ago

Meeting with HMCTS arranged for 5/9/24 to discuss scope of work

jacobwoffenden commented 1 month ago

Had a meeting with HMCTS; they are going to put us in touch with CloudGateway

jacobwoffenden commented 1 month ago

Sent chaser email on 15/10 and 22/10

jacobwoffenden commented 3 weeks ago

Meeting arranged with CloudGateway for 4/11

jacobwoffenden commented 2 weeks ago

VPN endpoint data sent to CloudGateway, awaiting response

jacobwoffenden commented 1 week ago

Updated VPN configuration parameters and sent over. Apparently we are waiting on commercials too.

jacobwoffenden commented 1 week ago

Pencilled some time in on Thursday 21/11 to bridge with CGW

jacobwoffenden commented 1 day ago

Nonprod was cut over on 21/11 🎉 Prod is being arranged for 27/11

jacobwoffenden commented 22 hours ago

Prod cutover is now 2/12; maintenance notice posted: https://status.analytical-platform.service.justice.gov.uk/posts/details/PK6KZ5V