ministryofjustice / analytical-platform

Analytical Platform • This repository is defined and managed in Terraform
https://docs.analytical-platform.service.justice.gov.uk
MIT License
12 stars 4 forks source link

✨ Create athena workgroups for Airflow workflows #3704

Closed SoumayaMauthoorMOJ closed 7 months ago

SoumayaMauthoorMOJ commented 8 months ago

Describe the feature request.

Create Athena workgroups for Airflow athena-based workflows using IAC with 1TB query limit and following tags based on data engineering tagging convention:

Tag Value
owner Data Engineering:dataengineering@digital.justice.gov.uk
environment-name prod
application airflow
is-production True
Name airflow-prod-workgroup
Tag Value
owner Data Engineering:dataengineering@digital.justice.gov.uk
environment-name dev
application airflow
is-production False
Name airflow-dev-workgroup
Tag Value
owner Data Engineering:dataengineering@digital.justice.gov.uk
environment-name prod
application airflow
is-production True
Name airflow-prod-workgroup-hmcts
business_unit HMCTS
Tag Value
owner Data Engineering:dataengineering@digital.justice.gov.uk
environment-name dev
application airflow
is-production False
Name airflow-dev-workgroup-hmcts
business_unit HMCTS

Similar to here (but written in Pulumi)

Ideally we would create workgroups per domain, but this will require more effort in terms of coming up with appropriate domains and helping users assign the appropriate domains. So I suggest we should just stick with these workgroups for now.

There will be a separate ticket to see if there's a way of setting a default workgroup via Airflow-wide environment variables

Describe the context.

Athena costs spiked in February:

https://mojdt.slack.com/archives/CBVUV2613/p1710165396335149

It was difficult to track the culprit because most people and airflow use the primary workgroup which means all costs is lumped into one bucket

Value / Purpose

Using named workgroups with tags allows you to have better Athena cost granularity and to set limits.

This is an interim solution until we build Athena monitoring solution to track costs at the per user level.

User Types

No response

Timelines

Ideally as quickly as possible, but more realistically 1st week of April

Definition of Done

julialawrence commented 8 months ago

https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/athena_workgroup

Ed-Bajo commented 7 months ago

@SoumayaMauthoorMOJ - Note that the proposed solution (option 3 of #3777 ) would not be implemented by the team. We can have a chat on an alternative solution around creating Terraform.

SoumayaMauthoorMOJ commented 7 months ago

Sure can you set something up for Monday 29th April?

jacobwoffenden commented 7 months ago

Hi @SoumayaMauthoorMOJ!

Can you clarify what the result bucket will be please

Thanks!

SoumayaMauthoorMOJ commented 7 months ago

@jacobwoffenden s3://mojap-athena-query-dump/{workgroup} please :-) (s3://dbt-query-dump/ is specific to create-a-derived-table)