ministryofjustice / find-moj-data

Find MOJ data service • This repository is defined and managed in Terraform
MIT License
5 stars 0 forks source link

Spike: options for populating Primary Key metadata from CaDeT #417

Closed LavMatt closed 2 months ago

LavMatt commented 3 months ago

Firstly we need to understand the cadet guys intended plan (if they have even settle don one yet) or preferred option for recording primary key information. This is something they have voiced that they would like represented within the catalogue though. Soumaya would be good to talk to initially

Following a conversation understanding their options we should explore how they could be integrated into datahub (in the first instance) - How well do they work with native dbt ingestion.

LavMatt commented 2 months ago

DMET Approach

the planned approach from DMET for primary keys in create-a-derived-table is to either:

with the contract constraint taking precedence over anything held in the glue catalog

LavMatt commented 2 months ago

Production example of dbt contraints in cadet

electronic monitoring tables have begun to use contracts with constraints - see primary key example here

The manifest file does capture the primary_key constraint.

However, the datahub dbt source ingestion does not appear to look for or ingest this metadata.

The deployed tables do not have this information written to glue by default, although this is potentially possible by using persist docs

SoumayaMauthoorMOJ commented 2 months ago

Note that writing direct to glue catalog custom properties a key value pair of {"primary_key": "some_column"} is applicable for any solution, not just create-a-derived-table. The preferred method for create-a-derived-table is to use a contract.

LavMatt commented 2 months ago

https://github.com/moj-analytical-services/create-a-derived-table/pull/1787 this PR is set to be merged next week - can then test out of the dbt catalog file picks up any of custom properties from glue

LavMatt commented 2 months ago

Above PR has been merged but it seems that to work it is dependent on the upgrade to 1.8.0 which has not yet happened - Will move this to blocked until the upgrade is complete

LavMatt commented 2 months ago

cadet also has some models in production where a primary_key tag has been added to column. This does propagate through to datahub via the ingestion as a tag for the column but is not recognised as the primary_key property of the column

see cadet: https://github.com/moj-analytical-services/create-a-derived-table/blob/46564b72e834b4be1ab6746dadf1d9db83aca70a/mojap_derived_tables/models/finance/lookup_finance_stg/lookup_finance.yml#L8

see datahub: https://datahub-catalogue-prod.apps.live.cloud-platform.service.justice.gov.uk/dataset/urn:li:dataset:(urn:li:dataPlatform:dbt,cadet.awsdatacatalog.lookup_finance_stg.base_account_codes,PROD)/Schema?is_lineage_mode=false&schemaFilter=

LavMatt commented 2 months ago

dbt upgrade PR was merged today 29-Jul-2024 - This will mean we can resume looking at this from 30 Jul once prod models have been deployed again using upgraded package

UPDATE 30 Jul- The upgrade has caused scheduled deployments to prod to fail - This is not just a catalogue issue as it means the actual tables didn't deploy to aws as expected.

It does mean that the files we use to ingest from were not created either though

LavMatt commented 2 months ago

issues with the update have been resolved and models have deployed with the upgraded dbt version. Table proeprties that are written to models as part of the meta key where persist_docs is set to true (example https://github.com/moj-analytical-services/create-a-derived-table/pull/1787/files) are now also written to glue as table properties.

Unfortunately the dbt docs action has failed again because of a source not existing in aws - This should be sorted by tomorrow though so we can test ingesting from the new catalog file

LavMatt commented 2 months ago

Latest datahub dbt ingestion code does not capture the data even if it's recorded in dbt or in glue as custom table properties - via the catalog or manifest file. In fact the catalog file does not pick up any extra properties loaded to glue

Recommendation

It doesn't look like it would be too much effort to contribute to datahub's dbt ingestion code to get primary key metadata.

Adding code to the get_columns functions in dbt_core, here to look for the primary_key constraint of a column from the manifest isn't too difficult and add a primary_key attribute to their DBTColumn class here

Then make sure primary key metadata for the column class is added to the relevant mce or mcp e.g. https://github.com/acryldata/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/source/dbt/dbt_common.py#L1275

Primary Keys can be added via a list of column names given as an input to the primaryKeys arg of SchemaMetadataClass