Spike: Can we pick up domains from create a derived table

seanprivett commented 9 months ago

Resources: https://mojdt.slack.com/archives/C03QZ776JVA/p1708360503491379?thread_ts=1708360446.980769&cid=C03QZ776JVA

YvanMOJdigital commented 9 months ago

Investigate process for importing domains from CDT. Import to test environment to avoid disrupting UR.

LavMatt commented 9 months ago

see domains here (the folders) https://github.com/moj-analytical-services/create-a-derived-table/tree/main/mojap_derived_tables/models

LavMatt commented 8 months ago

The short answer is yes, we can pick up domains from create-a-derived-table in an automated way.

But there are some points to consider in the details of implementation.

Although DBT recognises domains as a core concept of data management, it does not capture them as an explicit metadata property, i.e. a domain does not have its own key/value in the overall DBT manifest json file (the file containing all table metadata/configurations).

This video from DBT shows create-a-derived-table follows the recommended implementation of data domains through folders.

Domain Ingestion Methods

It's possible to create a custom ingestion method to handle domain ingestion from create-a-derived-table, which can derive and ingest domains as set in the manifest json created on each run of DBT. Because of the point raised above (no domain key) there are a couple of different approaches that could be taken:

1. Infer domains from the latest DBT manifest file using the fqn key (fully qualified name). This is a list created by DBT from the path of each table, where the 2nd item is always the domain folder.
2. Infer domains from the latest DBT manifest file using the external_location key, which is the s3 path to the table files. This follows hive partition pathing conventions, e.g. domain=prison/database_name=db_1/table_name=tb1/...
3. It could also be possible to add domain tags to model configurations and then pull these from the central manifest. This was not explored because it needs changes to create-a-derived-table projects.
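As a rough illustration of the first two approaches above, here is a minimal sketch in Python. It is not the PoC ingestion source itself. The sample manifest is made up (the bucket name, the "courts" domain and the model names are hypothetical), and it assumes external_location sits under each node's config, which may not match the real manifest layout.

```python
import re
from typing import Callable, Optional

# Made-up manifest fragment for illustration only; a real DBT manifest
# holds many more keys per node. Bucket, "courts" and model names are hypothetical.
SAMPLE_MANIFEST = {
    "nodes": {
        "model.mojap_derived_tables.db_1__tb1": {
            "resource_type": "model",
            "fqn": ["mojap_derived_tables", "prison", "db_1", "db_1__tb1"],
            "config": {
                "external_location": "s3://example-bucket/domain=prison/database_name=db_1/table_name=tb1/"
            },
        },
        "model.mojap_derived_tables.db_2__tb2": {
            "resource_type": "model",
            "fqn": ["mojap_derived_tables", "courts", "db_2", "db_2__tb2"],
            "config": {
                "external_location": "s3://example-bucket/domain=courts/database_name=db_2/table_name=tb2/"
            },
        },
        "seed.mojap_derived_tables.lookup": {
            "resource_type": "seed",
            "fqn": ["mojap_derived_tables", "lookup"],
            "config": {},
        },
    }
}

# Matches a hive-style "domain=<value>" path segment.
DOMAIN_SEGMENT = re.compile(r"(?:^|/)domain=([^/]+)")


def domain_from_fqn(node: dict) -> Optional[str]:
    """Option 1: take the 2nd item of fqn as the domain folder."""
    fqn = node.get("fqn", [])
    # Require at least project, domain and model name, so a model
    # sitting directly under the project is not mistaken for a domain.
    return fqn[1] if len(fqn) >= 3 else None


def domain_from_external_location(node: dict) -> Optional[str]:
    """Option 2: read the domain=... segment of the s3 path.

    Assumes external_location is stored under the node's config.
    """
    location = node.get("config", {}).get("external_location") or ""
    match = DOMAIN_SEGMENT.search(location)
    return match.group(1) if match else None


def collect_domains(
    manifest: dict,
    extractor: Callable[[dict], Optional[str]] = domain_from_fqn,
) -> set:
    """Return the distinct domain names found across model nodes."""
    domains = set()
    for node in manifest.get("nodes", {}).values():
        if node.get("resource_type") != "model":
            continue  # skip seeds, tests and other non-model nodes
        domain = extractor(node)
        if domain:
            domains.add(domain)
    return domains


if __name__ == "__main__":
    print(sorted(collect_domains(SAMPLE_MANIFEST)))
    print(sorted(collect_domains(SAMPLE_MANIFEST, domain_from_external_location)))
```

Running both extractors over the same manifest and comparing the results could also serve as a cheap check that the folder structure and the s3 pathing still agree.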

I have created a PoC custom ingestion source following option 1 and have ingested domains into the Datahub test env.

Potential Issues

Neither option is perfect; both could break if the folder structure or naming conventions change.
Custom ingestions cannot yet be run via the Datahub UI, so we couldn't set up and configure domain ingestion/alignment to run from our Datahub instance (we could run it via airflow if we want a regular schedule). Mat M thought we may be able to do this, though.
Datahub domains allow for other associated metadata, e.g. a description. With the current create-a-derived-table setup, these methods would only ingest the names of the domains.

LavMatt commented 8 months ago

From talking with @SoumayaMauthoorMOJ, there is a chance that domains will be introduced as dynamic tags within the config of create-a-derived-table, as part of the work planned to alter how domains are represented within s3 paths.

ministryofjustice / find-moj-data

Spike: Can we pick up domains from create a derived table #108
