Open seanprivett opened 9 months ago
Investigate process for importing domains from CDT. Import to test environment to avoid disrupting UR.
See the domains (the folders) here: https://github.com/moj-analytical-services/create-a-derived-table/tree/main/mojap_derived_tables/models
The short answer is yes, we can pick up domains from create-a-derived-table in an automated way.
But there are some points to consider in the details of implementation.
Although DBT recognises domains as a core concept of data management, it does not capture them as an explicit metadata property, i.e. a domain does not have its own key/value in the overall DBT manifest json file (the file containing all table metadata/configurations).
This video from DBT shows create-a-derived-table follows the recommended implementation of data domains through folders.
It's possible to create a custom ingestion source for create-a-derived-table that derives domains from the manifest json created on each run of DBT and ingests them. Because of the point raised above (no domain key), the domain has to be inferred from another key, and there are two approaches that could be taken:
1. The `fqn` key (fully qualified name). This is a list created by DBT related to the path of tables, where the 2nd item of that list is always the domain folder.
2. The `external_location` key, which is the s3 path to table files. This follows hive partition pathing conventions e.g. `domain=prison/database_name=db_1/table_name=tb1/...`
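
As a rough illustration of the two approaches, here is a minimal sketch of how a domain could be derived from a manifest node. It is not the PoC code. The function names, the sample manifest and the bucket name are hypothetical, and it assumes `external_location` sits under each node's `config` block, which may differ in the real manifest.

```python
import re
from typing import Optional

# Matches a hive-style "domain=<value>" path segment in an s3 path.
_DOMAIN_RE = re.compile(r"(?:^|/)domain=([^/]+)")


def domain_from_fqn(node: dict) -> Optional[str]:
    """Approach 1: take the 2nd item of the fqn list as the domain folder."""
    fqn = node.get("fqn", [])
    # With only [project, model] there is no folder between them, so no domain.
    return fqn[1] if len(fqn) > 2 else None


def domain_from_external_location(node: dict) -> Optional[str]:
    """Approach 2: parse the domain from the s3 path.

    Assumption: external_location is stored under the node's "config" key.
    """
    location = node.get("config", {}).get("external_location") or ""
    match = _DOMAIN_RE.search(location)
    return match.group(1) if match else None


def collect_domains(manifest: dict) -> set:
    """Collect distinct domains across all model nodes in a manifest dict."""
    domains = set()
    for node in manifest.get("nodes", {}).values():
        if node.get("resource_type") != "model":
            continue
        # Prefer fqn (option 1), fall back to the s3 path (option 2).
        domain = domain_from_fqn(node) or domain_from_external_location(node)
        if domain:
            domains.add(domain)
    return domains


# Hypothetical manifest fragment, for illustration only.
sample_manifest = {
    "nodes": {
        "model.mojap_derived_tables.tb1": {
            "resource_type": "model",
            "fqn": ["mojap_derived_tables", "prison", "db_1", "tb1"],
            "config": {
                "external_location": (
                    "s3://example-bucket/domain=prison/"
                    "database_name=db_1/table_name=tb1/"
                )
            },
        }
    }
}

domains = collect_domains(sample_manifest)
```

In practice the dict would come from `json.load` on the `manifest.json` that DBT writes on each run. The `fqn` route needs no assumptions about s3 layout, whereas the `external_location` route depends on the `domain=` path convention staying in place.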
I have created a PoC custom ingestion source following option 1 and have ingested domains into the Datahub test env
Talking with @SoumayaMauthoorMOJ, there is a chance that domains will be introduced as dynamic tags within the config of create-a-derived-table, as part of the work planned to alter how domains are represented within s3 paths.
Resources: https://mojdt.slack.com/archives/C03QZ776JVA/p1708360503491379?thread_ts=1708360446.980769&cid=C03QZ776JVA