Closed LavMatt closed 2 months ago
the planned approach from DMET for primary keys in create-a-derived-table is to either:
primary_key
, see constraints; or{"primary_key": "some_column"}
with the contract constraint taking precedence over anything held in the glue catalog
electronic monitoring tables have begun to use contracts with constraints - see primary key example here
The manifest file does capture the primary_key
constraint.
However, the datahub dbt source ingestion does not appear to look for or ingest this metadata.
The deployed tables do not have this information written to glue by default, although this is potentially possible by using persist docs
Note that writing direct to glue catalog custom properties a key value pair of {"primary_key": "some_column"} is applicable for any solution, not just create-a-derived-table. The preferred method for create-a-derived-table is to use a contract.
https://github.com/moj-analytical-services/create-a-derived-table/pull/1787 this PR is set to be merged next week - can then test out of the dbt catalog file picks up any of custom properties from glue
Above PR has been merged but it seems that to work it is dependent on the upgrade to 1.8.0 which has not yet happened - Will move this to blocked until the upgrade is complete
cadet also has some models in production where a primary_key
tag has been added to column. This does propagate through to datahub via the ingestion as a tag for the column but is not recognised as the primary_key property of the column
dbt upgrade PR was merged today 29-Jul-2024 - This will mean we can resume looking at this from 30 Jul once prod models have been deployed again using upgraded package
UPDATE 30 Jul- The upgrade has caused scheduled deployments to prod to fail - This is not just a catalogue issue as it means the actual tables didn't deploy to aws as expected.
It does mean that the files we use to ingest from were not created either though
issues with the update have been resolved and models have deployed with the upgraded dbt version. Table proeprties that are written to models as part of the meta
key where persist_docs
is set to true (example https://github.com/moj-analytical-services/create-a-derived-table/pull/1787/files) are now also written to glue as table properties.
Unfortunately the dbt docs action has failed again because of a source not existing in aws - This should be sorted by tomorrow though so we can test ingesting from the new catalog file
Latest datahub dbt ingestion code does not capture the data even if it's recorded in dbt or in glue as custom table properties - via the catalog or manifest file. In fact the catalog file does not pick up any extra properties loaded to glue
It doesn't look like it would be too much effort to contribute to datahub's dbt ingestion code to get primary key metadata.
Adding code to the get_columns
functions in dbt_core, here to look for the primary_key constraint of a column from the manifest isn't too difficult and add a primary_key attribute to their DBTColumn
class here
Then make sure primary key metadata for the column class is added to the relevant mce or mcp e.g. https://github.com/acryldata/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/source/dbt/dbt_common.py#L1275
Primary Keys can be added via a list of column names given as an input to the primaryKeys
arg of SchemaMetadataClass
Firstly we need to understand the cadet guys intended plan (if they have even settle don one yet) or preferred option for recording primary key information. This is something they have voiced that they would like represented within the catalogue though. Soumaya would be good to talk to initially
Following a conversation understanding their options we should explore how they could be integrated into datahub (in the first instance) - How well do they work with native dbt ingestion.