It's looking likely we'll need to populate this metadata in a number of different ways. Existing metadata is available from:
Otherwise it will need to be created by subject matter experts, and we'd need to identify people with the necessary knowledge to populate the missing descriptions.
We should ensure that the create-a-derived-table repo can be the single point from which all the metadata is ingested.
We want to avoid pushing to single entities in DataHub from different sources, e.g. cadet via the dbt ingestion, supplemented by custom code that writes descriptions from a source outside of cadet.
These will be the simplest to integrate into the existing cadet workflows. There is already a generate_models.py
script that creates yaml config files for all the source tables. It gets all table metadata from the glue catalog, but it only uses a small amount of the data it retrieves (which includes descriptions where present).
In the simplest terms, the python script (if adding columns and descriptions to the generated yamls) would need to add new columns as defined in a table template (which it already does), plus all columns from the source other than those excluded in the generation yaml file. (This may be oversimplifying it.)
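As a very rough illustration (not the actual cadet implementation; the function names, excluded column and output layout are all assumptions), the script could build the source config along these lines:

```python
# Hypothetical sketch of pulling glue table/column descriptions into a
# dbt-style source config. Names here are illustrative, not the real cadet code.
import boto3
import yaml


def build_source_yaml(database_name: str, excluded_columns: set[str]) -> dict:
    """Build a dbt-style sources dict for every table in a glue database."""
    glue = boto3.client("glue")
    tables = []

    paginator = glue.get_paginator("get_tables")
    for page in paginator.paginate(DatabaseName=database_name):
        for table in page["TableList"]:
            columns = [
                # glue stores column descriptions in the "Comment" field
                {"name": col["Name"], "description": col.get("Comment", "")}
                for col in table.get("StorageDescriptor", {}).get("Columns", [])
                if col["Name"] not in excluded_columns
            ]
            tables.append(
                {
                    "name": table["Name"],
                    "description": table.get("Description", ""),
                    "columns": columns,
                }
            )

    return {"version": 2, "sources": [{"name": database_name, "tables": tables}]}


if __name__ == "__main__":
    # "mojap_extraction_ts" is just an example of an excluded column
    config = build_source_yaml("nomis", excluded_columns={"mojap_extraction_ts"})
    print(yaml.safe_dump(config, sort_keys=False))
```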
We'd need rigorous tests for this, as we would not want to break the GHA workflows by creating erroneous dbt configs for source models.
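A minimal sketch of the kind of checks we might want, assuming a hypothetical validate_source_config helper; the rules shown (parseable yaml, version 2, no empty table or column names) are just examples:

```python
# Illustrative pytest checks, not the real cadet test suite.
from textwrap import dedent

import yaml


def validate_source_config(text: str) -> list[str]:
    """Return a list of problems found in a generated dbt source yaml string."""
    problems = []
    config = yaml.safe_load(text)
    if config.get("version") != 2:
        problems.append("version must be 2")
    for source in config.get("sources", []):
        for table in source.get("tables", []):
            if not table.get("name"):
                problems.append("table with missing name")
            for column in table.get("columns", []):
                if not column.get("name"):
                    problems.append(f"column with missing name in {table.get('name')}")
    return problems


def test_valid_config_passes():
    good = dedent(
        """
        version: 2
        sources:
          - name: nomis
            tables:
              - name: offenders
                columns:
                  - name: offender_id
                    description: Unique identifier for the offender
        """
    )
    assert validate_source_config(good) == []


def test_missing_column_name_is_caught():
    bad = dedent(
        """
        version: 2
        sources:
          - name: nomis
            tables:
              - name: offenders
                columns:
                  - description: a column with no name
        """
    )
    assert "column with missing name in offenders" in validate_source_config(bad)
```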
We'd potentially need to spend a bit more time curating these for suitability in find-moj-data, and decide whether DDT or glue catalog descriptions take precedence.
This part will need much more thinking through, but we'd likely want to transfer this metadata to somewhere within the cadet repo and then have the generate_models.py script combine it with the above, populating the resulting yaml configs with table/column descriptions from both glue and DDT.
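For example, the combination step could be as simple as a per-column merge with a configurable precedence; this is a hypothetical sketch, not an agreed design:

```python
# Hypothetical merge of glue and Data Discovery Tool (DDT) descriptions for a
# single column. Which source wins is a policy decision we'd still need to make.
from typing import Optional


def merge_column_description(glue_description: Optional[str],
                             ddt_description: Optional[str],
                             prefer: str = "ddt") -> str:
    """Pick one description per column, falling back to the other source."""
    glue_description = (glue_description or "").strip()
    ddt_description = (ddt_description or "").strip()
    ordered = (
        [ddt_description, glue_description]
        if prefer == "ddt"
        else [glue_description, ddt_description]
    )
    # Return the first non-empty description, or an empty string if neither
    # source has one (leaving the gap visible for subject matter experts).
    return next((d for d in ordered if d), "")
```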
This would be the most difficult to implement, as it would be dependent on things outside of our control, like giving the people with the required knowledge time to populate metadata, and engagement from other areas to help identify who these people might actually be. We'd likely need senior level buy-in if this is something we wanted to do.
I think it would be best to use the same method as above: have this metadata stored within the cadet repo.
Even if these people are identified and able to contribute, they might not be technical staff or comfortable interacting with create-a-derived-table, so we may need to facilitate a more non-technical collection process. Speaking with the HMCTS catalogue team, they facilitated a similar process by sending out spreadsheets for subject matter experts to fill in. We could do something similar, then convert the spreadsheet data into yaml format for our use.
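A rough sketch of that conversion, assuming the spreadsheet is exported to CSV with table_name, column_name and description headings (the headings, file name and output layout are assumptions, not an agreed format):

```python
# Convert a completed SME spreadsheet (exported to CSV) into yaml that could
# be stored in the cadet repo. Column headings are assumed, not agreed.
import csv
from collections import defaultdict

import yaml


def spreadsheet_to_yaml(csv_path: str) -> str:
    """Group rows of (table, column, description) into a yaml document."""
    tables = defaultdict(list)
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            tables[row["table_name"]].append(
                {
                    "name": row["column_name"],
                    "description": row["description"].strip(),
                }
            )
    document = {
        "tables": [
            {"name": name, "columns": columns} for name, columns in tables.items()
        ]
    }
    return yaml.safe_dump(document, sort_keys=False)


if __name__ == "__main__":
    # example file name only
    print(spreadsheet_to_yaml("nomis_descriptions.csv"))
```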
A meeting is scheduled for 02/10 with Soumaya and Gwion to discuss this.
After chatting with Soumaya and Gwion in Data Engineering, I think the initial step should be to focus on surfacing what's collected directly from source systems and already available in the glue catalog.
There is a desire not to overly complicate create-a-derived-table, and on reflection I'm in agreement with this. So I think we should look to park the data discovery aspect of the descriptions metadata.
The preferred solution would be to have source database tables (curated silver) in find-moj-data reflect the descriptions as they are held in source systems. However, not all of the source system pipelines currently bring these table and column level descriptions into the AP (aws glue catalog).
There would also need to be an exercise to understand which source systems actually have this metadata available to ingest.
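One way to do that exercise might be a small script that reports description coverage per glue database; the database names and output format here are just examples:

```python
# Illustrative audit of which glue databases already have table and column
# descriptions available to ingest.
import boto3


def description_coverage(database_name: str) -> dict:
    """Count tables and columns with/without descriptions in a glue database."""
    glue = boto3.client("glue")
    stats = {"tables": 0, "tables_described": 0, "columns": 0, "columns_described": 0}

    paginator = glue.get_paginator("get_tables")
    for page in paginator.paginate(DatabaseName=database_name):
        for table in page["TableList"]:
            stats["tables"] += 1
            if table.get("Description"):
                stats["tables_described"] += 1
            for column in table.get("StorageDescriptor", {}).get("Columns", []):
                stats["columns"] += 1
                if column.get("Comment"):
                    stats["columns_described"] += 1
    return stats


if __name__ == "__main__":
    # example database names only
    for database in ["nomis", "delius"]:
        print(database, description_coverage(database))
```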
We would then update the generate_models.py cadet script to use the table and column descriptions pulled from glue in the yaml dbt config files that are produced for each table. These would need robust tests. We also discussed trialling the new process on a dummy database version, e.g. we could create another nomis database model template to test out the end-to-end process. If we introduced errors into the generated model config files it would cause some quite severe disruption to business critical pipelines.
Tables from source systems (e.g. nomis, delius) mostly don't have column level descriptions.
Can we populate this from existing metadata - e.g. data discovery tool or derived/raw versions of the same tables?
There is a feature in DataHub for propagating descriptions based on lineage: https://datahubproject.io/docs/automations/docs-propagation/#open-source - is this useful for us?