ministryofjustice / find-moj-data

Find MOJ data service • This repository is defined and managed in Terraform
https://find-moj-data.service.justice.gov.uk/
MIT License

Spike: Add in column level descriptions to source system tables #886

Closed MatMoore closed 1 month ago

MatMoore commented 1 month ago

Tables from source systems (e.g. nomis, delius) mostly don't have column level descriptions.

Can we populate this from existing metadata - e.g. data discovery tool or derived/raw versions of the same tables?

There is a feature in Datahub for propagating descriptions based on lineage: https://datahubproject.io/docs/automations/docs-propagation/#open-source - is this useful for us?

LavMatt commented 1 month ago

Where are source system metadata available from?

It looks likely we'll need to populate this metadata in a number of different ways. Existing metadata are available from:

  1. The glue catalog. For some tables there are table and column descriptions which have come direct from the source, e.g. NOMIS. However, this is not the case for all of the curated databases.
  2. The data discovery tool. These will likely need further curation because in some instances the descriptions are too long for find-moj-data.

Otherwise it will need to be created by subject matter experts, and we'd need to identify people with the necessary knowledge to populate missing descriptions.

The route to Datahub

We should ensure that the create-a-derived-table repo can be the single point from which all the metadata are ingested.

We want to avoid pushing to single entities in datahub from different sources, e.g. cadet via the dbt ingestion, supplemented by custom code that writes descriptions from a source outside of cadet.

Descriptions sourced from the glue catalog

These will be the simplest to integrate into the existing cadet workflows. There already exists a generate_models.py script that creates yaml config files for all the source tables. It gets all table metadata from the glue catalog; however, it only uses a small amount of the data it retrieves (which includes descriptions, where present).

In the simplest terms, the Python script (if adding columns and descriptions to the generated yamls) would need to add new columns as defined in a table template (which it already does), plus all columns from the source other than those excluded in the generation yaml file. (This may be oversimplifying it.)
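As a rough illustration of the change described above, the script could map the Glue `get_table` response into dbt-style column entries, carrying descriptions through where the source has them. The function name, exclusion set, and response fragment below are all hypothetical, not the real cadet code; only the Glue response shape follows the boto3 documentation.

```python
# Hypothetical sketch: build dbt-style column configs from a Glue table
# definition, honouring the exclusions from the generation yaml file.
def columns_for_yaml(glue_table: dict, excluded: set) -> list:
    """Return dbt source column entries, with descriptions where present."""
    entries = []
    for col in glue_table["Table"]["StorageDescriptor"]["Columns"]:
        if col["Name"] in excluded:
            continue  # skip columns excluded in the generation yaml file
        entry = {"name": col["Name"]}
        description = col.get("Comment", "").strip()
        if description:
            entry["description"] = description  # only set when the source has one
        entries.append(entry)
    return entries


# Example Glue response fragment (shape as returned by boto3's glue.get_table;
# table and column names are invented for illustration)
glue_response = {
    "Table": {
        "Name": "offenders",
        "StorageDescriptor": {
            "Columns": [
                {"Name": "offender_id", "Comment": "Primary key from NOMIS"},
                {"Name": "birth_date", "Comment": ""},
                {"Name": "audit_ts"},
            ]
        },
    }
}

print(columns_for_yaml(glue_response, excluded={"audit_ts"}))
# → [{'name': 'offender_id', 'description': 'Primary key from NOMIS'}, {'name': 'birth_date'}]
```

Columns with an empty or missing `Comment` get a name-only entry, so the generated yaml stays valid even where the source has no description.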

We'd need to have rigorous tests for this, as we would not want to break the gha workflows by creating erroneous dbt configs for source models.

Descriptions sourced from the data discovery tool

We'd potentially need to spend a bit more time curating these for find-moj-data appropriateness and decide whether DDT or glue catalog descriptions take precedence.

This part will need much more thinking through, but likely we'd want to transfer these metadata to somewhere within the cadet repo and then have the generate_models.py script combine them with the above, populating the resulting yaml configs with table/column descriptions from both glue and DDT.
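One way the combination step could work is a simple per-column merge with an agreed precedence order. The sketch below assumes glue descriptions win and DDT fills the gaps, but that precedence is an open decision, and the dict shapes are invented for illustration.

```python
# Illustrative sketch only: combine column descriptions from the glue catalog
# and the data discovery tool (DDT). Precedence order is an assumption.
def merge_descriptions(glue: dict, ddt: dict) -> dict:
    """Prefer glue catalog descriptions; fall back to the DDT."""
    merged = {}
    for column in glue.keys() | ddt.keys():
        description = glue.get(column) or ddt.get(column)
        if description:  # drop columns with no description from either source
            merged[column] = description
    return merged


glue_descs = {"offender_id": "Primary key from NOMIS", "birth_date": ""}
ddt_descs = {"birth_date": "Date of birth as recorded at first reception"}

print(merge_descriptions(glue_descs, ddt_descs))
```

Here `birth_date` has an empty glue description, so the DDT description is used; `offender_id` keeps its glue description.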

Descriptions sourced from subject matter experts

This would be the most difficult to implement, as it would be dependent on things outside of our control, such as securing time for the people with the required knowledge to populate metadata, and engagement from other areas to help identify who these people might actually be. We'd likely need senior-level buy-in if this is something we wanted to do.

I think it would be best to use the same method as above - have these metadata stored within the cadet repo.

Even if these people are identified and able to contribute, they might not be technical staff or comfortable interacting with create-a-derived-table, so we may need to facilitate a more non-technical collection process. Speaking with the HMCTS catalogue team, they facilitated a similar process by sending out spreadsheets for subject matter experts to fill in. We could do something similar, then convert the spreadsheet data into yaml format for our use.
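The spreadsheet-to-yaml step could be a small conversion script. The sketch below assumes a CSV export with `table,column,description` headings and a dbt-source-style output layout; both are guesses at what we'd actually agree on, and the yaml is hand-built here just to keep the example dependency-free (in practice a yaml library would be safer for quoting and escaping).

```python
# Hedged sketch: turn a filled-in SME spreadsheet (exported as CSV) into a
# dbt-style source yaml fragment. Column headings and layout are assumptions.
import csv
import io


def spreadsheet_to_yaml(csv_text: str) -> str:
    """Convert rows of (table, column, description) into dbt source yaml."""
    tables = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        tables.setdefault(row["table"], []).append(
            (row["column"], row["description"])
        )

    lines = ["sources:", "  - name: nomis", "    tables:"]  # source name invented
    for table, cols in tables.items():
        lines.append(f"      - name: {table}")
        lines.append("        columns:")
        for name, desc in cols:
            lines.append(f"          - name: {name}")
            if desc:
                lines.append(f'            description: "{desc}"')
    return "\n".join(lines)


sheet = """table,column,description
offenders,offender_id,Unique offender identifier
offenders,birth_date,Date of birth at first reception
"""
print(spreadsheet_to_yaml(sheet))
```

Keeping the converted output in the cadet repo, as suggested above, would let generate_models.py treat SME descriptions the same way as the glue and DDT ones.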

Other considerations

LavMatt commented 1 month ago

A meeting is scheduled for 02/10 with Soumaya and Gwion to discuss

LavMatt commented 1 month ago

After chatting with Soumaya and Gwion in Data Engineering, I think we should focus first on surfacing what's collected direct from source systems and already available in the glue catalog.

There is a desire not to over-complicate create-a-derived-table, and on reflection I'm in agreement with this. So I think we should look to park the data discovery tool aspect of the descriptions metadata.

The preferred solution would be to have source database tables (curated silver) in find-moj-data reflect the descriptions as they are held in source systems. However, not all of the source system pipelines currently bring these table and column level descriptions into the AP (aws glue catalog).

There would also need to be an exercise carried out to understand for which source systems this metadata is actually available to ingest.

Resulting actions

  1. Refactor the generate_models.py cadet script to use the table and column descriptions pulled from glue in the yaml dbt config files that are produced for each table. These would need robust tests. We also discussed trialling the new process on a dummy database first, e.g. we could create another nomis database model template to test out the end-to-end process. If we introduced errors into generated model config files, it would cause quite severe disruption to business-critical pipelines.
  2. Work to understand for which source system pipelines it would be possible to extract table and column level descriptions direct from source. This would be for those that currently have no metadata in the glue catalog.
  3. Agree with Data Engineering how we might work towards ingesting these metadata for source systems as part of existing pipelines, where they are not currently collected but have been shown to be available at source. (Could the data catalogue team support the development of DE pipelines to ingest additional metadata?)