Spike: Find out which source systems have table/column descriptions available to pull direct from source databases

LavMatt commented 3 weeks ago

User Story

As a developer I want to understand and document the available metadata, with respect to column and table descriptions of source databases So that i can work towards ingesting it into the find-moj-data service

Value / Purpose

No response

Useful Contacts

Data engineering domain leads

User Types

No response

Hypothesis

If we... [do a thing] Then... [this will happen]

Proposal

We know there are description metadata available to get directly from source databases (like nomis, oasys) and that some of these metadata are already ingested into the AP and glue by DE pipelines.

What we don't know is where data are missing - what available and potentially possible to include in data engineering pipelines

We should find out!

We should also document the source databases that are external of create-a-derived-table but are in the glue catalog - these will not currently be ingested, they probably should be and also might have metadata available

This is quite a big ask potentially so we should maybe split the work further during a refinement sessions:

Document the gaps as is in the glue catalog by database.
We could then make individual tickets to investigate whether gaps are available to get direct from source per database or per domain.

This Spike could potentially do those 2 tasks

Additional Information

No response

Definition of Done

[ ] create a list all source databases on the AP (in the glue catalog) and not on the AP, and indicate whether the AP available databases are included within the current cadet ingestion.
[ ] Document the metadata gaps as is in the glue catalog by database, adding to the list whether description metadata available in glue.
[ ] make individual tickets to investigate whether gaps are available to get direct from source per database or per domain.

murdo-moj commented 3 days ago

https://justiceuk-my.sharepoint.com/:x:/g/personal/matthew_laverty_justice_gov_uk/EU5ljCQ_HbtNsO2Ub9qn49gBpF173qAY12FeH4IMxUw2RQ?email=Murdo.Moyse1%40justice.gov.uk&e=cfBHkW

MatMoore commented 2 days ago

I've added to the sharepoint a dump of all databases in the AP, and filtered out dev/sandbox stuff and staging data.

I'm in the process of trying to identify which ones are sources. In addition to the ones we know about, there's also things like the the People Survey, and more general purpose data such as Ordnance survey data. I'm assuming all of this is worth cataloguing.

IMO we don't need to understand everything that's in AP (if that is even possible), but a decent amount of these do have at a least a database-level description in Glue, so I think we can have one task just to ingest these ones as they are, and do that first.

Then there are those source systems we've identified but don't have any glue metadata. If we want to get this into the catalogue quickly, we could come up with database-level descriptions ourselves based on information on confluence etc. Then I think we can create tickets to engage the pipeline owners and enrich the table-level metadata via the glue catalogue.

I've also added a tab for derived data that is not coming from CaDeT, such as the Data First outputs. I'm assuming this is also valuable to catalogue, even though it doesn't fall under the scope of this ticket.

murdo-moj commented 1 day ago

Matt L and I are talking to Oliver Critchfield on Fri 25th. He's in charge of HMCTS datasets data engineering wise.

murdo-moj commented 1 day ago

I asked the data modellers for some more datasets which are in glue and not CaDeT but there was no response and we will need to target leads directly. https://asdslack.slack.com/archives/C03J21VFHQ9/p1729176561022829

ministryofjustice / find-moj-data