ministryofjustice / find-moj-data

Find MOJ data service • This repository is defined and managed in Terraform
MIT License

Spike: how might we identify missing metadata fields #910

Open teeceeas opened 2 weeks ago

teeceeas commented 2 weeks ago

As a Developer, I want to conduct a spike to investigate how to identify missing metadata fields, so that we can implement a mechanism to track and report on metadata quality levels and highlight datasets with incomplete metadata.

Acceptance Criteria

Investigate how to surface and report on datasets with missing critical fields (e.g. missing data owner or description). This is for users interested in data governance, not catalogue end users.

Research and define appropriate thresholds for metadata completeness, i.e. which critical fields have to be completed.
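As a rough illustration of what "missing critical fields" could mean in code, here is a minimal sketch. The field list, the record shape and the `urn:example:*` identifiers are all assumptions made for illustration; the real list of critical fields is what this spike is meant to define.

```python
# Hypothetical list of critical fields; not an agreed definition.
CRITICAL_FIELDS = ("owner", "description", "subject_area")


def missing_fields(dataset, critical=CRITICAL_FIELDS):
    """Return the critical fields that are absent, None, or blank strings."""
    missing = []
    for name in critical:
        value = dataset.get(name)
        if value is None or (isinstance(value, str) and not value.strip()):
            missing.append(name)
    return missing


def completeness(dataset, critical=CRITICAL_FIELDS):
    """Fraction of critical fields that are populated (0.0 to 1.0)."""
    return 1 - len(missing_fields(dataset, critical)) / len(critical)


# Made-up records standing in for catalogue entries.
datasets = [
    {"urn": "urn:example:a", "owner": "Jo", "description": "Court data", "subject_area": "Courts"},
    {"urn": "urn:example:b", "owner": None, "description": "  ", "subject_area": "Prisons"},
]

# Map each incomplete dataset to the fields it is missing.
incomplete = {d["urn"]: missing_fields(d) for d in datasets if missing_fields(d)}
```

A threshold could then be expressed as a minimum `completeness` score, or more strictly as "no critical field missing".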

MatMoore commented 1 week ago

There are some notes on https://github.com/ministryofjustice/find-moj-data/issues/660 for running SQL against the datahub db. Make sure we don't run against the primary in production though. Do we have a read replica that we could point a BI tool to?

Potentially speak to AP to see whether Quicksight on AP could be used for this.
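A sketch of the kind of SQL that could run against a read replica, using an in-memory SQLite database as a stand-in so the example is self-contained. The table name `metadata_aspect_v2`, the `ownership` aspect name and the "version 0 is the latest" convention are assumptions about the Datahub schema and should be checked against the notes in #660 before relying on them.

```python
import sqlite3

# In-memory stand-in for the read replica; never point this at the primary.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE metadata_aspect_v2 (urn TEXT, aspect TEXT, version INTEGER, metadata TEXT)"
)
conn.executemany(
    "INSERT INTO metadata_aspect_v2 VALUES (?, ?, ?, ?)",
    [
        ("urn:example:a", "datasetProperties", 0, "{}"),
        ("urn:example:a", "ownership", 0, "{}"),
        ("urn:example:b", "datasetProperties", 0, "{}"),
    ],
)

# Entities that have some current aspect but no current ownership aspect.
MISSING_OWNERSHIP_SQL = """
SELECT DISTINCT e.urn
FROM metadata_aspect_v2 AS e
WHERE e.version = 0
  AND NOT EXISTS (
    SELECT 1
    FROM metadata_aspect_v2 AS o
    WHERE o.urn = e.urn AND o.aspect = 'ownership' AND o.version = 0
  )
ORDER BY e.urn
"""

urns_missing_owner = [row[0] for row in conn.execute(MISSING_OWNERSHIP_SQL)]
```

Note that an ownership aspect can exist with an empty list of owners, so a real query would probably also need to inspect the JSON in `metadata`.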

MatMoore commented 1 day ago

Example of setting up a read replica on cloud platform

MatMoore commented 1 day ago

Some options for surfacing missing metadata

1. In Find MoJ Data, or another bespoke app

This would involve:

The downside of building this into the app is that it's very inflexible. Any time we want to change the reporting we'll need to write new code and redeploy.

There will also be some infrastructure changes required to configure the database access, but not as much as deploying a whole new app from scratch.

Rejected for beta phase due to risk of building things that aren't useful

2. In the analytical platform, via Quicksight

(To be elaborated after speaking with AP team)

2 options for how data flows in

3. In Power BI

We can access the Power BI web interface and share reports via Microsoft 365. However, we can't connect this to arbitrary data sources through the web interface, and the desktop version only runs on Windows. So I don't think we should use this, given we are all working on Macs.

3. In an open source reporting tool we self-host

There are open source BI tools we could host ourselves for no additional licence cost, e.g. Redash or Apache Superset. If we adopted one of these, we would need to configure it behind Entra ID, as we have done with Datahub and Find MoJ Data.

However, this would increase the maintenance burden and the costs associated with hosting on the Cloud Platform.

Rejected for beta phase. This seems like overkill for what we actually need at the moment, so I'm ruling this out for now.

4. Email / file upload (not Google Sheets)

A reporting script that runs on a schedule from within the Datahub namespace, and outputs CSV that can be emailed/uploaded somewhere.

This is essentially the same solution as 1) but without a web UI.
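The reporting script described above might end with something like the following sketch. The column names and the input row shape are hypothetical, and the scheduling and email/upload steps are left out because they depend on infrastructure that has not been decided.

```python
import csv
import io


def write_report(rows, out):
    """Write one CSV line per dataset, listing its missing fields."""
    writer = csv.DictWriter(out, fieldnames=["urn", "missing_fields"])
    writer.writeheader()
    for row in rows:
        writer.writerow(
            {"urn": row["urn"], "missing_fields": ";".join(row["missing"])}
        )


# Made-up input; in practice this would come from a query against the catalogue.
rows = [{"urn": "urn:example:b", "missing": ["owner", "description"]}]

# Writing to a buffer here; a scheduled job could write to a file instead.
buffer = io.StringIO()
write_report(rows, buffer)
report_csv = buffer.getvalue()
```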

5. Jupyter notebooks

As above, but we could author Jupyter notebooks to visualise the data as a report.

Possible issues

MatMoore commented 1 day ago

Metadata completeness framework

Metadata completeness framework spreadsheet

This contains the fields that:

I recommend starting out by focusing on owner, subject area and descriptions.

jemnery commented 22 hours ago

While we're in the PoC / R&D stage it may be wise to avoid #1 ("In Find MoJ Data, or another bespoke app") until we have a much clearer idea of what the reporting requirements are, even if that is the most desirable end state for surfacing issues and success stories to our users.

Otherwise we risk building things which don't end up being useful.

+1 to "open source reporting tool we self-host" being overkill. We should align with AP rather than standing up alternatives - it's not just the tech & hosting overhead, but potentially the governance and security.

murdo-moj commented 21 hours ago

On the spreadsheet, it would also be useful to break down how well owner, subject area and descriptions are populated by other fields, so data owners can see "their" population rates at a glance. A good place to start would be to break the stats down per database/dashboard/container.

MatMoore commented 21 hours ago

@murdo-moj yeah agreed, I've added a list of possible breakdowns to the 2nd tab of the spreadsheet.

I was originally thinking we would break it down by owner, but perhaps Platform/Dashboard/Container is a better starting point, while we have incomplete owners.
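A minimal sketch of such a breakdown, grouping by platform and computing the population rate of each field. The record shape and platform names are made up for illustration; the same function would work for a container or dashboard key.

```python
from collections import defaultdict


def population_rates(records, group_key, fields):
    """For each group, return the fraction of records with each field populated."""
    totals = defaultdict(int)
    populated = defaultdict(lambda: defaultdict(int))
    for record in records:
        group = record[group_key]
        totals[group] += 1
        for name in fields:
            # None and empty strings both count as unpopulated.
            if record.get(name):
                populated[group][name] += 1
    return {
        group: {name: populated[group][name] / totals[group] for name in fields}
        for group in totals
    }


# Hypothetical records; real ones would come from the catalogue metadata.
records = [
    {"platform": "Platform A", "owner": "Jo", "description": "x"},
    {"platform": "Platform A", "owner": None, "description": "y"},
    {"platform": "Platform B", "owner": None, "description": ""},
]

rates = population_rates(records, "platform", ["owner", "description"])
```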

MatMoore commented 20 hours ago

A possible approach for getting metadata into the AP

Diagram of a typical AP pipeline

We should implement the following transformations as part of the Airflow pipeline (or in CaDeT?):

The output could look something like this:

platform dimension

| platform_urn | platform_name |
| --- | --- |
| ... | Create a Derived Table |
| ... | Justice Data |

container metadata

| container_urn | platform_urn | owner_present | subject_area | name | display_name | description | ... |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ... | ... | ... | ... | ... | ... | ... | ... |

table metadata

| table_urn | container_urn | platform_urn | name | owner_present | subject_area | description | ... |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ... | ... | ... | ... | ... | ... | ... | ... |

column metadata

| column_name | container_urn | table_urn | platform_urn | description | ... |
| --- | --- | --- | --- | --- | --- |
| ... | ... | ... | ... | ... | ... |
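
One of the transformations could be sketched as below: flattening a catalogue entity into a row of the table metadata output, with `owner_present` derived as a boolean flag. The shape of the input dict (`owners`, `container_urn` and so on) is hypothetical; the actual export format from Datahub would determine the real field access.

```python
def to_table_row(entity):
    """Flatten one table entity into a row matching the table metadata columns."""
    return {
        "table_urn": entity["urn"],
        "container_urn": entity.get("container_urn"),
        "platform_urn": entity.get("platform_urn"),
        "name": entity.get("name"),
        # True only when at least one owner is recorded.
        "owner_present": bool(entity.get("owners")),
        "subject_area": entity.get("subject_area"),
        "description": entity.get("description"),
    }


# Made-up entity with an empty owners list and no description.
example_entity = {
    "urn": "urn:example:table",
    "container_urn": "urn:example:container",
    "platform_urn": "urn:example:platform",
    "name": "example_table",
    "owners": [],
    "subject_area": "Courts",
}

row = to_table_row(example_entity)
```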
MatMoore commented 19 hours ago

TODO:

If we can get access to Quicksight, try it out with some fake metadata.

E.g.