Spike: how might we identify missing metadata fields

teeceeas commented 2 months ago

As a Developer, I want to conduct a spike to investigate how to identify missing metadata fields, so that we can implement a mechanism to track and report on metadata quality levels and highlight datasets with incomplete metadata.

Acceptance Criteria Investigate how to surface and report on datasets with missing critical fields (e.g., missing data owner, description etc). This is for users interested in data governance, not catalogue end users. Research and define appropriate thresholds for metadata completeness, i.e. what are the critical fields that have to be completed

MatMoore commented 1 month ago

There are some notes on https://github.com/ministryofjustice/find-moj-data/issues/660 for running SQL against the datahub db. Make sure we don't run against the primary in production though. Do we have a read replica that we could point a BI tool to?

Potentially speak to AP to see whether Quicksight on AP could be used for this.

MatMoore commented 1 month ago

Example of setting up a read replica on cloud platform

MatMoore commented 1 month ago

Some options for surfacing missing metadata

1. In Find MoJ Data, or another bespoke app

This would involve:

allowing Find MoJ data to connect to a replica of the Datahub DB (or setting up an ETL pipeline to move the data into Find MoJ DB)
designing one or more reports
- what filters can be applied?
- how do we display the results? tabular data? export to csv?
writing some queries that provide the data for the above

The downside of building this into the app is that it's very inflexible. Any time we want to change the reporting we'll need to write new code and redeploy.

There will also be some infrastructure changes required to configure the database access, but not as much as deploying a whole new app from scratch.

Rejected for beta phase due to risk of building things that aren't useful

2. In the analytical platform, via Quicksight

(To be elaborated after speaking with AP team)

2 options for how data flows in

via a script that generates CSV (see option 4), and register-my-data (preferable)
via a direct database connection between AP and CP

3. In Power BI

We can access the Power BI web interface & share reports via Microsoft 365. However, we can't connect this to arbitrary data sources through the web interface, and the desktop version only runs on windows. So I don't think we should use this, given we are all working on macs.

3. In an open source reporting tool we self-host

There are open source BI tools we could host ourselves for no additional cost e.g. redash, or Apache Superset. If we adopted one of these, we would need to configure it behind Entra ID, as we have done with Datahub and Find MoJ data.

This would increase the maintenance burden though and increase the costs associated with hosting on the cloud platform.

Rejected for beta phase This seems like overkill for what we actually need at the moment, so I'm ruling this out for now.

4. Email / file upload (not Google Sheets)

A reporting script that runs on a schedule from within the Datahub namespace, and outputs CSV that can be emailed/uploaded somewhere.

This is essentially the same solution as 1) but without a web UI.

5. Jupyter notebooks

As above, but we could author jupyter notebooks to visualise the data as a report.

Possible issues

To make it accessible, we'd still need to host the notebook somewhere (but we probably don't want it to be public)
We can present high level data but there would be no way to interactively drill down into the data / filter etc. So we would have to anticipate what to include and what not to.

MatMoore commented 1 month ago

Metadata completeness framework

Metadata completeness framework spreadsheet

This contains the fields that:

are displayed in Find MoJ data
are not automatically populated (e.g. column types)

I recommend starting out by focusing on owner, subject area and descriptions.

The most critical is that some kind of owner is set - without it we have no accountability and reporting on the other metadata items is unlikely to get us anywhere.
Subject area is important as it helps users find data that is relevant to them
Description is important as it helps user decide whether the data they have found is relevant or not

jemnery commented 1 month ago

While we're in the PoC / R&D stage it may be wise to avoid #1 ("In Find MoJ Data, or another bespoke app") until we have a much clearer idea of what the reporting requirements are. Even if that is the most desirable end state for surfacing issues and success stories to our users.

Otherwise we risk building things which don't end up being useful.

+1 to "open source reporting tool we self-host" being overkill. We should align with AP rather than standing up alternatives - it's not just the tech & hosting overhead, but potentially the governance and security.

murdo-moj commented 1 month ago

On the spreadsheet, it would also be useful to have breakdowns for populations of owner, subject area, and descriptions on other fields so data owners can see "their" population rates at a glance. A good place to start would be to break the stats down per database/dashboard/container.

MatMoore commented 1 month ago

@murdo-moj yeah agreed, I've added a list of possible breakdowns to the 2nd tab of the spreadsheet.

I was originally thinking we would break it down by owner, but perhaps Platform/Dashboard/Container is a better starting point, while we have incomplete owners.

MatMoore commented 1 month ago

Two options for building the datasets

If we want to display metadata about an entity side by side

Create one dataset per entity type (container, table, chart, dashboard), keyed by urn
Create one more dataset for table columns, keyed by table URN and column name
Include dimensions for container, platform, owner

Example: table metadata

table_urn	container_urn	container_name	platform_urn	platform_name	name	owner_present	subject_area	description	...
...	...	...	...	...	...	...	...	...	...

If we want to report on a single field at a time

Produce a dataset keyed by urn, entity type, field name
Include dimensions for parent entity, platform, owner

Example:

urn	parent_urn	parent_name	platform_urn	platform_name	entity_type	field name	value	value_length	value_present
...	...	...	...	...	table	description	abcdef	6	true
...	...	...	...	...	table	schema			true
...	...	...	...	...	column	description	ghijk	5	true

MatMoore commented 1 month ago

TODO:

If we can get access to Quicksight, try it out with some fake metadata.

E.g.

For a particular platform/container, show the % completeness of a a field (e.g. table description), aggregated across all tables
For a particular platform/container, show the % completeness of a field (e.g. column description), aggregated across all columns and tables
Present all tables/columns with missing field X, filtered by platform/container

MatMoore commented 1 month ago

Spoke to Darren and Julia and the AP Quicksight can definitely support this. For the purposes of the spike I will be able to test in data engineering sandbox A.

Quicksight will integrate nicely with Find MoJ data as it's possible to embed the dashboard into another service. AP can configure it so it's whitelisted to the Find MoJ data domain, and then we will serve it on a new page, behind authentication.

In terms of getting data in, we discussed 2 possible options:

Option 1 is making a direct connection from Quicksight to Datahub's postgres database. This would require some routing and security group amendments so that AP's account can reach addresses internal to the CP. See slack thread on connecting AP and MP

Option 2 is creating an S3 bucket in AP, in the usual way (@LavMatt and @murdo-moj might need to help with this). Then we would schedule a service to run within the Datahub namespace and extract data to the bucket. It is possible to just dump arbitrary data here, but ideally, we would output the data in a format that Athena can read (see data formats supported by Athena), because quicksight integrates well with athena.

MatMoore commented 1 month ago

Dashboard controls

It is possible to add controls to the dashboard so the user can filter stuff, but the UI for this is really not great. There are different types of control but the most practical seems to be a drop down list.

Example:

As a starting point, we can include a control for filtering by the parent entity name.

The steps seem to be:

The filters are associated with individual charts within the dashboard. They can include/exclude data from the chart.

Chart types we might want to use

KPIs

Can be used to show the overall completeness of a field as a percentage, or the total number of issues. Shows up as a big number.

Tables

Tables can be used to show the details of entities with missing metadata

Set the group by dimension to aggregate across parent entity (or domain, platform or owner)
Add a value for each measure you want to display, and set it to SUM
Add a filter to

Pivot tables

With pivot tables it is possible to apply sums or percentages across rows/columns of the table, which seems like it should be helpful, but in practice I was unable to get it to do anything useful.

Heatmap

Could be used to visualise the quantity and quality of metadata by parent entity (or domain, platform, or owner)

Set group by dimension to parent entity (or domain, platform, or owner)
Set colour dimension to a completeness metric

MatMoore commented 1 month ago

Granting access through Quicksight itself

https://docs.aws.amazon.com/quicksight/latest/user/share-a-dashboard-grant-access-everyone.html

Embedding dashboards in Find MoJ data

It is possible to embed dashboards on a page in the Find MoJ data app.

See https://docs.aws.amazon.com/quicksight/latest/user/embedded-analytics-dashboards-for-everyone.html

AllowedDomains should be set to our Find MoJ data domains.

MatMoore commented 1 month ago

ministryofjustice / find-moj-data