ministryofjustice / find-moj-data

Find MOJ data service • This repository is defined and managed in Terraform
https://find-moj-data.service.justice.gov.uk/
MIT License
5 stars 0 forks source link

Spike: how might we identify missing metadata fields #910

Closed teeceeas closed 1 month ago

teeceeas commented 2 months ago

As a Developer, I want to conduct a spike to investigate how to identify missing metadata fields, so that we can implement a mechanism to track and report on metadata quality levels and highlight datasets with incomplete metadata.

Acceptance Criteria Investigate how to surface and report on datasets with missing critical fields (e.g., missing data owner, description etc). This is for users interested in data governance, not catalogue end users. Research and define appropriate thresholds for metadata completeness, i.e. what are the critical fields that have to be completed

MatMoore commented 1 month ago

There are some notes on https://github.com/ministryofjustice/find-moj-data/issues/660 for running SQL against the datahub db. Make sure we don't run against the primary in production though. Do we have a read replica that we could point a BI tool to?

Potentially speak to AP to see whether Quicksight on AP could be used for this.

MatMoore commented 1 month ago

Example of setting up a read replica on cloud platform

MatMoore commented 1 month ago

Some options for surfacing missing metadata

1. In Find MoJ Data, or another bespoke app

This would involve:

The downside of building this into the app is that it's very inflexible. Any time we want to change the reporting we'll need to write new code and redeploy.

There will also be some infrastructure changes required to configure the database access, but not as much as deploying a whole new app from scratch.

Rejected for beta phase due to risk of building things that aren't useful

2. In the analytical platform, via Quicksight

(To be elaborated after speaking with AP team)

2 options for how data flows in

3. In Power BI

We can access the Power BI web interface & share reports via Microsoft 365. However, we can't connect this to arbitrary data sources through the web interface, and the desktop version only runs on windows. So I don't think we should use this, given we are all working on macs.

3. In an open source reporting tool we self-host

There are open source BI tools we could host ourselves for no additional cost e.g. redash, or Apache Superset. If we adopted one of these, we would need to configure it behind Entra ID, as we have done with Datahub and Find MoJ data.

This would increase the maintenance burden though and increase the costs associated with hosting on the cloud platform.

Rejected for beta phase This seems like overkill for what we actually need at the moment, so I'm ruling this out for now.

4. Email / file upload (not Google Sheets)

A reporting script that runs on a schedule from within the Datahub namespace, and outputs CSV that can be emailed/uploaded somewhere.

This is essentially the same solution as 1) but without a web UI.

5. Jupyter notebooks

As above, but we could author jupyter notebooks to visualise the data as a report.

Possible issues

MatMoore commented 1 month ago

Metadata completeness framework

Metadata completeness framework spreadsheet

This contains the fields that:

I recommend starting out by focusing on owner, subject area and descriptions.

jemnery commented 1 month ago

While we're in the PoC / R&D stage it may be wise to avoid #1 ("In Find MoJ Data, or another bespoke app") until we have a much clearer idea of what the reporting requirements are. Even if that is the most desirable end state for surfacing issues and success stories to our users.

Otherwise we risk building things which don't end up being useful.

+1 to "open source reporting tool we self-host" being overkill. We should align with AP rather than standing up alternatives - it's not just the tech & hosting overhead, but potentially the governance and security.

murdo-moj commented 1 month ago

On the spreadsheet, it would also be useful to have breakdowns for populations of owner, subject area, and descriptions on other fields so data owners can see "their" population rates at a glance. A good place to start would be to break the stats down per database/dashboard/container.

MatMoore commented 1 month ago

@murdo-moj yeah agreed, I've added a list of possible breakdowns to the 2nd tab of the spreadsheet.

I was originally thinking we would break it down by owner, but perhaps Platform/Dashboard/Container is a better starting point, while we have incomplete owners.

MatMoore commented 1 month ago

Two options for building the datasets

If we want to display metadata about an entity side by side

Example: table metadata

table_urn container_urn container_name platform_urn platform_name name owner_present subject_area description ...
... ... ... ... ... ... ... ... ... ...

If we want to report on a single field at a time

Example:

urn parent_urn parent_name platform_urn platform_name entity_type field name value value_length value_present
... ... ... ... ... table description abcdef 6 true
... ... ... ... ... table schema true
... ... ... ... ... column description ghijk 5 true
MatMoore commented 1 month ago

TODO:

If we can get access to Quicksight, try it out with some fake metadata.

E.g.

MatMoore commented 1 month ago

Spoke to Darren and Julia and the AP Quicksight can definitely support this. For the purposes of the spike I will be able to test in data engineering sandbox A.

Quicksight will integrate nicely with Find MoJ data as it's possible to embed the dashboard into another service. AP can configure it so it's whitelisted to the Find MoJ data domain, and then we will serve it on a new page, behind authentication.

In terms of getting data in, we discussed 2 possible options:

Option 1 is making a direct connection from Quicksight to Datahub's postgres database. This would require some routing and security group amendments so that AP's account can reach addresses internal to the CP. See slack thread on connecting AP and MP

Option 2 is creating an S3 bucket in AP, in the usual way (@LavMatt and @murdo-moj might need to help with this). Then we would schedule a service to run within the Datahub namespace and extract data to the bucket. It is possible to just dump arbitrary data here, but ideally, we would output the data in a format that Athena can read (see data formats supported by Athena), because quicksight integrates well with athena.

MatMoore commented 1 month ago

Dashboard controls

It is possible to add controls to the dashboard so the user can filter stuff, but the UI for this is really not great. There are different types of control but the most practical seems to be a drop down list.

Example: Image

As a starting point, we can include a control for filtering by the parent entity name.

The steps seem to be:

  1. Define a parameter
  2. Define a control for the parameter
  3. Use the parameter in a filter

The filters are associated with individual charts within the dashboard. They can include/exclude data from the chart.

Chart types we might want to use

KPIs

Can be used to show the overall completeness of a field as a percentage, or the total number of issues. Shows up as a big number.

Tables

Tables can be used to show the details of entities with missing metadata

Pivot tables

With pivot tables it is possible to apply sums or percentages across rows/columns of the table, which seems like it should be helpful, but in practice I was unable to get it to do anything useful.

Heatmap

Could be used to visualise the quantity and quality of metadata by parent entity (or domain, platform, or owner)

Image

MatMoore commented 1 month ago

Granting access through Quicksight itself

https://docs.aws.amazon.com/quicksight/latest/user/share-a-dashboard-grant-access-everyone.html

Embedding dashboards in Find MoJ data

It is possible to embed dashboards on a page in the Find MoJ data app.

See https://docs.aws.amazon.com/quicksight/latest/user/embedded-analytics-dashboards-for-everyone.html

AllowedDomains should be set to our Find MoJ data domains.

MatMoore commented 1 month ago

Next steps