Closed teeceeas closed 1 month ago
There are some notes on https://github.com/ministryofjustice/find-moj-data/issues/660 for running SQL against the datahub db. Make sure we don't run against the primary in production though. Do we have a read replica that we could point a BI tool to?
Potentially speak to AP to see whether Quicksight on AP could be used for this.
Some options for surfacing missing metadata
This would involve:
The downside of building this into the app is that it's very inflexible. Any time we want to change the reporting we'll need to write new code and redeploy.
There will also be some infrastructure changes required to configure the database access, but not as much as deploying a whole new app from scratch.
Rejected for beta phase due to risk of building things that aren't useful
(To be elaborated after speaking with AP team)
2 options for how data flows in
We can access the Power BI web interface & share reports via Microsoft 365. However, we can't connect this to arbitrary data sources through the web interface, and the desktop version only runs on windows. So I don't think we should use this, given we are all working on macs.
There are open source BI tools we could host ourselves for no additional cost e.g. redash, or Apache Superset. If we adopted one of these, we would need to configure it behind Entra ID, as we have done with Datahub and Find MoJ data.
This would increase the maintenance burden though and increase the costs associated with hosting on the cloud platform.
Rejected for beta phase This seems like overkill for what we actually need at the moment, so I'm ruling this out for now.
A reporting script that runs on a schedule from within the Datahub namespace, and outputs CSV that can be emailed/uploaded somewhere.
This is essentially the same solution as 1) but without a web UI.
As above, but we could author jupyter notebooks to visualise the data as a report.
Possible issues
Metadata completeness framework spreadsheet
This contains the fields that:
I recommend starting out by focusing on owner, subject area and descriptions.
While we're in the PoC / R&D stage it may be wise to avoid #1 ("In Find MoJ Data, or another bespoke app") until we have a much clearer idea of what the reporting requirements are. Even if that is the most desirable end state for surfacing issues and success stories to our users.
Otherwise we risk building things which don't end up being useful.
+1 to "open source reporting tool we self-host" being overkill. We should align with AP rather than standing up alternatives - it's not just the tech & hosting overhead, but potentially the governance and security.
On the spreadsheet, it would also be useful to have breakdowns for populations of owner, subject area, and descriptions on other fields so data owners can see "their" population rates at a glance. A good place to start would be to break the stats down per database/dashboard/container.
@murdo-moj yeah agreed, I've added a list of possible breakdowns to the 2nd tab of the spreadsheet.
I was originally thinking we would break it down by owner, but perhaps Platform/Dashboard/Container is a better starting point, while we have incomplete owners.
urn
Example: table metadata
table_urn | container_urn | container_name | platform_urn | platform_name | name | owner_present | subject_area | description | ... |
---|---|---|---|---|---|---|---|---|---|
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
Example:
urn | parent_urn | parent_name | platform_urn | platform_name | entity_type | field name | value | value_length | value_present |
---|---|---|---|---|---|---|---|---|---|
... | ... | ... | ... | ... | table | description | abcdef | 6 | true |
... | ... | ... | ... | ... | table | schema | true | ||
... | ... | ... | ... | ... | column | description | ghijk | 5 | true |
TODO:
If we can get access to Quicksight, try it out with some fake metadata.
E.g.
Spoke to Darren and Julia and the AP Quicksight can definitely support this. For the purposes of the spike I will be able to test in data engineering sandbox A.
Quicksight will integrate nicely with Find MoJ data as it's possible to embed the dashboard into another service. AP can configure it so it's whitelisted to the Find MoJ data domain, and then we will serve it on a new page, behind authentication.
In terms of getting data in, we discussed 2 possible options:
Option 1 is making a direct connection from Quicksight to Datahub's postgres database. This would require some routing and security group amendments so that AP's account can reach addresses internal to the CP. See slack thread on connecting AP and MP
Option 2 is creating an S3 bucket in AP, in the usual way (@LavMatt and @murdo-moj might need to help with this). Then we would schedule a service to run within the Datahub namespace and extract data to the bucket. It is possible to just dump arbitrary data here, but ideally, we would output the data in a format that Athena can read (see data formats supported by Athena), because quicksight integrates well with athena.
It is possible to add controls to the dashboard so the user can filter stuff, but the UI for this is really not great. There are different types of control but the most practical seems to be a drop down list.
Example:
As a starting point, we can include a control for filtering by the parent entity name.
The steps seem to be:
The filters are associated with individual charts within the dashboard. They can include/exclude data from the chart.
Can be used to show the overall completeness of a field as a percentage, or the total number of issues. Shows up as a big number.
Tables can be used to show the details of entities with missing metadata
With pivot tables it is possible to apply sums or percentages across rows/columns of the table, which seems like it should be helpful, but in practice I was unable to get it to do anything useful.
Could be used to visualise the quantity and quality of metadata by parent entity (or domain, platform, or owner)
https://docs.aws.amazon.com/quicksight/latest/user/share-a-dashboard-grant-access-everyone.html
It is possible to embed dashboards on a page in the Find MoJ data app.
See https://docs.aws.amazon.com/quicksight/latest/user/embedded-analytics-dashboards-for-everyone.html
AllowedDomains should be set to our Find MoJ data domains.
As a Developer, I want to conduct a spike to investigate how to identify missing metadata fields, so that we can implement a mechanism to track and report on metadata quality levels and highlight datasets with incomplete metadata.
Acceptance Criteria Investigate how to surface and report on datasets with missing critical fields (e.g., missing data owner, description etc). This is for users interested in data governance, not catalogue end users. Research and define appropriate thresholds for metadata completeness, i.e. what are the critical fields that have to be completed