Closed seanprivett closed 6 months ago
https://demo.datahubproject.io/dashboard/urn:li:dashboard:(looker,dashboards.13)/Charts?is_lineage_mode=false DataHub has a dashboard entity, but only with integration support for Looker. A QuickSight integration would be manual.
A dashboard here is a collection of charts, so the model is more granular than the dashboard level.
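As a rough illustration of that shape, the sketch below models a dashboard as a container of chart URNs using plain dataclasses. The dashboard URN pattern is copied from the demo link above; the chart URN is assumed to follow the same pattern. The two helper functions are local stand-ins, and the `quicksight` platform name and all IDs are made up for the example.

```python
from dataclasses import dataclass, field
from typing import List


def make_dashboard_urn(platform: str, dashboard_id: str) -> str:
    """Build a dashboard URN in the form seen in the demo link."""
    return f"urn:li:dashboard:({platform},{dashboard_id})"


def make_chart_urn(platform: str, chart_id: str) -> str:
    """Build a chart URN; assumed to mirror the dashboard pattern."""
    return f"urn:li:chart:({platform},{chart_id})"


@dataclass
class Dashboard:
    urn: str
    title: str
    # A dashboard is only a container: the detail lives on its charts.
    chart_urns: List[str] = field(default_factory=list)


# Hypothetical QuickSight dashboard with two charts (IDs are made up).
dashboard = Dashboard(
    urn=make_dashboard_urn("quicksight", "example-dashboard"),
    title="Example dashboard",
    chart_urns=[
        make_chart_urn("quicksight", "example-chart-1"),
        make_chart_urn("quicksight", "example-chart-2"),
    ],
)
print(dashboard.urn)
```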
This ticket is about registering any end products, not just QuickSight dashboards, in the catalogue.
Possible routes to start are:
Catalogue MoJ public dashboards that do not require any authentication. There are few examples that I am aware of and they are not 'simple' dashboards for MoJ staff use, but products intended for public consumption, for example
Catalogue live R-Shiny apps using AP data that have been catalogued already as part of migrating them to the cloud platform. The spreadsheet lists 91 apps, of which 2 are blocked for migration, so we currently know of 89 live apps.
Catalogue live QuickSight dashboards (using AP data). We don't currently know how many live QuickSight dashboards exist; @julialawrence has identified a tool that might help surface this: https://community.amazonquicksight.com/t/measure-the-adoption-of-your-amazon-quicksight-dashboards-and-view-your-bi-portfolio-in-a-single-pane-of-glass/7198
Comment from Calum on the above: Stats and published MI themselves could be considered ‘public end products’. It just depends on the scope/definition.
Docs for dashboard and chart entities:
https://datahubproject.io/docs/generated/metamodel/entities/dashboard/ https://datahubproject.io/docs/generated/metamodel/entities/chart/
How to add a custom ingestion source: https://datahubproject.io/docs/how/add-custom-ingestion-source/
Example code for the Looker integration: https://github.com/datahub-project/datahub/blob/ed10a8d8cca3b17e982db6d14ea435833c5a87ea/metadata-ingestion/src/datahub/ingestion/source/looker/looker_source.py#L766
# Excerpt: build a ChartSnapshot for one dashboard element.
chart_urn = builder.make_chart_urn(
    self.source_config.platform_name, dashboard_element.get_urn_element_id()
)
chart_snapshot = ChartSnapshot(
    urn=chart_urn,
    aspects=[Status(removed=False)],
)

chart_type = self._get_chart_type(dashboard_element)
chart_info = ChartInfoClass(
    type=chart_type,
    description=dashboard_element.description or "",
    title=dashboard_element.title or "",
    lastModified=ChangeAuditStamps(),
    chartUrl=dashboard_element.url(self.source_config.external_base_url or ""),
    inputs=dashboard_element.get_view_urns(self.source_config),
    customProperties={
        "upstream_fields": ",".join(
            sorted(set(field.name for field in dashboard_element.input_fields))
        )
        if dashboard_element.input_fields
        else ""
    },
)
chart_snapshot.aspects.append(chart_info)

if dashboard and dashboard.folder_path is not None:
    browse_path = BrowsePathsClass(
        paths=[f"/looker/{dashboard.folder_path}/{dashboard.title}"]
    )
    chart_snapshot.aspects.append(browse_path)

if dashboard is not None:
    ownership = self.get_ownership(dashboard)
    if ownership is not None:
        chart_snapshot.aspects.append(ownership)
# Excerpt: build a DashboardSnapshot and wrap it in a MetadataChangeEvent.
dashboard_urn = builder.make_dashboard_urn(
    self.source_config.platform_name, looker_dashboard.get_urn_dashboard_id()
)
dashboard_snapshot = DashboardSnapshot(
    urn=dashboard_urn,
    aspects=[],
)

dashboard_info = DashboardInfoClass(
    description=looker_dashboard.description or "",
    title=looker_dashboard.title,
    charts=chart_urns,
    lastModified=self._get_change_audit_stamps(looker_dashboard),
    dashboardUrl=looker_dashboard.url(self.source_config.external_base_url),
)

dashboard_snapshot.aspects.append(dashboard_info)
if looker_dashboard.folder_path is not None:
    browse_path = BrowsePathsClass(
        paths=[f"/looker/{looker_dashboard.folder_path}"]
    )
    dashboard_snapshot.aspects.append(browse_path)

ownership = self.get_ownership(looker_dashboard)
if ownership is not None:
    dashboard_snapshot.aspects.append(ownership)

dashboard_snapshot.aspects.append(Status(removed=looker_dashboard.is_deleted))

dashboard_mce = MetadataChangeEvent(proposedSnapshot=dashboard_snapshot)

proposals: List[Union[MetadataChangeEvent, MetadataChangeProposalWrapper]] = [
    dashboard_mce
]

# If extracting embeds is enabled, produce an MCP for embed URL.
if (
    self.source_config.extract_embed_urls
    and self.source_config.external_base_url
):
    proposals.append(
        create_embed_mcp(
            dashboard_snapshot.urn,
            looker_dashboard.embed_url(self.source_config.external_base_url),
        )
    )
Embedded Looker charts are demoed here: https://www.youtube.com/watch?v=33hxlg4YgCQ
It's not obvious how we would map this to charts, because
Recommendation: catalogue each individual named chart, e.g. "Average days from police referring a case to the CPS and the CPS authorising a charge" and ignore the chart builder / map builder
The data sources are linked as Excel spreadsheets, so we don't have provenance tying these charts to datasets already catalogued.
Recommendation: let's leave the inputs blank for now, and worry about linking this to sources later
Github: https://github.com/ministryofjustice/cjs_scorecard_exploratory_analysis
Dashboard metadata can be taken from this yaml:
Name: "Criminal justice system (CJS) delivery data dashboard"
Category: "App"
Description: "A public facing dashboard which brings together and visualises a range of criminal justice data.
It gives an overview of the justice system; from the point a crime is recorded by the police, to when a case is completed in court.
The data within the dashboard is updated on a quarterly basis in line with stats publications."
Impact: "The CJS dashboard increases transparency and understanding of the justice system.
It has around 400 monthly users (mainly Local Criminal Justice Boards) and at a local level, the Dashboard presents a comprehensive
cross-system view of issues to stakeholders and encourages proactive and collaborative decision-making grounded in evidence-based
practice (a quote taken from the Rape Review progress update report published in July 2023)."
G6 lead: "Kim Brett"
SRO: "Ed Lidington (analytical sign-off = Damon Wingfield)"
Technical lead: "Laura Knowles"
Business lead: " Calum Barnett (Service Owner)"
Last review date: "Nov-23"
Next review date: "Feb-24"
Outage Impact: "Red"
Maintenance (FTE): "2"
Documentation: "https://justiceuk.sharepoint.com/sites/CJSScorecard/Shared%20Documents/Forms/AllItems.aspx"
Contact: "laura.knowles1@justice.gov.uk"
(This already populates the data science asset register)
As far as I can tell there is no obvious API that will give us chart metadata directly.
Chart content could perhaps be scraped from https://github.com/ministryofjustice/cjs_scorecard_exploratory_analysis/tree/develop/cjs_test_app/content
Recommendation: Compile the metadata into a yaml format and write a source that pulls it from a public github url. We can maintain the yaml ourselves for now.
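A minimal sketch of the mapping step in that recommendation, assuming the YAML has already been fetched and parsed into a mapping (for example by a YAML library). The key names `Name`, `Description`, `Contact` and `Documentation` come from the asset register entry above; `DashboardRecord`, `slugify`, `record_from_entry` and the `custom` platform name are hypothetical, and deriving the dashboard ID from the name is just one possible choice.

```python
import re
from dataclasses import dataclass
from typing import Mapping


@dataclass
class DashboardRecord:
    urn: str
    title: str
    description: str
    contact: str
    documentation_url: str


def slugify(name: str) -> str:
    """Lowercase the name and join its alphanumeric runs with hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")


def record_from_entry(
    entry: Mapping[str, str], platform: str = "custom"
) -> DashboardRecord:
    """Map one parsed register entry onto dashboard-level metadata."""
    title = entry["Name"].strip()
    # Multi-line YAML strings keep their line breaks; collapse them to spaces.
    description = " ".join(entry.get("Description", "").split())
    return DashboardRecord(
        urn=f"urn:li:dashboard:({platform},{slugify(title)})",
        title=title,
        description=description,
        contact=entry.get("Contact", "").strip(),
        documentation_url=entry.get("Documentation", "").strip(),
    )


# A shortened version of the entry above, as a parser might return it.
entry = {
    "Name": "Criminal justice system (CJS) delivery data dashboard",
    "Description": "A public facing dashboard which brings together and visualises\n"
    "a range of criminal justice data.",
}
record = record_from_entry(entry)
print(record.urn)
```

Chart-level entries could be added to the same YAML later without changing this mapping.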
This page contains a series of Power BI dashboards.
Can't link to individual charts, just pages.
There is a short description on the webpage but it's unclear who owns them.
Can trace sources to published CSV datasets, but not back to datasets we already catalogue.
Recommendation: Catalogue at the dashboard level?
Two more Power BI dashboards.
This is another one where we have named charts. One measure = one chart in DataHub?
Github: https://github.com/ministryofjustice/justice-data
This has an API, https://data.justice.gov.uk/api, so it is probably the easiest source to get started with.
There is also a publications API we can use for lineage: https://data.justice.gov.uk/api/publications. However, there is no point ingesting this until we have the actual publications catalogued.
Recommendation: Build some kind of connector that pulls from the justice data api. Leave lineage for later.
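A sketch of what such a connector might look like, following the "one measure = one chart" idea. Only the API root URL comes from this thread. The response fields `id`, `name` and `description` are placeholders for illustration, not the API's documented schema, and the `justice-data` platform name is also an assumption. The function names are hypothetical.

```python
import json
from typing import Any, Dict, Iterable, List
from urllib.request import urlopen

API_ROOT = "https://data.justice.gov.uk/api"


def fetch_json(url: str = API_ROOT) -> Any:
    """Fetch and decode a JSON document (not called in this example)."""
    with urlopen(url, timeout=30) as response:
        return json.load(response)


def charts_from_measures(
    measures: Iterable[Dict[str, Any]], platform: str = "justice-data"
) -> List[Dict[str, Any]]:
    """Turn each measure into chart-level metadata, one chart per measure."""
    charts = []
    for measure in measures:
        charts.append(
            {
                "urn": f"urn:li:chart:({platform},{measure['id']})",
                "title": measure.get("name", ""),
                "description": measure.get("description", ""),
                # Lineage is deliberately left empty for now.
                "inputs": [],
            }
        )
    return charts


# Made-up payload in the assumed shape; the real response may differ.
sample_payload = [
    {"id": "example-measure", "name": "Example measure", "description": "Placeholder"},
]
charts = charts_from_measures(sample_payload)
print(charts[0]["urn"])
```

Keeping the fetch and the mapping separate means the mapping can be checked against saved responses without calling the live API.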
Possible charts:
However the exact name of the chart varies depending on what filters are selected.
This dashboard highlights another issue with lineage. For end products like this, historical data may come from a different source than current data: "Source: MoJ COINS database until 2019 and MoJ CE-file database from 2020 onwards"
In DataHub we can associate multiple datasets with a chart, but we can't represent this time-based distinction: https://datahubproject.io/docs/generated/metamodel/entities/chart/#outgoing
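To make the limitation concrete, the sketch below lists both sources as chart inputs and keeps the date boundary only as free text. Storing the note in a custom property is one possible workaround, not something proposed in this thread; the platform names `coins` and `ce-file`, the table name, and the `source_note` key are all placeholders.

```python
from typing import Dict, List


def make_dataset_urn(platform: str, name: str, env: str = "PROD") -> str:
    """Build a dataset URN from a platform, a dataset name and an environment."""
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"


# Both upstream datasets can be attached to the chart...
inputs: List[str] = [
    make_dataset_urn("coins", "example_table"),
    make_dataset_urn("ce-file", "example_table"),
]

# ...but which years each one covers survives only as a free-text note.
custom_properties: Dict[str, str] = {
    "source_note": (
        "MoJ COINS database until 2019 and "
        "MoJ CE-file database from 2020 onwards"
    ),
}

for urn in inputs:
    print(urn)
```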
https://demo.datahubproject.io/search?filter_platform=urn:li:dataPlatform:looker
This forms the lineage from dataset -> chart -> dashboard
Clicking through to the dataset you can see the schema
Rough diagram of how the ingestion source is structured, using the looker source as an example.
The source emits either MetadataChangeEvent (MCE) or MetadataChangeProposalWrapper (MCP) objects. The difference is that an MCE may change multiple aspects, but an MCP changes only one.
https://github.com/acryldata/meta-world/blob/master/custom_sources/src/my-source/custom_ingestion_source.py is a simpler example that overrides get_workunits rather than get_workunits_internal. However, this bypasses some default behaviour to do with browse paths and lowercasing URNs.
It does demonstrate the process of converting from MCEs to MetadataWorkUnits (MWUs):
item = MetadataChangeEvent.from_obj(obj)
wu = MetadataWorkUnit("single_mce", mce=item)
self.report.report_workunit(wu)
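The MCE/MCP distinction can be illustrated without the DataHub library. The sketch below uses plain-Python stand-ins (`SnapshotEvent` and `AspectProposal` are hypothetical names, not DataHub classes) to show one multi-aspect event being split into single-aspect proposals. The aspect contents are made up.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class SnapshotEvent:
    """Stand-in for an MCE: one entity, several aspects at once."""

    urn: str
    aspects: Dict[str, dict]


@dataclass
class AspectProposal:
    """Stand-in for an MCP: one entity, exactly one aspect."""

    urn: str
    aspect_name: str
    aspect: dict


def split_into_proposals(event: SnapshotEvent) -> List[AspectProposal]:
    """Emit one single-aspect proposal per aspect carried by the event."""
    return [
        AspectProposal(event.urn, name, value)
        for name, value in event.aspects.items()
    ]


event = SnapshotEvent(
    urn="urn:li:dashboard:(looker,dashboards.13)",
    aspects={
        "dashboardInfo": {"title": "Example dashboard"},
        "status": {"removed": False},
    },
)
proposals = split_into_proposals(event)
print([p.aspect_name for p in proposals])
```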
I missed this before but there is an existing source for ingesting PowerBI
https://datahubproject.io/docs/generated/ingestion/sources/powerbi/#starter-recipe
Catalogue justice data visualisations
Summary of spike questions
Q. How much can we automate/scrape public-facing external apps?
A. Partially.
Q. Or manually register the thing?
A. Manual registration seems feasible for the dashboards themselves, less so for the individual charts that make up a dashboard, because that metadata is difficult to produce and will easily become stale. In some cases it's not even obvious what would constitute a chart, or the charts are not directly linkable.
[Mat's opinion] The more we rely on manual registration, the more maintenance is needed to remove stale metadata. So it's a bad idea to manually register a lot of dashboards in the catalogue without any identified owner - it will just become a liability for us.
Q. How does this sit in the structure/model in DataHub?
A. A dashboard has many charts. Charts may link to datasets. When we talk about bringing in charts and dashboards, we should also consider bringing in the data backing the charts as distinct datasets. This might be a CSV, an Excel file, some dataset internal to the dashboard software, etc.
Q. How does this affect the frontend?
A. We will need to expose the Dashboard entity type in the frontend, and write code to present it in a sensible way. We need to decide whether to also surface Chart entities. We should consider whether to include a filter for entity type.
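A minimal sketch of an entity-type filter, assuming search results are available as a list of URNs and that the entity type is the third colon-separated segment, as in the URNs linked earlier in this thread. The function names and the sample results are hypothetical.

```python
from typing import Iterable, List, Set


def entity_type(urn: str) -> str:
    """Return the entity type segment of a URN such as 'urn:li:chart:(...)'."""
    parts = urn.split(":", 3)
    if len(parts) < 4 or parts[:2] != ["urn", "li"]:
        raise ValueError(f"not a DataHub-style URN: {urn!r}")
    return parts[2]


def filter_by_type(urns: Iterable[str], wanted: Set[str]) -> List[str]:
    """Keep only the URNs whose entity type is in the wanted set."""
    return [urn for urn in urns if entity_type(urn) in wanted]


# Made-up search results mixing three entity types.
results = [
    "urn:li:dashboard:(looker,dashboards.13)",
    "urn:li:chart:(looker,example-chart)",
    "urn:li:dataset:(urn:li:dataPlatform:looker,example_view,PROD)",
]
print(filter_by_type(results, {"dashboard", "chart"}))
```

Whether Chart is offered as a filter option would follow from the decision on surfacing Chart entities at all.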