Spike: present a dashboard in the catalogue

seanprivett commented 10 months ago

Catalogue justice data visualisations

Summary of spike questions

Q. How much can we automate/scrape public-facing external apps?

A. Partially.

There is an ingestion source for PowerBI. This will cover some of the dashboards mentioned here, but it is as yet untested. It may be hard to filter this down to actually useful dashboards, rather than everything anyone has ever created in PowerBI. Worth exploring further.
We can build a custom source that pulls from the Justice Data API (proof of concept). This requires custom code to be written, but it's doable thanks to the metadata already being exposed via the API.
We can ingest dashboard-level metadata via the existing YAML created for the data science asset register. This does not include chart-level information, though.
In general, we want to avoid having to maintain lots of bespoke ingestion code based on formats dictated by other teams, as it's not feasible for us to understand the whole estate. Long term it would be better to require the technical data owners to hit an API we provide, or provide data to us in a format we define.
The other public dashboards are not easily scrapeable at the moment.
We could build a custom source ourselves for AWS QuickSight, perhaps with help from the community. A few people have asked about this on the DataHub Slack, but as yet nobody has built one. We could also vote for it on the roadmap.

Q. Or manually register the thing?

A. Manual registration seems feasible for dashboards themselves, less so for the individual charts that make up a dashboard, because that metadata is difficult to produce and will easily become stale. In some cases it's not even obvious what would constitute a chart, or the charts are not directly linkable.

[Mat's opinion] The more we rely on manual registration, the more maintenance is needed to remove stale metadata. So it's a bad idea to manually register a lot of dashboards in the catalogue without any identified owner - it will just become a liability for us.

Q. How does this sit in the structure/model in DataHub?

A. A Dashboard has many Charts. Charts may link to Datasets. When we talk about bringing in charts and dashboards, we should also consider bringing in the data backing the charts as distinct datasets. This might be a CSV, an Excel file, some dataset internal to the dashboard software, etc.
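As a rough illustration of that model, the sketch below represents a dashboard, its charts, and the backing datasets as plain Python dataclasses. The dashboard and chart URN shapes mirror the ones visible in the DataHub demo links in this thread. The platform names, ids, and helper functions are made up for illustration, and real ingestion code would use DataHub's own builders rather than string formatting.

```python
from dataclasses import dataclass, field
from typing import List


def dashboard_urn(platform: str, dashboard_id: str) -> str:
    return f"urn:li:dashboard:({platform},{dashboard_id})"


def chart_urn(platform: str, chart_id: str) -> str:
    return f"urn:li:chart:({platform},{chart_id})"


def dataset_urn(platform: str, name: str, env: str = "PROD") -> str:
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"


@dataclass
class Chart:
    urn: str
    title: str
    inputs: List[str] = field(default_factory=list)  # URNs of backing datasets


@dataclass
class Dashboard:
    urn: str
    title: str
    charts: List[Chart] = field(default_factory=list)

    def upstream_datasets(self) -> List[str]:
        """All distinct dataset URNs feeding any chart on this dashboard."""
        return sorted({urn for chart in self.charts for urn in chart.inputs})


# Hypothetical example: one chart backed by a published CSV.
backing_csv = dataset_urn("file", "example_measure.csv")
chart = Chart(chart_urn("justice-data", "example-measure"), "Example measure", [backing_csv])
dashboard = Dashboard(
    dashboard_urn("justice-data", "example-dashboard"), "Example dashboard", [chart]
)
```

Treating the backing CSV as its own dataset is what lets lineage run from dataset to chart to dashboard.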

Q. How does this affect the frontend?

A. We will need to expose the Dashboard entity type in the frontend, and write code to present it in a sensible way. We need to decide whether to also surface Chart entities. We should consider whether to include a filter for entity type.

murdo-moj commented 10 months ago

https://demo.datahubproject.io/dashboard/urn:li:dashboard:(looker,dashboards.13)/Charts?is_lineage_mode=false DataHub has a dashboard entity, but only with integration support for Looker. QuickSight integration would be manual.

A dashboard here is a collection of charts, so the model is more granular than just the dashboard.

alex-vonfeldmann commented 9 months ago

This ticket is about registering any end products, not just QuickSight dashboards, in the catalogue.

Possible routes to start are:

Catalogue MoJ public dashboards that do not require any authentication. There are a few examples that I am aware of. They are not 'simple' dashboards for MoJ staff use, but products intended for public consumption, for example:

CJS dashboard https://criminal-justice-delivery-data-dashboards.justice.gov.uk/
Legal aid statistics data visualisation tools https://www.gov.uk/government/statistics/a-guide-to-legal-aid-statistics-in-england-and-wales/legal-aid-statistics-data-visualisation-tools
Family Court Data Visualisation Tools https://www.gov.uk/government/statistics/family-court-statistics-quarterly-july-to-september-2023/family-court-data-visualisation-tools
Justice in Numbers https://data.justice.gov.uk/justice-in-numbers
Judicial Review Data Visualisation Tool https://www.gov.uk/government/statistics/civil-justice-statistics-quarterly-january-to-march-2022/judicial-review-data-visualisation-tool

Catalogue live R-Shiny apps using AP data that have been catalogued already as part of migrating them to the cloud platform. The spreadsheet lists 91 apps, of which 2 are blocked for migration, so we know of 89 live apps currently.

App migration spreadsheet https://docs.google.com/spreadsheets/d/1plYjK7UXCQkbTjgdhcZy-8qatdCLMFKaA2qgxHBzoaM/edit#gid=1709044552
App migration Trello board https://trello.com/b/nUnHWXDe/data-platform-apps-

Catalogue live QuickSight dashboards (using AP data). We don't currently know how many live QuickSight dashboards exist. @julialawrence has identified a tool that might help surface this: https://community.amazonquicksight.com/t/measure-the-adoption-of-your-amazon-quicksight-dashboards-and-view-your-bi-portfolio-in-a-single-pane-of-glass/7198

alex-vonfeldmann commented 9 months ago

Comment from Calum on the above: stats and published MI themselves could be considered ‘public end products’. It just depends on the scope/definition.

MatMoore commented 9 months ago

Docs for dashboard and chart entities:

https://datahubproject.io/docs/generated/metamodel/entities/dashboard/ https://datahubproject.io/docs/generated/metamodel/entities/chart/

How to add a custom ingestion source: https://datahubproject.io/docs/how/add-custom-ingestion-source/

Example code for the Looker integration: https://github.com/datahub-project/datahub/blob/ed10a8d8cca3b17e982db6d14ea435833c5a87ea/metadata-ingestion/src/datahub/ingestion/source/looker/looker_source.py#L766

        chart_urn = builder.make_chart_urn(
            self.source_config.platform_name, dashboard_element.get_urn_element_id()
        )
        chart_snapshot = ChartSnapshot(
            urn=chart_urn,
            aspects=[Status(removed=False)],
        )

        chart_type = self._get_chart_type(dashboard_element)
        chart_info = ChartInfoClass(
            type=chart_type,
            description=dashboard_element.description or "",
            title=dashboard_element.title or "",
            lastModified=ChangeAuditStamps(),
            chartUrl=dashboard_element.url(self.source_config.external_base_url or ""),
            inputs=dashboard_element.get_view_urns(self.source_config),
            customProperties={
                "upstream_fields": ",".join(
                    sorted(set(field.name for field in dashboard_element.input_fields))
                )
                if dashboard_element.input_fields
                else ""
            },
        )
        chart_snapshot.aspects.append(chart_info)

        if dashboard and dashboard.folder_path is not None:
            browse_path = BrowsePathsClass(
                paths=[f"/looker/{dashboard.folder_path}/{dashboard.title}"]
            )
            chart_snapshot.aspects.append(browse_path)

        if dashboard is not None:
            ownership = self.get_ownership(dashboard)
            if ownership is not None:
                chart_snapshot.aspects.append(ownership)

        dashboard_urn = builder.make_dashboard_urn(
            self.source_config.platform_name, looker_dashboard.get_urn_dashboard_id()
        )
        dashboard_snapshot = DashboardSnapshot(
            urn=dashboard_urn,
            aspects=[],
        )

        dashboard_info = DashboardInfoClass(
            description=looker_dashboard.description or "",
            title=looker_dashboard.title,
            charts=chart_urns,
            lastModified=self._get_change_audit_stamps(looker_dashboard),
            dashboardUrl=looker_dashboard.url(self.source_config.external_base_url),
        )

        dashboard_snapshot.aspects.append(dashboard_info)
        if looker_dashboard.folder_path is not None:
            browse_path = BrowsePathsClass(
                paths=[f"/looker/{looker_dashboard.folder_path}"]
            )
            dashboard_snapshot.aspects.append(browse_path)

        ownership = self.get_ownership(looker_dashboard)
        if ownership is not None:
            dashboard_snapshot.aspects.append(ownership)

        dashboard_snapshot.aspects.append(Status(removed=looker_dashboard.is_deleted))

        dashboard_mce = MetadataChangeEvent(proposedSnapshot=dashboard_snapshot)

        proposals: List[Union[MetadataChangeEvent, MetadataChangeProposalWrapper]] = [
            dashboard_mce
        ]

        # If extracting embeds is enabled, produce an MCP for embed URL.
        if (
            self.source_config.extract_embed_urls
            and self.source_config.external_base_url
        ):
            proposals.append(
                create_embed_mcp(
                    dashboard_snapshot.urn,
                    looker_dashboard.embed_url(self.source_config.external_base_url),
                )
            )

Embedded Looker charts are demoed here: https://www.youtube.com/watch?v=33hxlg4YgCQ

To find out

[ ] The chart's inputs are URNs of datasets that feed the dashboard - do we have that information?
[ ] Can/should we embed our public dashboards/charts?

MatMoore commented 9 months ago

Criminal justice system delivery data dashboard

It's not obvious how we would map this to charts, because

There are map and chart builders that can generate different charts
Multiple charts are grouped together on one page https://criminal-justice-delivery-data-dashboards.justice.gov.uk/improving-timeliness/police#time_to_success-national--chart, but DataHub describes charts as "A single data vizualization derived from a Dataset."

Recommendation: catalogue each individual named chart, e.g. "Average days from police referring a case to the CPS and the CPS authorising a charge" and ignore the chart builder / map builder

The data sources are linked as Excel spreadsheets, so we don't have provenance tying these charts to datasets already catalogued.

Recommendation: let's leave the inputs blank for now, and worry about linking this to sources later

Github: https://github.com/ministryofjustice/cjs_scorecard_exploratory_analysis

Dashboard metadata can be taken from this yaml:

Name: "Criminal justice system (CJS) delivery data dashboard"
Category: "App"
Description: "A public facing dashboard which brings together and visualises a range of criminal justice data.
It gives an overview of the justice system; from the point a crime is recorded by the police, to when a case is completed in court.
The data within the dashboard is updated on a quarterly basis in line with stats publications."
Impact: "The CJS dashboard increases transparency and understanding of the justice system.
It has around 400 monthly users (mainly Local Criminal Justice Boards) and at a local level, the Dashboard presents a comprehensive
cross-system view of issues to stakeholders and encourages proactive and collaborative decision-making grounded in evidence-based
practice (a quote taken from the Rape Review progress update report published in July 2023)."
G6 lead: "Kim Brett"
SRO: "Ed Lidington (analytical sign-off = Damon Wingfield)"
Technical lead: "Laura Knowles"
Business lead: " Calum Barnett (Service Owner)"
Last review date: "Nov-23"
Next review date: "Feb-24"
Outage Impact: "Red"
Maintenance (FTE): "2"
Documentation: "https://justiceuk.sharepoint.com/sites/CJSScorecard/Shared%20Documents/Forms/AllItems.aspx"
Contact: "laura.knowles1@justice.gov.uk"

(This already populates the data science asset register)

As far as I can tell there is no obvious API that will give us chart metadata directly.

Chart content could perhaps be scraped from https://github.com/ministryofjustice/cjs_scorecard_exploratory_analysis/tree/develop/cjs_test_app/content

Recommendation: Compile the metadata into a YAML format and write a source that pulls it from a public GitHub URL. We can maintain the YAML ourselves for now.
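A minimal sketch of what that source might do once the YAML has been fetched and parsed into a dict (for example with PyYAML's `yaml.safe_load`, not shown here). The `Name`, `Description` and `Contact` keys come from the asset register YAML above. The `Charts` key, the `cjs-dashboard` platform name and the slug-based ids are assumptions about a format we would define ourselves.

```python
import re
from typing import Any, Dict, List

PLATFORM = "cjs-dashboard"  # hypothetical platform name


def slugify(text: str) -> str:
    """Lowercase the text and join alphanumeric runs with hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def records_from_entry(entry: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Turn one parsed YAML entry into a dashboard record plus chart records."""
    charts = [
        {
            "urn": f"urn:li:chart:({PLATFORM},{slugify(title)})",
            "title": title,
            "inputs": [],  # left blank for now; link to sources later
        }
        for title in entry.get("Charts", [])  # "Charts" is an assumed key
    ]
    dashboard = {
        "urn": f"urn:li:dashboard:({PLATFORM},{slugify(entry['Name'])})",
        "title": entry["Name"],
        # Collapse the line breaks in the multi-line YAML description.
        "description": " ".join(entry.get("Description", "").split()),
        "owner": entry.get("Contact"),
        "charts": [chart["urn"] for chart in charts],
    }
    return [dashboard, *charts]


entry = {
    "Name": "Criminal justice system (CJS) delivery data dashboard",
    "Description": "A public facing dashboard which brings together\nand visualises a range of criminal justice data.",
    "Contact": "laura.knowles1@justice.gov.uk",
    "Charts": [
        "Average days from police referring a case to the CPS and the CPS authorising a charge",
    ],
}
records = records_from_entry(entry)
```

Deriving chart ids from titles keeps the YAML short, at the cost of the URN changing if a chart is renamed. An explicit id per chart would avoid that.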

MatMoore commented 9 months ago

Legal aid statistics

This page contains a series of PowerBI dashboards.

Can't link to individual charts, just pages.

There is a short description on the webpage, but it's unclear who owns them.

Can trace sources to published CSV datasets, but not back to datasets we already catalogue.

Recommendation: Catalogue at the dashboard level?

Family court visualisation tools

2 more PowerBI dashboards

Justice data

This is another one where we have named charts. One measure = one chart in datahub?

Github: https://github.com/ministryofjustice/justice-data

This has an API: https://data.justice.gov.uk/api So probably the easiest source to get started with.

There is also a publications API we can use for lineage https://data.justice.gov.uk/api/publications - however there is no point ingesting this until we have the actual publications catalogued.

Recommendation: Build some kind of connector that pulls from the justice data api. Leave lineage for later.
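A sketch of the mapping step such a connector might need, following the "one measure = one chart" idea. I haven't checked the actual response shape of the API, so the nested `id` / `name` / `description` / `children` structure below is purely an assumption, and the sample data is invented. Fetching the JSON is left out.

```python
from typing import Any, Dict, Iterator, List, Tuple

PLATFORM = "justice-data"  # hypothetical platform name


def iter_measures(
    node: Dict[str, Any], path: Tuple[str, ...] = ()
) -> Iterator[Tuple[Tuple[str, ...], Dict[str, Any]]]:
    """Depth-first walk that yields (id path, node) for every leaf measure."""
    here = path + (node["id"],)
    children = node.get("children") or []
    if not children:
        yield here, node
    for child in children:
        yield from iter_measures(child, here)


def chart_records(root: Dict[str, Any]) -> List[Dict[str, Any]]:
    """One chart record per leaf measure; inputs stay empty until lineage is tackled."""
    records = []
    for path, measure in iter_measures(root):
        chart_id = ".".join(path)
        records.append(
            {
                "urn": f"urn:li:chart:({PLATFORM},{chart_id})",
                "title": measure["name"],
                "description": measure.get("description", ""),
                "inputs": [],
            }
        )
    return records


# Invented sample in the assumed shape, not real API output.
sample = {
    "id": "prisons",
    "name": "Prisons",
    "children": [
        {"id": "population", "name": "Prison population"},
        {
            "id": "safety",
            "name": "Safety",
            "children": [{"id": "assaults", "name": "Assaults"}],
        },
    ],
}
records = chart_records(sample)
```

Only the leaves become charts here. Whether intermediate groupings should become dashboards is a separate decision.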

MatMoore commented 8 months ago

Judicial review interactive data tool

Possible charts:

Topics of judicial reviews
Progression of judicial reviews

However the exact name of the chart varies depending on what filters are selected.

This dashboard highlights another issue with lineage. For end products like this, historical data may come from a different source than current data: "Source: MoJ COINS database until 2019 and MoJ CE-file database from 2020 onwards"

In DataHub we can associate multiple datasets with a chart, but we can't represent this time-based distinction: https://datahubproject.io/docs/generated/metamodel/entities/chart/#outgoing
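To make the limitation concrete, here is a sketch of what we could record for such a chart, using a plain dict in place of the real chart aspect. The dataset platform and table names are hypothetical. The only place the "until 2019 / from 2020" split can go is a free-text custom property, which is a possible stopgap rather than something the model understands.

```python
def dataset_urn(platform: str, name: str, env: str = "PROD") -> str:
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"


# Hypothetical identifiers for the two source systems.
coins = dataset_urn("coins", "judicial_reviews")
ce_file = dataset_urn("ce-file", "judicial_reviews")

chart = {
    "title": "Topics of judicial reviews",
    # A flat list: nothing here says which years each dataset covers.
    "inputs": [coins, ce_file],
    # Free-text note copied from the dashboard's own source statement.
    "customProperties": {
        "source_note": "MoJ COINS database until 2019 and MoJ CE-file database from 2020 onwards",
    },
}
```

Lineage views would show both datasets as equal upstreams, so anyone needing the time split would have to read the note.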

MatMoore commented 8 months ago

What does the looker integration look like?

https://demo.datahubproject.io/search?filter_platform=urn:li:dataPlatform:looker

Dashboards and charts link to each other

https://demo.datahubproject.io/dashboard/urn:li:dashboard:(looker,dashboards.thelook::web_analytics_overview)/Charts?is_lineage_mode=false

Charts are associated with a "Looker explore" dataset

https://demo.datahubproject.io/chart/urn:li:chart:(looker,dashboard_elements.06669917b85dd81ce6a67210981bf0f9)/Lineage?filter_degree___false___EQUAL___0=1&is_lineage_mode=false&page=1&unionType=0

This forms the lineage from dataset -> chart -> dashboard

Clicking through to the dataset you can see the schema

You can view charts from the catalogue

"View in looker" call to action
Charts may be embedded

MatMoore commented 8 months ago

Rough diagram of how the ingestion source is structured, using the looker source as an example.

We build up a sequence of MetadataChangeEvents or MetadataChangeProposalWrappers. The difference is that an MCE may change multiple aspects, but an MCP changes only one
We map those to MetadataWorkUnits
In the case of Looker, this processing is parallelised via a BackpressureAwareExecutor
Yield the MetadataWorkUnits
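The steps above can be sketched with stand-in types. `ChangeEvent` and `WorkUnit` below are simplified placeholders, not the real DataHub classes, and a plain `ThreadPoolExecutor` stands in for the BackpressureAwareExecutor. The input dicts are invented.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, List


@dataclass
class ChangeEvent:  # placeholder for an MCE or MCP
    urn: str
    aspects: Dict[str, dict]


@dataclass
class WorkUnit:  # placeholder for MetadataWorkUnit
    id: str
    event: ChangeEvent


def build_events(dashboard: dict) -> List[ChangeEvent]:
    """Step 1: one event per chart, then one for the dashboard itself."""
    events = [
        ChangeEvent(chart["urn"], {"chartInfo": {"title": chart["title"]}})
        for chart in dashboard["charts"]
    ]
    info = {
        "title": dashboard["title"],
        "charts": [chart["urn"] for chart in dashboard["charts"]],
    }
    events.append(ChangeEvent(dashboard["urn"], {"dashboardInfo": info}))
    return events


def get_workunits(dashboards: Iterable[dict]) -> Iterator[WorkUnit]:
    """Steps 2 to 4: build events in parallel, wrap each one, yield it."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        # pool.map preserves input order, so output stays deterministic.
        for events in pool.map(build_events, dashboards):
            for event in events:
                yield WorkUnit(id=event.urn, event=event)


dashboards = [
    {
        "urn": "urn:li:dashboard:(example,d1)",
        "title": "Example dashboard",
        "charts": [
            {"urn": "urn:li:chart:(example,c1)", "title": "First chart"},
            {"urn": "urn:li:chart:(example,c2)", "title": "Second chart"},
        ],
    }
]
units = list(get_workunits(dashboards))
```

Keeping the source a generator means the framework can start emitting metadata before every dashboard has been processed.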

https://github.com/acryldata/meta-world/blob/master/custom_sources/src/my-source/custom_ingestion_source.py is a simpler example that overrides get_workunits rather than get_workunits_internal. This bypasses some default behaviour to do with browse paths and lowercasing URNs.

It does, however, demonstrate the process of converting from MCEs to MWUs:

            item = MetadataChangeEvent.from_obj(obj)
            wu = MetadataWorkUnit("single_mce", mce=item)
            self.report.report_workunit(wu)

MatMoore commented 8 months ago

I missed this before but there is an existing source for ingesting PowerBI

https://datahubproject.io/docs/generated/ingestion/sources/powerbi/#starter-recipe

ministryofjustice / data-catalogue