ministryofjustice / find-moj-data

Find MOJ data service • This repository is defined and managed in Terraform
MIT License

✨ Spike: how to add data from HMCTS Purview into the Catalogue #701

Closed: YvanMOJdigital closed this issue 1 week ago

YvanMOJdigital commented 3 weeks ago

Describe the feature request.

Describe the potential process, poc if possible.

Describe the context.

Gustav Moller Heya Yvan. I had a chat with Jeremy + some colleagues from HMCTS yesterday about the MoJ Data Catalogue, and how to populate it with information from the HMCTS Purview Data Catalogue. At the end of the call we agreed to have another call in a month or so. Jeremy said the team should be able to present some idea around how to add data from HMCTS Purview into the MoJ Data Catalogue. Does that work for you? Next call would probably be 18th or 25th September

Notes

May be dependent on https://github.com/ministryofjustice/data-catalogue/issues/250 that adds Azure Data Storage (ADS) support for DataHub.

Value / Purpose

No response

User Types

No response

MatMoore commented 2 weeks ago

Purview API: https://learn.microsoft.com/en-us/rest/api/purview/

Seems like this is part of the Azure Resource Manager API and we can use https://github.com/AzureAD/microsoft-authentication-library-for-python to handle the auth.

Since we're running the ingestion non-interactively we'll need to use the client credentials flow.

Client registration steps are documented here: https://learn.microsoft.com/en-us/rest/api/azure/#client-registration
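Under the hood, the client credentials flow is a single POST to the tenant's token endpoint. A minimal sketch of the request we would need to build (the scope shown is the generic Purview data-plane resource, which is an assumption to verify against the registration docs above; MSAL would handle all of this, plus token caching, for us):

```python
def build_token_request(tenant_id: str, client_id: str, client_secret: str) -> tuple[str, dict]:
    """Return the token endpoint URL and form body for a client credentials grant."""
    # Standard Azure AD v2.0 token endpoint for the tenant
    url = f"https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/token"
    body = {
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
        # ".default" requests all statically-configured app permissions;
        # the Purview resource URI here is an assumption
        "scope": "https://purview.azure.net/.default",
    }
    return url, body

# Placeholder values; POSTing the body (form-encoded) returns JSON with an
# "access_token" field, sent to the Purview API as a Bearer token.
url, body = build_token_request("my-tenant-id", "my-client-id", "my-secret")
```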

MatMoore commented 2 weeks ago

Breaking this down a bit...

We have two options:

  1. ingest data from Purview itself
  2. ingest data from each data source that is configured in Purview

Also, each of these could be a push or a pull.

@murdo-moj is talking to HMCTS about how they are using Purview and what data sources they actually have. There is a risk they might not continue using Purview.

Next steps:

MatMoore commented 2 weeks ago

Can you export metadata from Purview?

Exporting metadata is only supported for Business Assets, which is a bit useless. https://techcommunity.microsoft.com/t5/security-compliance-and-identity/now-in-preview-export-your-business-assets-from-microsoft/ba-p/3859055

You can also export glossary terms.

Seems like the only way to get databases, tables, files, etc. is via their API: https://learn.microsoft.com/en-us/rest/api/purview/

MatMoore commented 2 weeks ago

What Datahub sources can connect to Azure out of the box?

LavMatt commented 2 weeks ago

Murdo and I met with the HMCTS devs responsible for their Purview catalogue.

We are getting Purview access for all our engineers, so we can test the API via our own users. However, I think we'll need another meeting to discuss the feasibility of setting up a client for us to access Purview, as there was some confusion around whether this is possible.

MatMoore commented 2 weeks ago

Is it possible to export Purview data via its API?

Yes, this is possible using its Discovery API to list entities.

Here's my proof of concept using the Azure SDK:

```python
from os import environ

from azure.identity import ClientSecretCredential
from azure.purview.catalog import PurviewCatalogClient

account_name = environ["ACCOUNT_NAME"]

# Service principal credentials from the client registration
credential = ClientSecretCredential(
    tenant_id=environ["AZURE_TENANT_ID"],
    client_id=environ["AZURE_CLIENT_ID"],
    client_secret=environ["AZURE_CLIENT_SECRET"],
)

client = PurviewCatalogClient(
    endpoint=f"https://{account_name}.purview.azure.com",
    credential=credential,
)

# An empty search request returns the first page of all entities
result = client.discovery.query({})

for item in result["value"]:
    print(
        f'{item.get("entityType", item.get("objectType"))}: {item["name"]} ({item.get("displayText", "")})'
    )
    print(item.get("description", "No description"))
    print("")
```

Full code and instructions here: https://github.com/MatMoore/azure-purview-experiments
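One caveat: the Discovery API pages its results, so the empty query above only returns the first batch. A sketch of how successive search request bodies could be generated for paging (`keywords`, `limit` and `offset` are the Discovery API's search request fields, but treat the exact shape as an assumption to verify against the API reference):

```python
def page_requests(page_size: int = 50, max_pages: int = 10):
    """Yield successive search request bodies for client.discovery.query()."""
    for page in range(max_pages):
        yield {
            "keywords": None,           # None = match everything
            "limit": page_size,         # results per page
            "offset": page * page_size, # skip results already fetched
        }

# Usage against a real client (stop when a page comes back empty):
# for body in page_requests():
#     result = client.discovery.query(body)
#     if not result["value"]:
#         break
```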

There are additional APIs for entity-level and lineage information. I haven't explored these, so if we move forward with this we will need to do a mapping exercise between Purview and Datahub metadata.

There are also separate Atlas APIs, which may be easier to work with; there is a separate Python library for these.

Something like this could be used as part of a one-off ingestion, a custom ingestion source for Datahub, or as an extractor tool that is decoupled from our service (related spike: https://github.com/ministryofjustice/data-catalogue-metadata/issues/14).
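To give a feel for the mapping exercise mentioned above, here is a hypothetical sketch of flattening one Discovery API result item into the kind of record an ingestion step would consume. The input keys are the ones the PoC reads; the output field names are purely illustrative, not the actual Datahub schema:

```python
def purview_item_to_record(item: dict) -> dict:
    """Map one Purview discovery result to a flat, catalogue-ready record.

    Output shape is illustrative only; a real mapping would target
    Datahub's metadata model (datasets, schemas, lineage aspects).
    """
    return {
        "platform": "purview",
        "type": item.get("entityType", item.get("objectType", "unknown")),
        "name": item["name"],
        "qualified_name": item.get("qualifiedName", item["name"]),
        "description": item.get("description", ""),
    }

# Example input mimicking a Discovery API result item (made-up values)
record = purview_item_to_record({
    "entityType": "azure_sql_table",
    "name": "claims",
    "qualifiedName": "mssql://example.database.windows.net/db/dbo/claims",
    "description": "Claims fact table",
})
```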

jemnery commented 2 weeks ago

I suppose the "keys to the kingdom" argument depends on whether our API secret would grant access to just metadata (acceptable risk) or if we're granted access to their Azure data platform.

From the above it looks like we're ruling out bypassing Purview and ingesting directly from Azure? I think that's sensible.

MatMoore commented 2 weeks ago

@jemnery I haven't completely ruled it out, but I definitely agree that ingesting from Purview is preferable to bypassing it, for those same reasons. When testing this I set up a token with just read access to Purview, so there was no way for that credential to alter anything or expose any data.

The advantage of ingesting directly from blob storage or SQL databases is that it's possible in Datahub out of the box, but it seems like that approach would become impossible to manage centrally as more data assets are added (both credential management and recipe configuration would need input from HMCTS).