ministryofjustice / find-moj-data

Find MOJ data service • This repository is defined and managed in Terraform
MIT License

✨ Spike: how to add data from HMCTS Purview into the Catalogue #701

Closed: YvanMOJdigital closed this issue 1 week ago

YvanMOJdigital commented 3 weeks ago

Describe the feature request.

Describe the potential process, poc if possible.

Describe the context.

Gustav Moller Heya Yvan. I had a chat with Jeremy + some colleagues from HMCTS yesterday about the MoJ Data Catalogue, and how to populate it with information from the HMCTS Purview Data Catalogue. At the end of the call we agreed to have another call in a month or so. Jeremy said the team should be able to present some idea around how to add data from HMCTS Purview into the MoJ Data Catalogue. Does that work for you? Next call would probably be 18th or 25th September

Notes

May be dependent on https://github.com/ministryofjustice/data-catalogue/issues/250 that adds Azure Data Storage (ADS) support for DataHub.

Value / Purpose

No response

User Types

No response

MatMoore commented 2 weeks ago

Purview API: https://learn.microsoft.com/en-us/rest/api/purview/

Seems like this is part of the Azure Resource Manager API and we can use https://github.com/AzureAD/microsoft-authentication-library-for-python to handle the auth.

Since we're running the ingestion non-interactively we'll need to use the client credentials flow.

Client registration steps are documented here: https://learn.microsoft.com/en-us/rest/api/azure/#client-registration
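Under the hood, the client credentials flow is a single POST to the tenant's token endpoint. A minimal sketch of the request we would need to build (the scope shown is the generic Purview data-plane resource, which is an assumption to verify against the registration docs above; MSAL would handle all of this, plus token caching, for us):

```python
def build_token_request(tenant_id: str, client_id: str, client_secret: str) -> tuple[str, dict]:
    """Return the token endpoint URL and form body for a client credentials grant."""
    # Standard Azure AD v2.0 token endpoint for the tenant
    url = f"https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/token"
    body = {
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
        # ".default" requests all statically-configured app permissions;
        # the Purview resource URI here is an assumption
        "scope": "https://purview.azure.net/.default",
    }
    return url, body

# Placeholder values; POSTing the body (form-encoded) returns JSON with an
# "access_token" field, sent to the Purview API as a Bearer token.
url, body = build_token_request("my-tenant-id", "my-client-id", "my-secret")
```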

MatMoore commented 2 weeks ago

Breaking this down a bit...

We have two options:

  1. ingest data from Purview itself
  2. ingest data from each data source that is configured in Purview

Also, each of these could be a push or a pull.

@murdo-moj is talking to HMCTS about how they are using Purview and what data sources they actually have. There is a risk they might not continue using Purview.

Next steps:

MatMoore commented 2 weeks ago

Can you export metadata from Purview?

Exporting metadata is only supported for Business Assets, which is a bit useless. https://techcommunity.microsoft.com/t5/security-compliance-and-identity/now-in-preview-export-your-business-assets-from-microsoft/ba-p/3859055

You can also export glossary terms.

Seems like the only way to get databases, tables, files, etc. is via their API: https://learn.microsoft.com/en-us/rest/api/purview/

MatMoore commented 2 weeks ago

What Datahub sources can connect to Azure out of the box?

LavMatt commented 2 weeks ago

Murdo and I met with the HMCTS devs responsible for their Purview catalogue.

We are getting Purview access for all our engineers, so we can test the API via our own users. However, I think we'll need another meeting to discuss the feasibility of setting up a client for us to access Purview, as there was some confusion around whether this is possible.

MatMoore commented 2 weeks ago

Is it possible to export Purview data via its API?

Yes, this is possible using its Discovery API to list entities.

Here's my proof of concept using the Azure SDK:

```python
from os import environ

from azure.identity import ClientSecretCredential
from azure.purview.catalog import PurviewCatalogClient

account_name = environ["ACCOUNT_NAME"]

# Service principal credentials from the client registration
credential = ClientSecretCredential(
    tenant_id=environ["AZURE_TENANT_ID"],
    client_id=environ["AZURE_CLIENT_ID"],
    client_secret=environ["AZURE_CLIENT_SECRET"],
)

client = PurviewCatalogClient(
    endpoint=f"https://{account_name}.purview.azure.com",
    credential=credential,
)

# An empty search request returns the first page of all entities
result = client.discovery.query({})

for item in result["value"]:
    print(
        f'{item.get("entityType", item.get("objectType"))}: {item["name"]} ({item.get("displayText", "")})'
    )
    print(item.get("description", "No description"))
    print("")
```

Full code and instructions here: https://github.com/MatMoore/azure-purview-experiments
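One caveat: the Discovery API pages its results, so the empty query above only returns the first batch. A sketch of how successive search request bodies could be generated for paging (`keywords`, `limit` and `offset` are the Discovery API's search request fields, but treat the exact shape as an assumption to verify against the API reference):

```python
def page_requests(page_size: int = 50, max_pages: int = 10):
    """Yield successive search request bodies for client.discovery.query()."""
    for page in range(max_pages):
        yield {
            "keywords": None,           # None = match everything
            "limit": page_size,         # results per page
            "offset": page * page_size, # skip results already fetched
        }

# Usage against a real client (stop when a page comes back empty):
# for body in page_requests():
#     result = client.discovery.query(body)
#     if not result["value"]:
#         break
```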

There are additional APIs for entity-level and lineage information. I haven't explored these, so if we move forward with this we will need to do a mapping exercise between Purview and Datahub metadata.

There are also separate Atlas APIs, which may be easier to work with; there is a separate Python library for these.

Something like this could be used as part of a one-off ingestion, a custom ingestion source for Datahub, or as an extractor tool that is decoupled from our service (related spike: https://github.com/ministryofjustice/data-catalogue-metadata/issues/14).
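To give a feel for the mapping exercise mentioned above, here is a hypothetical sketch of flattening one Discovery API result item into the kind of record an ingestion step would consume. The input keys are the ones the PoC reads; the output field names are purely illustrative, not the actual Datahub schema:

```python
def purview_item_to_record(item: dict) -> dict:
    """Map one Purview discovery result to a flat, catalogue-ready record.

    Output shape is illustrative only; a real mapping would target
    Datahub's metadata model (datasets, schemas, lineage aspects).
    """
    return {
        "platform": "purview",
        "type": item.get("entityType", item.get("objectType", "unknown")),
        "name": item["name"],
        "qualified_name": item.get("qualifiedName", item["name"]),
        "description": item.get("description", ""),
    }

# Example input mimicking a Discovery API result item (made-up values)
record = purview_item_to_record({
    "entityType": "azure_sql_table",
    "name": "claims",
    "qualifiedName": "mssql://example.database.windows.net/db/dbo/claims",
    "description": "Claims fact table",
})
```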

jemnery commented 2 weeks ago

I suppose the "keys to the kingdom" argument depends on whether our API secret would grant access to just metadata (acceptable risk) or if we're granted access to their Azure data platform.

From the above it looks like we're ruling out bypassing Purview and ingesting directly from Azure? I think that's sensible.

MatMoore commented 2 weeks ago

@jemnery I haven't completely ruled it out, but I definitely agree that ingesting from Purview is preferable to bypassing it, for those same reasons. When testing this I set up a token with just read access to Purview, so there was no way for that credential to alter anything or expose any data.

The advantage of ingesting directly from blob storage or SQL databases is that it's possible in Datahub out of the box, but it seems like that approach would become impossible to manage centrally as more data assets are added (both credential management and recipe configuration would need input from HMCTS).