YvanMOJdigital closed this issue 1 week ago
Purview API: https://learn.microsoft.com/en-us/rest/api/purview/
Seems like this is part of the Azure Resource Manager API and we can use https://github.com/AzureAD/microsoft-authentication-library-for-python to handle the auth.
Since we're running the ingestion non-interactively we'll need to use the client credentials flow.
Client registration steps are documented here: https://learn.microsoft.com/en-us/rest/api/azure/#client-registration
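For reference, here is a minimal sketch of what the client credentials flow looks like on the wire, using only the standard library. MSAL (linked above) wraps this exchange for us, so this is just to make the moving parts visible. The `PURVIEW_SCOPE` value and the helper names `build_token_request` and `fetch_access_token` are assumptions for illustration and would need checking against the HMCTS tenant.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumption: scope for the Purview data plane; confirm before relying on it.
PURVIEW_SCOPE = "https://purview.azure.net/.default"


def build_token_request(tenant_id, client_id, client_secret, scope=PURVIEW_SCOPE):
    """Build (but do not send) an OAuth 2.0 client credentials token request."""
    url = f"https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/token"
    body = urlencode(
        {
            "grant_type": "client_credentials",
            "client_id": client_id,
            "client_secret": client_secret,
            "scope": scope,
        }
    ).encode("ascii")
    return Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


def fetch_access_token(request):
    """Send the request and return the bearer token (needs network access)."""
    with urlopen(request) as response:
        return json.load(response)["access_token"]
```

In practice we would probably let MSAL or `azure-identity` do this, since they also handle token caching and renewal.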
Breaking this down a bit...
We have two options:
Also, each of these could be a push or a pull
@murdo-moj is talking to HMCTS about how they are using Purview and what data sources they actually have. There is a risk they might not continue using Purview.
Next steps:
Exporting metadata is only supported for Business Assets, which is a bit useless. https://techcommunity.microsoft.com/t5/security-compliance-and-identity/now-in-preview-export-your-business-assets-from-microsoft/ba-p/3859055
You can also export glossary terms.
Seems like the only way to get databases, tables, files, etc. is via their API: https://learn.microsoft.com/en-us/rest/api/purview/
Murdo and I met with the HMCTS devs responsible for their Purview catalogue.
We are getting Purview access for all our engineers, so we can test the API with our own user accounts. I think we'll need another meeting to discuss the feasibility of setting up a client for us to access Purview, as there was some confusion about whether this is possible.
Yes, this is possible using its Discovery API to list entities.
Here's my proof of concept using the Azure SDK:
```python
from os import environ

from azure.identity import ClientSecretCredential
from azure.purview.catalog import PurviewCatalogClient

account_name = environ["ACCOUNT_NAME"]

# Authenticate non-interactively as a service principal (client credentials flow).
credential = ClientSecretCredential(
    tenant_id=environ["AZURE_TENANT_ID"],
    client_id=environ["AZURE_CLIENT_ID"],
    client_secret=environ["AZURE_CLIENT_SECRET"],
)

client = PurviewCatalogClient(
    endpoint=f"https://{account_name}.purview.azure.com",
    credential=credential,
)

# An empty search request applies no keywords or filters.
result = client.discovery.query({})

for item in result["value"]:
    print(
        f'{item.get("entityType", item.get("objectType"))}: {item["name"]} ({item.get("displayText", "")})'
    )
    print(item.get("description", "No description"))
    print("")
```
Full code and instructions here: https://github.com/MatMoore/azure-purview-experiments
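A single Discovery query only returns one page of results, so a real ingestion would need to page through them. Below is a rough sketch, assuming the search request accepts `keywords`, `limit` and `offset` fields as in the SDK's documented request shape. Other API versions may page differently, and `iter_search_results` is a hypothetical helper, not part of the SDK.

```python
def iter_search_results(query, keywords="*", page_size=100):
    """Yield every item from a Discovery-style search, one page at a time.

    `query` is any callable that takes a request dict and returns a response
    dict with a "value" list, for example `client.discovery.query`.
    """
    offset = 0
    while True:
        # Assumption: the API honours "limit" and "offset" for paging.
        page = query({"keywords": keywords, "limit": page_size, "offset": offset})
        items = page.get("value", [])
        if not items:
            # An empty page means we have run out of results.
            return
        yield from items
        offset += len(items)
```

Usage would be something like `for item in iter_search_results(client.discovery.query): ...`. Taking the query function as a parameter also means the paging logic can be tested without a Purview account.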
There are additional APIs for entity-level and lineage information. I haven't explored these, so if we move forward with this we will need to do a mapping exercise between Purview and Datahub metadata.
There are also separate Atlas APIs, which may be easier to work with. There is a separate Python library for this.
Something like this could be used as part of a one-off ingestion, a custom ingestion source for Datahub, or as an extractor tool that is decoupled from our service (related spike: https://github.com/ministryofjustice/data-catalogue-metadata/issues/14).
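As a starting point for the mapping exercise mentioned above, here is a rough sketch that turns one Purview search result into a catalogue-neutral record. `CatalogueEntry`, `to_catalogue_entry`, the `purview` platform name and the use of `qualifiedName` as the unique key are all assumptions for illustration. Only the dataset URN layout follows Datahub's documented format, and the real field mapping still needs to be worked out.

```python
from dataclasses import dataclass


@dataclass
class CatalogueEntry:
    """Hypothetical, catalogue-neutral record; not a Datahub type."""

    urn: str
    name: str
    entity_type: str
    description: str


def to_catalogue_entry(item, platform="purview", env="PROD"):
    """Map one Purview search result (a dict) onto a CatalogueEntry."""
    # Assumption: qualifiedName is unique per asset; fall back to name.
    key = item.get("qualifiedName", item["name"])
    return CatalogueEntry(
        urn=f"urn:li:dataset:(urn:li:dataPlatform:{platform},{key},{env})",
        name=item.get("displayText") or item["name"],
        entity_type=item.get("entityType", item.get("objectType", "unknown")),
        description=item.get("description") or "",
    )
```

Keeping the intermediate record separate from Datahub would suit the decoupled extractor option, since the same output could feed a one-off load or a custom ingestion source.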
I suppose the "keys to the kingdom" argument depends on whether our API secret would grant access to just metadata (acceptable risk) or if we're granted access to their Azure data platform.
From the above it looks like we're ruling out bypassing Purview and ingesting from Azure? I think that's sensible, so that
@jemnery I haven't completely ruled it out, but I definitely agree that ingesting from Purview is preferable to bypassing it, for those same reasons. When testing this I set up a token with just read access to Purview, so there was no way for that credential to alter anything or expose any data.
The advantage of ingesting directly from blob storage or SQL databases is that it's possible in Datahub out of the box, but that approach seems like it would become impossible to manage centrally as more data assets are added (both credential management and recipe configuration would need input from HMCTS).
Describe the feature request.
Describe the potential process, poc if possible.
Describe the context.
Gustav Moller Heya Yvan. I had a chat with Jeremy + some colleagues from HMCTS yesterday about the MoJ Data Catalogue, and how to populate it with information from the HMCTS Purview Data Catalogue. At the end of the call we agreed to have another call in a month or so. Jeremy said the team should be able to present some idea around how to add data from HMCTS Purview into the MoJ Data Catalogue. Does that work for you? Next call would probably be 18th or 25th September
Notes
May be dependent on https://github.com/ministryofjustice/data-catalogue/issues/250 that adds Azure Data Storage (ADS) support for DataHub.
Value / Purpose
No response
User Types
No response