Closed LavMatt closed 1 month ago
Session with CaDeT users to explore this further:
- Where does it fit into their workflow?
- Is this just a PR we raise for them to approve?
- How do we build verification into CI/CD?
- Review metadata spec with CaDeT users
- Where would you put the catalogue opt-in flag?
This sheet shows the items highlighted in green which we are asking CaDeT users to enter manually. These should be the only metadata items we need to add into the workflow
I've spoken to Ben Waterfield re. the approaches and they're happy to be led by what is best for the catalogue ingestion (within reason, for ease of implementation within CaDeT). So we need to figure out how well each integrates and present findings back to the CaDeT team.
I agreed with Ben that we could create some tables in dbt to test the different approaches and their integration with DataHub's dbt ingestion.
Minimum metadata to integrate into the workflow:
For access requirements we are just going to show a standard boilerplate for now, redirecting users to the data engineering access repo.
We also need descriptions to be populated properly, but this is already handled by the dbt ingestion.
An example of what this could look like in CaDeT, if we use the `meta` block:

```yaml
meta:
  show_in_catalogue: true
  owner_email: person@justice.gov.uk
  contact:
    - type: slack
      channel: '#abc'
```
Things to explore
The groups example (https://docs.getdbt.com/docs/build/groups) looks like this:

```yaml
groups:
  - name: finance
    owner:
      # 'name' or 'email' is required; additional properties allowed
      email: finance@jaffleshop.com
      slack: finance-data
      github: finance-data-team
```
This conflates the data owner/steward/custodian roles, and I'm not sure how it will be parsed by DataHub, so I'm not sure about this approach yet. We can try out both the groups and `meta` block approaches though.
`whereToAccessDataset` - should be set to Analytical Platform
Existing groups: https://github.com/moj-analytical-services/create-a-derived-table/blob/d3999ee10bb63a160e8a14a41c8e2f5a3b1f746b/mojap_derived_tables/models/groups.yml#L4
These use Slack IDs, so we would need to translate them into human-readable channel names.
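As an illustration, the translation could be a simple lookup once the ID-to-name mapping is available (in practice it would be fetched via Slack's `conversations.info` API; the hard-coded mapping and the `channel_name` helper below are illustrative, not existing code):

```python
# Sketch: translate Slack channel IDs (as found in groups.yml) into
# human-readable channel names. In practice the mapping would be built
# by calling Slack's conversations.info API with a bot token; here it
# is hard-coded for illustration, using the IDs listed later in this issue.

KNOWN_CHANNELS = {
    "C01H895893K": "data-engineering-alerts-prod",
    "C062BRE6LG3": "hmpps-data-notifications",
    "C04T41W0M8B": "opg-deds",
}


def channel_name(channel_id: str, id_to_name: dict = KNOWN_CHANNELS) -> str:
    """Return the '#name' form for a channel ID, or the raw ID if unknown."""
    name = id_to_name.get(channel_id)
    return f"#{name}" if name else channel_id
```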
Configs can be inherited in DBT, so perhaps we can apply them at the domain level
Throwaway code for testing locally: https://github.com/moj-analytical-services/create-a-derived-table/compare/datahub-test
https://datahubproject.io/docs/generated/ingestion/sources/dbt/#dbt-meta-automated-mappings
The dbt source's meta mapping config does not support matching nested properties, e.g. matching `contact.channel` here:

```yaml
contact:
  - type: slack
    channel: '#abc'
```

`OperationProcessor` has an arg for this (`match_nested_props: bool = False`), but it was added for Kafka and not backported to the dbt source, see https://github.com/datahub-project/datahub/pull/8825

This can be worked around by not using nested yaml, e.g.

```yaml
slack_contact: '#abc'
teams_contact: '#def'
```
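With flattened keys like these, a meta mapping entry in the ingestion recipe could match them directly. A sketch, assuming we turn the channel into a tag via the `add_tag` operation with `{{ $match }}` templating; the `slack_contact` field name is our own convention, not a DataHub built-in:

```yaml
meta_mapping:
  slack_contact:
    match: '.*'
    operation: 'add_tag'
    config:
      tag: 'slack_contact_{{ $match }}'
```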
There is no `add_property` operation. It might be worth looking into how much work it would be to add one upstream.
There IS an `add_owner` operation. It treats the value as an ID and passes it through `mce_builder.make_owner_urn`. If we set `strip_user_ids_from_email` it will strip off the `@justice.gov.uk` part of the email.

Does it create the user if it doesn't exist? No!
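A rough sketch of the resulting URN construction (illustrative only; the real logic lives in DataHub's `mce_builder.make_owner_urn`, and `make_owner_urn_from_email` is a made-up helper name):

```python
# Sketch of how strip_user_ids_from_email plus make_owner_urn turn an
# email address into a DataHub corpuser owner URN. This mirrors the
# behaviour described above; it is not the actual DataHub implementation.

def make_owner_urn_from_email(email: str, strip_user_ids_from_email: bool = True) -> str:
    owner_id = email.split("@")[0] if strip_user_ids_from_email else email
    return f"urn:li:corpuser:{owner_id}"
```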
Conclusions:
It seems that in addition to the explicit mapping, you can also set metadata via a `datahub` key like this:

```yaml
meta:
  datahub:
    owners:
      - 'urn:...'
```

and then it will be automatically picked up by the meta mapping functionality.
Group information doesn't seem to be picked up by DataHub. It's not propagated to the custom properties, and datasets ingested with a group set do not have an owner.
This information is not present in the catalogue.json. It is in the manifest.json, but there are separate objects for the groups, e.g. "group.mojap_derived_tables.risk", and I see no reference to groups in the code of the dbt ingestion source.
This seems like a dead end, and we will need to make use of `meta` and/or `tags` instead, even if it means duplicating some metadata 🙁
`add_owner` doesn't seem to work in my initial test, so I need to debug this.
Maybe of interest - this can be used to enforce documentation coverage in dbt https://github.com/tnightengale/dbt-meta-testing
dbt docs is picking up the meta fields when they are in model files like `mojap_derived_tables/models/courts/common_platform_derived/common_platform_derived__all_offence_fct.yml`, but not when they're within the `models` mapping in `dbt_project.yaml`. I think I might need to use `+meta` instead of `meta`.
Recommendation for `dbt_project.yaml`:

```yaml
general:
  +meta:
    dc_slack_channel: '#data-engineering'
    dc_owner_id: FirstName.LastName
    dc_where_to_access_dataset: AnalyticalPlatform
```
Recommendation for model files: add a `dc_display_in_catalogue` tag to final models, including dimension and fact tables.

Add the following to the DataHub dbt ingestion recipe:

```yaml
strip_user_ids_from_email: true
tag_prefix: ""
meta_mapping:
  owner_email:
    match: '.*'
    operation: 'add_owner'
    config:
      owner_type: user
      owner_category: DATAOWNER
```
Note: anything in meta by default gets included in the custom properties.
Looking into what it would take to add a custom property operation to meta mapping. I'm checking in with the DataHub Slack before attempting this. If we don't implement this, then a possible workaround would be to create a unique tag for each slack channel, e.g. `dc:slack-channel:data-catalogue`.
`OperationProcessor.process` returns an aspect map - a dict mapping operation name to a resulting aspect definition. We would need to add a new operation type and allow `DatasetProperties` aspect values in the map.
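To make the idea concrete, here is a hypothetical sketch of an `add_custom_property` operation using plain dicts instead of DataHub's aspect classes; `process_custom_properties` is invented for illustration and simplifies the real `OperationProcessor` regex handling:

```python
import re

# Hypothetical sketch of an add_custom_property meta-mapping operation.
# Each mapping entry says: if the named meta field's value matches the
# regex, emit a custom property. The real implementation would build a
# DatasetProperties aspect rather than return a plain dict.

def process_custom_properties(meta: dict, meta_mapping: dict) -> dict:
    props = {}
    for field, rule in meta_mapping.items():
        value = meta.get(field)
        if value is not None and re.match(rule["match"], str(value)):
            if rule["operation"] == "add_custom_property":
                props[rule["config"]["property"]] = str(value)
    return props


# Example mapping: capture a '#channel' value as a custom property.
mapping = {
    "slack_contact": {
        "match": r"#.*",
        "operation": "add_custom_property",
        "config": {"property": "slack_contact"},
    }
}
```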
This is processed like this:

```python
meta_aspects: Dict[str, Any] = action_processor.process(node.meta)

aspects = self._generate_base_dbt_aspects(
    node, additional_custom_props_filtered, mce_platform, meta_aspects
)

dataset_snapshot = DatasetSnapshot(urn=node_datahub_urn, aspects=aspects)
mce = MetadataChangeEvent(proposedSnapshot=dataset_snapshot)
if self.config.write_semantics == "PATCH":
    mce = self.get_patched_mce(mce)
yield MetadataWorkUnit(id=dataset_snapshot.urn, mce=mce)
```
The other operations all have patch semantics, so maybe we should have an `add_custom_property` that works the same way? `get_patched_mce` has cases for owners, tags, and terms; presumably this needs to be extended if we want the same behaviour for custom properties.
`meta_mapping` complements the properties added in `_generate_base_dbt_aspects` - one of them needs to take precedence.

If there is no match on the meta field, then intuitively we might expect the property to be removed, but there is no mechanism for this if we follow patch semantics. (In our use case this wouldn't matter though.)
(There is an existing class CustomPropertiesPatchHelper, exposed via DatasetPatchBuilder, that may be relevant)
The other option is if the custom properties generated this way completely overwrite existing custom properties.
This would make it consistent with the way custom properties are handled already (the ones that are set by the ingestion by default)
This would also mean that properties added in previous runs don't stick around forever.
They would still need to preserve the properties added via _generate_base_dbt_aspects.
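Either way, the precedence question reduces to a dict merge. A sketch (names are illustrative) where meta-mapped properties win on key clashes but the base properties are preserved:

```python
# Sketch: merging custom properties from _generate_base_dbt_aspects with
# those produced by meta mapping. In a Python dict merge, later sources
# win on key clashes, so putting the meta-mapped props last gives them
# precedence while keeping all base properties that don't clash.

def merge_custom_properties(base_props: dict, meta_props: dict) -> dict:
    return {**base_props, **meta_props}
```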
Two more configs I haven't tried:

```python
enable_owner_extraction: bool = Field(
    default=True,
    description="When enabled, ownership info will be extracted from the dbt meta",
)
owner_extraction_pattern: Optional[str] = Field(
    default=None,
    description='Regex string to extract owner from the dbt node using the `(?P<name>...) syntax` of the [match object](https://docs.python.org/3/library/re.html#match-objects), where the group name must be `owner`. Examples: (1)`r"(?P<owner>(.*)): (\\w+) (\\w+)"` will extract `jdoe` as the owner from `"jdoe: John Doe"` (2) `r"@(?P<owner>(.*))"` will extract `alice` as the owner from `"@alice"`.',
)
```
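The two examples from the `owner_extraction_pattern` description, run through Python's `re` module to check the group extraction:

```python
import re

# The two example patterns from the owner_extraction_pattern docstring;
# the named group must be called 'owner'.

m1 = re.match(r"(?P<owner>(.*)): (\w+) (\w+)", "jdoe: John Doe")
m2 = re.match(r"@(?P<owner>(.*))", "@alice")
```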
This is overwritten by meta mapping in `_aggregate_owners`. I think we're better off using our own meta property though, because CaDeT currently uses `owners` to refer to something more akin to a data custodian or technical owner, whereas we want to capture a named individual who will be the business owner.
What does the `add_owner` meta mapping actually do? It creates the following aspect:

```python
owner_aspect = OwnershipClass(
    owners=[
        OwnerClass(
            owner=x.get("urn"),
            type=x.get("category"),
            typeUrn=x.get("categoryUrn"),
            source=(
                OwnershipSourceClass(type=self.owner_source_type)
                if self.owner_source_type
                else None
            ),
        )
        for x in sorted(
            operation_map[Constants.ADD_OWNER_OPERATION],
            key=lambda x: x["urn"],
        )
    ]
)
```
There's another example here that works in a similar way https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/examples/library/dataset_add_owner.py
Neither of these creates the `CorpUser` entity; it has to already exist.
Users can be created from YAML with the user CLI https://datahubproject.io/docs/cli/#entity-specific-commands
From alpha we also have this JSON import https://github.com/ministryofjustice/data-catalogue-metadata/blob/main/ownership/users.json
Recommendation: When adding a new owner, CaDeT users should do one of the following:
We could go with option 2 initially and then require option 1 later on.
How slack linking works
https://api.slack.com/docs/deep-linking
Either we ask for the channel name and redirect like so: https://api.slack.com/reference/deep-linking#app_channel

Or we ask for the name and ID and generate a slack:// link: https://api.slack.com/reference/deep-linking#open_a_channel

I think the latter assumes the user has a Slack client installed, so the former would be more reliable.
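Both link styles sketched in Python (URL formats follow the Slack deep-linking docs linked above; `TEAM_ID` is a placeholder, not a real workspace ID):

```python
# Sketch of the two Slack linking options discussed above.

TEAM_ID = "T00000000"  # placeholder; the real value is the workspace's team ID


def redirect_link(channel: str) -> str:
    """Web link that redirects into Slack; works without a client installed."""
    return f"https://slack.com/app_redirect?channel={channel}"


def deep_link(channel_id: str, team_id: str = TEAM_ID) -> str:
    """slack:// link; assumes the user has a Slack client installed."""
    return f"slack://channel?team={team_id}&id={channel_id}"
```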
We should ignore the IDs already in the groups.yaml - since these are for notifications rather than finding help.
There are 3 channels at the moment:

- data-engineering-alerts-prod (C01H895893K)
- hmpps-data-notifications (C062BRE6LG3)
- opg-deds (C04T41W0M8B)

But we should use the corresponding ask- channels for the catalogue, e.g. ask-data-engineering.
We will ignore MS teams for now as we're not aware of anyone using this for a contact channel.
Next steps: write up a runbook page documenting how this should be captured, then do a final check the metadata comes through as expected.
Runbook PR: https://github.com/ministryofjustice/data-catalogue-runbooks/pull/31
And this is a draft PR to add some initial metadata: https://github.com/moj-analytical-services/create-a-derived-table/pull/1568/files
Before merging that, we will need to speak to the data engineers about the `dc_display_in_catalogue` tag to make sure we're making the right stuff visible/not visible.
User Story
As a developer, I want the approach implemented in CaDeT for collecting catalogue-specific metadata to be the one best compatible with DataHub, so that ingestion from CaDeT (via the DataHub dbt ingestion) works as simply as possible.
Proposal
In a meeting with DMET members responsible for create-a-derived-table, a couple of different approaches to collecting catalogue-specific metadata in CaDeT were raised. These will need to be understood in more detail by following up with some or all of Soumaya Mauthoor, Ben Waterfield, Jacob Hamblin-Pike, and Phil Hall.
From memory the approaches were:
Once we have clarity on the above approaches, we could mock up some metadata to see how each integrates with the native DataHub dbt ingestion.
Definition of Done
Example