ministryofjustice / find-moj-data


Spike: Identify the best approach to collect cadet custom metadata (with respect to datahub's dbt ingestion) #363

Closed LavMatt closed 1 month ago

LavMatt commented 2 months ago

User Story

As a developer, I want the approach implemented in CaDeT for collecting catalogue-specific metadata to be the one best compatible with DataHub, so that ingestion from CaDeT (via DataHub's dbt ingestion) works as simply as possible.

Proposal

In a meeting with the DMET members responsible for create-a-derived-table, a couple of different approaches to collecting catalogue-specific metadata in CaDeT were discussed. These will need to be understood in more detail by following up with some or all of Soumaya Mauthoor, Ben Waterfield, Jacob Hamblin-Pike and Phil Hall.

From memory the approaches were:

  1. Use dbt groups to collect and organise some of this metadata
  2. Rely more fully on a model tagging strategy

Once the above approaches have been clarified, we could mock up some metadata to see how each integrates with the native DataHub dbt ingestion.

Definition of Done

Example

seanprivett commented 2 months ago

Session with CaDeT users to explore this further:

  - Where does it fit into their workflow?
  - Is this just a PR we raise for them to approve?
  - How do we build verification into CI/CD?
  - Review the metadata spec with CaDeT users
  - Where would you put the catalogue opt-in flag?

This sheet shows the items, highlighted in green, which we are asking CaDeT users to enter manually. These should be the only metadata items we need to add into the workflow.

LavMatt commented 2 months ago

I've spoken to Ben Waterfield about the approaches and they're happy to be led by whatever is best for the catalogue ingestion (within reason, for ease of implementation within CaDeT). So we need to figure out how well each integrates and present the findings back to the CaDeT team.

I agreed with Ben that we could create some tables in dbt to test the different approaches and their integration with DataHub's dbt ingestion.

MatMoore commented 2 months ago

Minimum metadata to integrate into the workflow:

For access requirements we are just going to show a standard boilerplate for now, redirecting users to the data engineering access repo.

We also need descriptions to be populated properly, but this is already handled by the dbt ingestion.

See https://docs.google.com/spreadsheets/d/1O1EjO96lyqDbyuzFWU8SNBky3b7tKM8dccZVS7-_t9g/edit#gid=1846443805

An example of what this could look like in CaDeT, if we use the meta block:

meta:
  show_in_catalogue: true
  owner_email: person@justice.gov.uk
  contact:
    - type: slack
      channel: '#abc'

Things to explore

MatMoore commented 2 months ago

Groups example looks like this

groups:
  - name: finance
    owner:
      # 'name' or 'email' is required; additional properties allowed
      email: finance@jaffleshop.com
      slack: finance-data
      github: finance-data-team

https://docs.getdbt.com/docs/build/groups

This conflates the data owner/steward/custodian roles, and I'm not sure how it will be parsed by DataHub, so I'm unsure about this approach yet.

Can try both groups and meta block approaches out though.

MatMoore commented 2 months ago

whereToAccessDataset - should be set to Analytical Platform

MatMoore commented 2 months ago

Existing groups: https://github.com/moj-analytical-services/create-a-derived-table/blob/d3999ee10bb63a160e8a14a41c8e2f5a3b1f746b/mojap_derived_tables/models/groups.yml#L4

These use Slack IDs, so we would need to translate them into human-readable channel names.

MatMoore commented 2 months ago

Configs can be inherited in dbt, so perhaps we can apply them at the domain level.

https://docs.getdbt.com/reference/configs-and-properties
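
For example, a +meta block set on a subdirectory in dbt_project.yml should cascade to every model underneath it. A minimal sketch (the directory and property values here are illustrative, not taken from CaDeT):

models:
  mojap_derived_tables:
    courts:
      +meta:
        dc_slack_channel: '#ask-data-engineering'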

MatMoore commented 2 months ago

Throwaway code for testing locally: https://github.com/moj-analytical-services/create-a-derived-table/compare/datahub-test

MatMoore commented 2 months ago

Meta mapping findings

https://datahubproject.io/docs/generated/ingestion/sources/dbt/#dbt-meta-automated-mappings

No nested matches

The DBT source's meta mapping config does not support matching nested properties, e.g. matching contact.channel here

  contact:
    - type: slack
      channel: '#abc'

OperationProcessor has an argument for this (match_nested_props: bool = False), but it was added for Kafka and not backported to the dbt source; see https://github.com/datahub-project/datahub/pull/8825

This can be worked around by not using nested yaml, e.g.

slack_contact: '#abc'
teams_contact: '#def'
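
The flattened keys can then be matched directly in the ingestion recipe's meta_mapping. A sketch using the documented match/operation/config fields (the tag name is illustrative):

meta_mapping:
  slack_contact:
    match: '#abc'
    operation: 'add_tag'
    config:
      tag: 'slack-abc'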

add_property operation is missing

There is no add_property operation.

It might be worth looking into how much work this would be to add upstream.

add_owner operation

There IS an add_owner operation. It treats the value as an ID and passes it through mce_builder.make_owner_urn

If we set strip_user_ids_from_email it will strip off the @justice.gov.uk part of the email.

Does it create the user if it doesn't exist?

No!

Conclusions:

It seems that in addition to the explicit mapping, you can also set metadata via a datahub key like this

meta:
  datahub:
    owners:
      - 'urn:...'

and then it will be automatically picked up by the meta mapping functionality.

MatMoore commented 2 months ago

Group findings

Group information doesn't seem to be picked up by datahub. It's not propagated to the custom properties, and datasets ingested with a group set do not have an owner.

This information is not present in the catalogue.json. It is in the manifest.json, but there are separate objects for the groups, e.g. "group.mojap_derived_tables.risk", and I see no reference to groups in the code of the dbt ingestion source.

This seems like a dead end, and we will need to make use of meta and/or tags instead, even if it means duplicating some metadata 🙁

MatMoore commented 2 months ago

add_owner doesn't seem to work in my initial test so I need to debug this

MatMoore commented 2 months ago

Maybe of interest - this can be used to enforce documentation coverage in dbt https://github.com/tnightengale/dbt-meta-testing

MatMoore commented 2 months ago

dbt docs picks up the meta fields when they are in model files like mojap_derived_tables/models/courts/common_platform_derived/common_platform_derived__all_offence_fct.yml, but not when they are set under the models mapping in dbt_project.yml.

I think I might need to do +meta instead of meta
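
The difference, sketched in dbt_project.yml (property values illustrative): without the + prefix, I believe dbt treats the key as a subdirectory name rather than a config, so only the second form applies to the models below it.

models:
  mojap_derived_tables:
    meta:          # interpreted as a directory name, not a config
      dc_slack_channel: '#abc'
    +meta:         # the + marks it as a config, inherited by all models below
      dc_slack_channel: '#abc'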

MatMoore commented 2 months ago

Update CaDeT task

Recommendation for dbt_project.yml

    general:
      +meta:
        dc_slack_channel: '#data-engineering'
        dc_owner_id: FirstName.LastName
        dc_where_to_access_dataset: AnalyticalPlatform

Recommendation for model files

Update ingestion recipe task

Add the following:

    strip_user_ids_from_email: true
    tag_prefix: ""
    meta_mapping:
      owner_email:
        match: '.*'
        operation: 'add_owner'
        config:
          owner_type: user
          owner_category: DATAOWNER

Note: anything in meta by default gets included in the custom properties.

MatMoore commented 2 months ago

Looking into what it would take to add a custom property operation to meta mapping. I'm checking in with the Datahub slack before attempting this. If we don't implement this, then a possible workaround would be to create a unique tag for each slack channel, e.g. dc:slack-channel:data-catalogue.
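
That workaround could be expressed as one meta_mapping entry per channel, using only the documented match/operation/config fields (the channel and tag names are illustrative):

meta_mapping:
  dc_slack_channel:
    match: '#data-catalogue'
    operation: 'add_tag'
    config:
      tag: 'dc:slack-channel:data-catalogue'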


OperationProcessor.process returns an aspect map - a dict mapping operation name to a resulting aspect definition. We would need to add a new operation type and allow DatasetProperties aspect values in the map.

This is processed like this

            meta_aspects: Dict[str, Any] = action_processor.process(node.meta)

            aspects = self._generate_base_dbt_aspects(
                node, additional_custom_props_filtered, mce_platform, meta_aspects
            )

            dataset_snapshot = DatasetSnapshot(urn=node_datahub_urn, aspects=aspects)
            mce = MetadataChangeEvent(proposedSnapshot=dataset_snapshot)
            if self.config.write_semantics == "PATCH":
                mce = self.get_patched_mce(mce)

            yield MetadataWorkUnit(id=dataset_snapshot.urn, mce=mce)

Option 1: Patch semantics

The other operations all have patch semantics, so maybe we should have an add_custom_property that works the same way?

get_patched_mce has cases for owners, tags, terms - presumably this needs to be extended if we want the same behaviour for custom properties.

meta_mapping complements the properties added in _generate_base_dbt_aspects - one of them needs to take precedence

If there is no match on the meta field, then intuitively we might expect the property to be removed. But there is no mechanism for this if we follow patch semantics. (In our use case this wouldn't matter though)

(There is an existing class CustomPropertiesPatchHelper, exposed via DatasetPatchBuilder, that may be relevant)
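
To make the target concrete: whichever semantics we pick, the new operation would need to contribute a DatasetProperties aspect to the aspect map. A minimal sketch using the real DatasetPropertiesClass (the property key/value and the add_custom_property operation itself are assumptions, not existing code):

from datahub.metadata.schema_classes import DatasetPropertiesClass

# The aspect a hypothetical add_custom_property operation would contribute
# to the aspect map returned by OperationProcessor.process.
properties_aspect = DatasetPropertiesClass(
    customProperties={"dc_slack_channel": "#ask-data-engineering"}
)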

Option 2: Overwrite semantics

The other option is if the custom properties generated this way completely overwrite existing custom properties.

This would make it consistent with the way custom properties are handled already (the ones that are set by the ingestion by default)

This would also mean that properties added in previous runs don't stick around forever.

They would still need to preserve the properties added via _generate_base_dbt_aspects.

MatMoore commented 2 months ago

Two more configs I haven't tried

enable_owner_extraction: bool = Field(
    default=True,
    description="When enabled, ownership info will be extracted from the dbt meta",
)
owner_extraction_pattern: Optional[str] = Field(
    default=None,
    description='Regex string to extract owner from the dbt node using the `(?P<name>...) syntax` of the [match object](https://docs.python.org/3/library/re.html#match-objects), where the group name must be `owner`. Examples: (1)`r"(?P<owner>(.*)): (\\w+) (\\w+)"` will extract `jdoe` as the owner from `"jdoe: John Doe"` (2) `r"@(?P<owner>(.*))"` will extract `alice` as the owner from `"@alice"`.',
)
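
For reference, the owner_extraction_pattern examples in the docstring above behave like this plain re sketch (values taken from the docstring):

import re

# The named group must be called "owner".
pattern = r"@(?P<owner>(.*))"
match = re.match(pattern, "@alice")
assert match and match.group("owner") == "alice"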

This is overwritten by meta mapping in _aggregate_owners. I think we're better off using our own meta property though, because CaDeT currently uses owners to refer to something more akin to a data custodian or technical owner, whereas we want to capture a named individual who will be the business owner.

MatMoore commented 2 months ago

What does the add_owner meta mapping actually do?

It creates the following aspect:

owner_aspect = OwnershipClass(
    owners=[
        OwnerClass(
            owner=x.get("urn"),
            type=x.get("category"),
            typeUrn=x.get("categoryUrn"),
            source=(
                OwnershipSourceClass(type=self.owner_source_type)
                if self.owner_source_type
                else None
            ),
        )
        for x in sorted(
            operation_map[Constants.ADD_OWNER_OPERATION],
            key=lambda x: x["urn"],
        )
    ]
)

There's another example here that works in a similar way https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/examples/library/dataset_add_owner.py

Neither of these creates the CorpUser entity - it has to already exist.

Users can be created from YAML with the user CLI https://datahubproject.io/docs/cli/#entity-specific-commands
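
Based on the format shown in the DataHub CLI docs, the user file would look something like this (the user details are illustrative), applied with datahub user upsert -f users.yaml:

- id: jane.doe
  first_name: Jane
  last_name: Doe
  email: jane.doe@justice.gov.uk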

From alpha we also have this JSON import https://github.com/ministryofjustice/data-catalogue-metadata/blob/main/ownership/users.json

Recommendation: When adding a new owner, CaDeT users should do one of the following:

  1. Log in to Datahub via Entra ID to automatically create their user
  2. Ask us to add the user to our pre-created user YAML/JSON

We could go with option 2 initially and then require option 1 later on.

MatMoore commented 2 months ago

How slack linking works

https://api.slack.com/docs/deep-linking

Either we ask for the name and redirect like so https://api.slack.com/reference/deep-linking#app_channel

Or we ask for the name and ID and generate a slack:// link https://api.slack.com/reference/deep-linking#open_a_channel

I think the latter assumes the user has a slack client installed, so the former would be more reliable
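
Concretely, the two link styles look like this (the channel name, team ID and channel ID are placeholders):

https://slack.com/app_redirect?channel=ask-data-engineering
slack://channel?team=TEAM_ID&id=CHANNEL_ID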

We should ignore the IDs already in groups.yml, since these are for notifications rather than for finding help.

There are 3 channels at the moment:

  - data-engineering-alerts-prod (C01H895893K)
  - hmpps-data-notifications (C062BRE6LG3)
  - opg-deds (C04T41W0M8B)

But we should use corresponding ask- channels for the catalogue, e.g. ask-data-engineering

We will ignore MS teams for now as we're not aware of anyone using this for a contact channel.

MatMoore commented 2 months ago

Next steps: write up a runbook page documenting how this should be captured, then do a final check that the metadata comes through as expected.

MatMoore commented 2 months ago

Runbook PR: https://github.com/ministryofjustice/data-catalogue-runbooks/pull/31

And this is a draft PR to add some initial metadata: https://github.com/moj-analytical-services/create-a-derived-table/pull/1568/files

Before merging that we will need to speak to the data engineers and