ministryofjustice / find-moj-data


Spike: Identify the best approach to collect cadet custom metadata (with respect to datahub's dbt ingestion) #363

Closed LavMatt closed 1 month ago

LavMatt commented 2 months ago

User Story

As a developer, I want the approach implemented in CaDeT for collecting catalogue-specific metadata to be the one best compatible with DataHub, so that ingestion from CaDeT (via DataHub's dbt ingestion) works as simply as possible.

Proposal

In a meeting with the DMET members responsible for create-a-derived-table, a couple of different approaches to collecting catalogue-specific metadata in CaDeT were discussed. These will need to be understood in more detail by following up with some or all of Soumaya Mauthoor, Ben Waterfield, Jacob Hamblin-Pike and Phil Hall.

From memory the approaches were:

  1. Use dbt groups to collect and organise some of this metadata
  2. Rely more fully on a model tagging strategy

Once the above approaches have been clarified, we could mock up some metadata to see how each integrates with the native DataHub dbt ingestion.

Definition of Done

Example

seanprivett commented 2 months ago

Session with CaDeT users to explore this further:

  - Where does it fit into their workflow?
  - Is this just a PR we raise for them to approve?
  - How do we build verification into CI/CD?
  - Review the metadata spec with CaDeT users
  - Where would you put the catalogue opt-in flag?

This sheet shows the items, highlighted in green, which we are asking CaDeT users to enter manually. These should be the only metadata items we need to add into the workflow.

LavMatt commented 2 months ago

I've spoken to Ben Waterfield about the approaches and they're happy to be led by whatever is best for the catalogue ingestion (within reason, for ease of implementation within CaDeT). So we need to figure out how well each integrates and present the findings back to the CaDeT team.

I agreed with Ben that we could create some tables in dbt to test the different approaches and their integration with DataHub's dbt ingestion.

MatMoore commented 2 months ago

Minimum metadata to integrate into the workflow:

For access requirements we are just going to show a standard boilerplate for now, redirecting users to the data engineering access repo.

We also need descriptions to be populated properly, but this is already handled by the dbt ingestion.

See https://docs.google.com/spreadsheets/d/1O1EjO96lyqDbyuzFWU8SNBky3b7tKM8dccZVS7-_t9g/edit#gid=1846443805

An example of what this could look like in CaDeT, if we use the meta block:

meta:
  show_in_catalogue: true
  owner_email: person@justice.gov.uk
  contact:
    - type: slack
      channel: '#abc'

Things to explore

MatMoore commented 2 months ago

Groups example looks like this

groups:
  - name: finance
    owner:
      # 'name' or 'email' is required; additional properties allowed
      email: finance@jaffleshop.com
      slack: finance-data
      github: finance-data-team

https://docs.getdbt.com/docs/build/groups

This conflates the data owner/steward/custodian roles, and I'm not sure how it will be parsed by DataHub, so I'm unsure about this approach yet.

Can try both groups and meta block approaches out though.

MatMoore commented 2 months ago

whereToAccessDataset - should be set to Analytical Platform

MatMoore commented 2 months ago

Existing groups: https://github.com/moj-analytical-services/create-a-derived-table/blob/d3999ee10bb63a160e8a14a41c8e2f5a3b1f746b/mojap_derived_tables/models/groups.yml#L4

These use Slack IDs, so we would need to translate them into human-readable channel names.

MatMoore commented 2 months ago

Configs can be inherited in dbt, so perhaps we can apply them at the domain level.

https://docs.getdbt.com/reference/configs-and-properties
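
For example, a +meta block set on a subdirectory in dbt_project.yml should cascade to every model underneath it. A minimal sketch (the directory and property values here are illustrative, not taken from CaDeT):

models:
  mojap_derived_tables:
    courts:
      +meta:
        dc_slack_channel: '#ask-data-engineering'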

MatMoore commented 2 months ago

Throwaway code for testing locally: https://github.com/moj-analytical-services/create-a-derived-table/compare/datahub-test

MatMoore commented 2 months ago

Meta mapping findings

https://datahubproject.io/docs/generated/ingestion/sources/dbt/#dbt-meta-automated-mappings

No nested matches

The DBT source's meta mapping config does not support matching nested properties, e.g. matching contact.channel here

  contact:
    - type: slack
      channel: '#abc'

OperationProcessor has an argument for this (match_nested_props: bool = False), but it was added for Kafka and not backported to the dbt source; see https://github.com/datahub-project/datahub/pull/8825

This can be worked around by not using nested yaml, e.g.

slack_contact: '#abc'
teams_contact: '#def'
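
The flattened keys can then be matched directly in the ingestion recipe's meta_mapping. A sketch using the documented match/operation/config fields (the tag name is illustrative):

meta_mapping:
  slack_contact:
    match: '#abc'
    operation: 'add_tag'
    config:
      tag: 'slack-abc'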

add_property operation is missing

There is no add_property operation.

It might be worth looking into how much work this would be to add upstream.

add_owner operation

There IS an add_owner operation. It treats the value as an ID and passes it through mce_builder.make_owner_urn

If we set strip_user_ids_from_email it will strip off the @justice.gov.uk part of the email.

Does it create the user if it doesn't exist?

No!

Conclusions:

It seems that in addition to the explicit mapping, you can also set metadata via a datahub key like this

meta:
  datahub:
    owners:
      - 'urn:...'

and then it will be automatically picked up by the meta mapping functionality.

MatMoore commented 2 months ago

Group findings

Group information doesn't seem to be picked up by datahub. It's not propagated to the custom properties, and datasets ingested with a group set do not have an owner.

This information is not present in the catalogue.json. It is in the manifest.json, but there are separate objects for the groups, e.g. "group.mojap_derived_tables.risk", and I see no reference to groups in the code of the dbt ingestion source.

This seems like a dead end, and we will need to make use of meta and/or tags instead, even if it means duplicating some metadata 🙁

MatMoore commented 2 months ago

add_owner doesn't seem to work in my initial test so I need to debug this

MatMoore commented 2 months ago

Maybe of interest - this can be used to enforce documentation coverage in dbt https://github.com/tnightengale/dbt-meta-testing

MatMoore commented 2 months ago

dbt docs picks up the meta fields when they are in model files like mojap_derived_tables/models/courts/common_platform_derived/common_platform_derived__all_offence_fct.yml, but not when they are set under the models mapping in dbt_project.yml.

I think I might need to do +meta instead of meta
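
The difference, sketched in dbt_project.yml (property values illustrative): without the + prefix, I believe dbt treats the key as a subdirectory name rather than a config, so only the second form applies to the models below it.

models:
  mojap_derived_tables:
    meta:          # interpreted as a directory name, not a config
      dc_slack_channel: '#abc'
    +meta:         # the + marks it as a config, inherited by all models below
      dc_slack_channel: '#abc'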

MatMoore commented 2 months ago

Update CaDeT task

Recommendation for dbt_project.yml

    general:
      +meta:
        dc_slack_channel: '#data-engineering'
        dc_owner_id: FirstName.LastName
        dc_where_to_access_dataset: AnalyticalPlatform

Recommendation for model files

Update ingestion recipe task

Add the following:

    strip_user_ids_from_email: true
    tag_prefix: ""
    meta_mapping:
      owner_email:
        match: '.*'
        operation: 'add_owner'
        config:
          owner_type: user
          owner_category: DATAOWNER

Note: anything in meta by default gets included in the custom properties.

MatMoore commented 2 months ago

Looking into what it would take to add a custom property operation to meta mapping. I'm checking in with the Datahub slack before attempting this. If we don't implement this, then a possible workaround would be to create a unique tag for each slack channel, e.g. dc:slack-channel:data-catalogue.
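
That workaround could be expressed as one meta_mapping entry per channel, using only the documented match/operation/config fields (the channel and tag names are illustrative):

meta_mapping:
  dc_slack_channel:
    match: '#data-catalogue'
    operation: 'add_tag'
    config:
      tag: 'dc:slack-channel:data-catalogue'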


OperationProcessor.process returns an aspect map - a dict mapping operation name to a resulting aspect definition. We would need to add a new operation type and allow DatasetProperties aspect values in the map.

This is processed like this

            meta_aspects: Dict[str, Any] = action_processor.process(node.meta)

            aspects = self._generate_base_dbt_aspects(
                node, additional_custom_props_filtered, mce_platform, meta_aspects
            )

            dataset_snapshot = DatasetSnapshot(urn=node_datahub_urn, aspects=aspects)
            mce = MetadataChangeEvent(proposedSnapshot=dataset_snapshot)
            if self.config.write_semantics == "PATCH":
                mce = self.get_patched_mce(mce)

            yield MetadataWorkUnit(id=dataset_snapshot.urn, mce=mce)

Option 1: Patch semantics

The other operations all have patch semantics, so maybe we should have an add_custom_property that works the same way?

get_patched_mce has cases for owners, tags, terms - presumably this needs to be extended if we want the same behaviour for custom properties.

meta_mapping complements the properties added in _generate_base_dbt_aspects - one of them needs to take precedence

If there is no match on the meta field, then intuitively we might expect the property to be removed. But there is no mechanism for this if we follow patch semantics. (In our use case this wouldn't matter though)

(There is an existing class CustomPropertiesPatchHelper, exposed via DatasetPatchBuilder, that may be relevant)
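
To make the target concrete: whichever semantics we pick, the new operation would need to contribute a DatasetProperties aspect to the aspect map. A minimal sketch using the real DatasetPropertiesClass (the property key/value and the add_custom_property operation itself are assumptions, not existing code):

from datahub.metadata.schema_classes import DatasetPropertiesClass

# The aspect a hypothetical add_custom_property operation would contribute
# to the aspect map returned by OperationProcessor.process.
properties_aspect = DatasetPropertiesClass(
    customProperties={"dc_slack_channel": "#ask-data-engineering"}
)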

Option 2: Overwrite semantics

The other option is if the custom properties generated this way completely overwrite existing custom properties.

This would make it consistent with the way custom properties are handled already (the ones that are set by the ingestion by default)

This would also mean that properties added in previous runs don't stick around forever.

They would still need to preserve the properties added via _generate_base_dbt_aspects.

MatMoore commented 2 months ago

Two more configs I haven't tried

enable_owner_extraction: bool = Field(
    default=True,
    description="When enabled, ownership info will be extracted from the dbt meta",
)
owner_extraction_pattern: Optional[str] = Field(
    default=None,
    description='Regex string to extract owner from the dbt node using the `(?P<name>...) syntax` of the [match object](https://docs.python.org/3/library/re.html#match-objects), where the group name must be `owner`. Examples: (1)`r"(?P<owner>(.*)): (\\w+) (\\w+)"` will extract `jdoe` as the owner from `"jdoe: John Doe"` (2) `r"@(?P<owner>(.*))"` will extract `alice` as the owner from `"@alice"`.',
)
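
For reference, the owner_extraction_pattern examples in the docstring above behave like this plain re sketch (values taken from the docstring):

import re

# The named group must be called "owner".
pattern = r"@(?P<owner>(.*))"
match = re.match(pattern, "@alice")
assert match and match.group("owner") == "alice"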

This is overwritten by meta mapping in _aggregate_owners. I think we're better off using our own meta property though, because CaDeT currently uses owners to refer to something more akin to a data custodian or technical owner, whereas we want to capture a named individual who will be the business owner.

MatMoore commented 2 months ago

What does the add_owner meta mapping actually do?

It creates the following aspect:

owner_aspect = OwnershipClass(
    owners=[
        OwnerClass(
            owner=x.get("urn"),
            type=x.get("category"),
            typeUrn=x.get("categoryUrn"),
            source=(
                OwnershipSourceClass(type=self.owner_source_type)
                if self.owner_source_type
                else None
            ),
        )
        for x in sorted(
            operation_map[Constants.ADD_OWNER_OPERATION],
            key=lambda x: x["urn"],
        )
    ]
)

There's another example here that works in a similar way https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/examples/library/dataset_add_owner.py

Neither of these creates the CorpUser entity - it has to already exist.

Users can be created from YAML with the user CLI https://datahubproject.io/docs/cli/#entity-specific-commands
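
Based on the format shown in the DataHub CLI docs, the user file would look something like this (the user details are illustrative), applied with datahub user upsert -f users.yaml:

- id: jane.doe
  first_name: Jane
  last_name: Doe
  email: jane.doe@justice.gov.uk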

From alpha we also have this JSON import https://github.com/ministryofjustice/data-catalogue-metadata/blob/main/ownership/users.json

Recommendation: When adding a new owner, CaDeT users should do one of the following:

  1. Log in to Datahub via Entra ID to automatically create their user
  2. Ask us to add the user to our pre-created user YAML/JSON

We could go with option 2 initially and then require option 1 later on.

MatMoore commented 2 months ago

How slack linking works

https://api.slack.com/docs/deep-linking

Either we ask for the name and redirect like so https://api.slack.com/reference/deep-linking#app_channel

Or we ask for the name and ID and generate a slack:// link https://api.slack.com/reference/deep-linking#open_a_channel

I think the latter assumes the user has a slack client installed, so the former would be more reliable
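
Concretely, the two link styles look like this (the channel name, team ID and channel ID are placeholders):

https://slack.com/app_redirect?channel=ask-data-engineering
slack://channel?team=TEAM_ID&id=CHANNEL_ID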

We should ignore the IDs already in groups.yml, since these are for notifications rather than for finding help.

There are 3 channels at the moment:

  - data-engineering-alerts-prod (C01H895893K)
  - hmpps-data-notifications (C062BRE6LG3)
  - opg-deds (C04T41W0M8B)

But we should use corresponding ask- channels for the catalogue, e.g. ask-data-engineering

We will ignore MS teams for now as we're not aware of anyone using this for a contact channel.

MatMoore commented 2 months ago

Next steps: write up a runbook page documenting how this should be captured, then do a final check that the metadata comes through as expected.

MatMoore commented 2 months ago

Runbook PR: https://github.com/ministryofjustice/data-catalogue-runbooks/pull/31

And this is a draft PR to add some initial metadata: https://github.com/moj-analytical-services/create-a-derived-table/pull/1568/files

Before merging that we will need to speak to the data engineers and