ministryofjustice / find-moj-data

Find MOJ data service • This repository is defined and managed in Terraform
MIT License

Assign domains to entities in CaDeT #343

Closed by seanprivett 1 month ago

seanprivett commented 2 months ago

We would like the domains in the CaDeT metadata to be assigned to actual DataHub domains, rather than left as custom properties.

https://datahubproject.io/docs/generated/ingestion/sources/dbt/#dbt-meta-automated-mappings

murdo-moj commented 2 months ago

A thread in Slack suggests using transformers to achieve this, as the native dbt mappings don't have an add_domain utility.

https://datahubspace.slack.com/archives/CUMUWQU66/p1674149180727029

https://datahubproject.io/docs/metadata-ingestion/docs/transformer/dataset_transformer/#domain-mapping-based-on-tags

murdo-moj commented 2 months ago

There's an active feature request for this: "Add ability to specify domain when Ingest DBT metadata"

murdo-moj commented 2 months ago

I have this working by using table names to assign domains.

murdo-moj commented 2 months ago

source:
    type: dbt
    config:
        manifest_path: 's3://mojap-derived-tables/prod/run_artefacts/latest/target/manifest.json'
        catalog_path: 's3://mojap-derived-tables/prod/run_artefacts/latest/target/catalog.json'
        test_results_path: 's3://mojap-derived-tables/prod/run_artefacts/latest/target/run_results.json'
        target_platform: athena
        infer_dbt_schemas: true
        aws_connection:
            aws_region: eu-west-1
        node_name_pattern:
            allow:
                - '.*bold_sm_spells.*'
                - '.*common_platform.*'
                - '.*sirius.*'
        entities_enabled:
            test_results: 'YES'
            seeds: 'YES'
            snapshots: 'YES'
            models: 'YES'
            sources: 'YES'
            test_definitions: 'YES'
        stateful_ingestion:
            remove_stale_metadata: true

transformers:
    - type: "pattern_add_dataset_domain"
      config:
        semantics: OVERWRITE
        domain_pattern:
          rules:
            'urn:li:dataset:\(urn:li:dataPlatform:dbt,awsdatacatalog.*common_platform.*': ["HMCTS"]
            'urn:li:dataset:\(urn:li:dataPlatform:dbt,awsdatacatalog.*prison.*': ["HMPPS"]
            'urn:li:dataset:\(urn:li:dataPlatform:dbt,awsdatacatalog.*sirius.*': ["OPG"]

murdo-moj commented 2 months ago

This recipe is included in https://github.com/ministryofjustice/data-catalogue/issues/123
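
To sanity-check the domain_pattern rules in the recipe before running an ingestion, something like the following may help. It is a minimal sketch in plain Python, not DataHub's own implementation: it assumes each rule is a regex matched from the start of the dataset URN and that the first matching rule wins. The helper `domains_for` and the sample URNs are hypothetical and made up for illustration; only the three rule patterns come from the recipe.

```python
import re
from typing import Dict, List

# Rules copied from the recipe's domain_pattern: URN regex -> domain names.
DOMAIN_RULES: Dict[str, List[str]] = {
    r"urn:li:dataset:\(urn:li:dataPlatform:dbt,awsdatacatalog.*common_platform.*": ["HMCTS"],
    r"urn:li:dataset:\(urn:li:dataPlatform:dbt,awsdatacatalog.*prison.*": ["HMPPS"],
    r"urn:li:dataset:\(urn:li:dataPlatform:dbt,awsdatacatalog.*sirius.*": ["OPG"],
}


def domains_for(urn: str, rules: Dict[str, List[str]] = DOMAIN_RULES) -> List[str]:
    """Return the domains of the first rule whose regex matches the start of the URN.

    Assumption: this first-match behaviour approximates the transformer;
    check the DataHub documentation for the exact semantics.
    """
    for pattern, domains in rules.items():
        if re.match(pattern, urn):
            return domains
    return []


# Hypothetical URNs; the database and table names are invented for this example.
SAMPLE_URNS = [
    "urn:li:dataset:(urn:li:dataPlatform:dbt,awsdatacatalog.common_platform_derived.all_offence,PROD)",
    "urn:li:dataset:(urn:li:dataPlatform:dbt,awsdatacatalog.sirius_derived.cases,PROD)",
    "urn:li:dataset:(urn:li:dataPlatform:dbt,awsdatacatalog.bold_sm_spells.spells,PROD)",
]

for urn in SAMPLE_URNS:
    print(urn, "->", domains_for(urn))
```

With these made-up URNs, the bold_sm_spells dataset gets no domain, because none of the three rules mentions it. If real bold_sm_spells table names do not contain "prison", the recipe may need an extra rule for them.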

murdo-moj commented 2 months ago

Matt did a spike to pick up domains from CaDeT, ~from which we'd then map to our own domain model.~ https://github.com/ministryofjustice/find-moj-data/issues/108

murdo-moj commented 1 month ago

https://github.com/ministryofjustice/data-catalogue/pull/138