ministryofjustice / find-moj-data

Find MOJ data service • This repository is defined and managed in Terraform
MIT License

Filter out CaDeT Athena entities from FMD front end #351

Closed: seanprivett closed this issue 4 months ago

seanprivett commented 4 months ago

User Story

As a user of the data catalogue, I expect not to be confused or overwhelmed by duplicate entities, so that I have a simple search experience.

Context

Currently find-moj-data treats datahub dbt models/sources/seeds as tables and also treats Athena datasets as tables. This means search results show what appear to be duplicate entities.

Proposal

We only return the dbt entity type (be that model, seed or source) from datahub in any search queries and filter out the associated Athena entity from all results. This can remain as a table entity for now, but we will likely need some input from UR on how best to classify dbt data sourced from create-a-derived-table.

Definition of done

Agree implementation approach.
Should mean the duplicate tests created here pass.

LavMatt commented 4 months ago

My current thinking on the best way to implement this filter is to load in a target_platform_instance via the dbt ingestion recipe, e.g. target_platform_instance: cadet. Our search graphql query could then use this to filter out results where the platform is athena and the platform instance is cadet.
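As a rough sketch of what that filter could look like: excluding "athena AND cadet" is the same as keeping anything that is "not athena OR not cadet", which fits the OR-of-ANDs shape of datahub search filters. The field names platform and platformInstance, the use of negated, and the helper names below are assumptions for illustration, not what our client currently does. The matches function is only a local stand-in to check the logic, not a datahub API.

```python
# Sketch only: field names and helper names are assumptions, not our client's API.
ATHENA = "urn:li:dataPlatform:athena"
CADET_INSTANCE = "urn:li:dataPlatformInstance:(urn:li:dataPlatform:athena,cadet)"


def exclude_cadet_athena_filters() -> list[dict]:
    """Build an OR-of-ANDs filter that keeps everything except athena + cadet.

    NOT (platform == athena AND instance == cadet)
      == (platform != athena) OR (instance != cadet)
    """
    return [
        {"and": [{"field": "platform", "values": [ATHENA], "negated": True}]},
        {"and": [{"field": "platformInstance", "values": [CADET_INSTANCE], "negated": True}]},
    ]


def matches(entity: dict, or_filters: list[dict]) -> bool:
    """Local stand-in for evaluating an OR-of-ANDs filter against one entity."""

    def facet_ok(facet: dict) -> bool:
        hit = entity.get(facet["field"]) in facet["values"]
        return not hit if facet.get("negated") else hit

    return any(all(facet_ok(f) for f in group["and"]) for group in or_filters)


if __name__ == "__main__":
    filters = exclude_cadet_athena_filters()
    cadet_athena = {"platform": ATHENA, "platformInstance": CADET_INSTANCE}
    dbt_model = {"platform": "urn:li:dataPlatform:dbt", "platformInstance": CADET_INSTANCE}
    print(matches(cadet_athena, filters))  # the duplicate athena entity is dropped
    print(matches(dbt_model, filters))     # the dbt entity is kept
```

The list returned by exclude_cadet_athena_filters would be passed as the orFilters argument of the search query, assuming the platform instance is actually populated on the athena entities.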

Unfortunately, some test ingestions I've been running in dev suggest that the native datahub dbt ingestion does not properly populate the target platform instance: it does not appear in the dataPlatformInstance property when queried via graphql.

I was able to properly populate a platform instance (and get the instance cadet back from a graphql query) using the Python client, e.g.

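import datahub.emitter.mce_builder as mce_builder
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.metadata.schema_classes import DataPlatformInstanceClass

# Attach a dataPlatformInstance aspect to the athena dataset created by the dbt ingestion.
athena_platform_urn = mce_builder.make_data_platform_urn("athena")
metadata_event = MetadataChangeProposalWrapper(
    entityUrn="urn:li:dataset:(urn:li:dataPlatform:athena,cadet.awsdatacatalog.bold_sm_spells.prison_spells_offences_dim,PROD)",
    aspect=DataPlatformInstanceClass(
        platform=athena_platform_urn,
        instance=mce_builder.make_dataplatform_instance_urn(athena_platform_urn, "cadet"),
    ),
)
# `client` is an already-constructed instance of our datahub client (not shown here).
client.graph.emit(metadata_event)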

With the native dbt ingestion, the dataPlatformInstance property is set for the dbt entity (e.g. the model) but not for the entity created for the target platform (athena in our case).

Need to investigate further whether this is the intended behaviour of datahub's dbt ingestion or a bug.

I believe this is the code in the dbt source where the target platform entity is created. Is something missing for creating the platform instance? https://github.com/acryldata/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/source/dbt/dbt_common.py#L1395

LavMatt commented 4 months ago

Asked a question about the above in the ingestion thread in the datahub Slack: https://datahubspace.slack.com/archives/CUMUWQU66/p1716979453356899

LavMatt commented 4 months ago

The approach we ended up with does not use the target platform instance directly. A mandatory tag has been introduced to the search method in our datahub client: if an entity is not tagged with display_in_catalogue, it is omitted from search results and hence not displayed in find-moj-data.
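For illustration, a minimal sketch of how a mandatory tag can be ANDed into every group of an OR-of-ANDs search filter. This is not the actual code in our datahub client; the function name with_mandatory_tag, the filter field name tags, and the dict shape are assumptions. Only the tag name display_in_catalogue comes from the change described above.

```python
# Sketch only: helper name, "tags" field name and filter shape are assumptions.
DISPLAY_TAG_URN = "urn:li:tag:display_in_catalogue"


def with_mandatory_tag(or_filters: list[dict], tag_urn: str = DISPLAY_TAG_URN) -> list[dict]:
    """Return new OR-of-ANDs filters where every group also requires the tag."""
    required = {"field": "tags", "values": [tag_urn]}
    if not or_filters:
        # No user filters: the tag is the only condition.
        return [{"and": [required]}]
    # Add the tag condition to each OR group so no branch can bypass it.
    return [{"and": [*group["and"], required]} for group in or_filters]


if __name__ == "__main__":
    user_filters = [{"and": [{"field": "domains", "values": ["urn:li:domain:prisons"]}]}]
    print(with_mandatory_tag(user_filters))
    print(with_mandatory_tag([]))
```

Adding the condition to every group matters: appending it as a separate OR group would let untagged entities through whenever they matched any other group.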