Closed seanprivett closed 4 months ago
My current thinking is that the best way to implement this filter is to load in a target_platform_instance via the dbt ingestion recipe, e.g. target_platform_instance: cadet. This could then be used in our search GraphQL query to filter out results where the platform is athena and the platform instance is cadet.
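A minimal sketch of what such a filter could look like, assuming the search query accepts DataHub-style orFilters (a list of "and" groups, OR-ed together) and that platform and platformInstance are valid filter field names. The helper name exclude_platform_instance_filters is hypothetical. Excluding entities that are both on athena and in the cadet instance is expressed, via De Morgan, as "not athena" OR "not cadet". Whether a negated filter also matches entities with no platform instance at all would need checking against a real DataHub instance.

```python
from typing import Any


def exclude_platform_instance_filters(platform: str, instance: str) -> list[dict[str, Any]]:
    """Build orFilters that drop entities matching BOTH platform and instance.

    NOT (platform AND instance) == (NOT platform) OR (NOT instance),
    so each negated condition goes in its own "and" group.
    Field names "platform" and "platformInstance" are assumptions.
    """
    platform_urn = f"urn:li:dataPlatform:{platform}"
    instance_urn = f"urn:li:dataPlatformInstance:({platform_urn},{instance})"
    return [
        {"and": [{"field": "platform", "values": [platform_urn], "negated": True}]},
        {"and": [{"field": "platformInstance", "values": [instance_urn], "negated": True}]},
    ]


# Example: filters to pass as the orFilters variable of a search query
filters = exclude_platform_instance_filters("athena", "cadet")
```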
Unfortunately, test ingestions I've been running in dev suggest that the native DataHub dbt ingestion does not properly populate the target platform instance: it does not appear in the dataPlatformInstance property returned by a GraphQL query.
I was able to populate a platform instance properly (and get the instance cadet back from a GraphQL query) using the Python client, e.g.
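    import datahub.emitter.mce_builder as mce_builder
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.metadata.schema_classes import DataPlatformInstanceClass

    # Attach the "cadet" platform instance to the athena dataset entity
    metadata_event = MetadataChangeProposalWrapper(
        entityUrn="urn:li:dataset:(urn:li:dataPlatform:athena,cadet.awsdatacatalog.bold_sm_spells.prison_spells_offences_dim,PROD)",
        aspect=DataPlatformInstanceClass(
            platform=mce_builder.make_data_platform_urn("athena"),
            instance=mce_builder.make_dataplatform_instance_urn(
                mce_builder.make_data_platform_urn("athena"),
                "cadet",
            ),
        ),
    )
    # `client` is our already-configured DataHub client
    client.graph.emit(metadata_event)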
The dataPlatformInstance property is set for the dbt entity (e.g. the model), but not for the entity created for the target platform (athena in our case).
Need to investigate further whether this is the intended behaviour of DataHub's dbt ingestion or a bug.
I believe this is the code in the dbt source where the target platform entity is created. Is something missing for creating the platform instance? https://github.com/acryldata/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/source/dbt/dbt_common.py#L1395
Asked a question about the above in the ingestion thread on the DataHub Slack: https://datahubspace.slack.com/archives/CUMUWQU66/p1716979453356899
The approach now used does not make use of the target platform instance directly. A mandated tag has been introduced to the search method in our DataHub client: if an entity is not tagged with display_in_catalogue, it is omitted from search results and hence not displayed in find-moj-data.
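As a rough illustration of the mandated-tag idea (not the actual client code), the sketch below adds a required tag condition to every "and" group of a caller's filters, so no search can bypass it. The helper name with_required_tag, the tags filter field name, and the orFilters shape are assumptions. Only the tag name display_in_catalogue comes from the comment above.

```python
from typing import Any, Optional

# Standard DataHub tag URN form for the mandated tag
REQUIRED_TAG_URN = "urn:li:tag:display_in_catalogue"


def with_required_tag(
    or_filters: Optional[list[dict[str, Any]]] = None,
) -> list[dict[str, Any]]:
    """Return filters that additionally require the mandated tag.

    The tag condition is appended to each "and" group because the groups
    are OR-ed together; adding it to only one would let untagged entities
    through via the others. The "tags" field name is an assumption.
    """
    tag_filter = {"field": "tags", "values": [REQUIRED_TAG_URN]}
    if not or_filters:
        return [{"and": [tag_filter]}]
    return [{"and": [*group["and"], tag_filter]} for group in or_filters]


# Example: a caller filtering on domain still gets the tag requirement
caller_filters = [{"and": [{"field": "domains", "values": ["urn:li:domain:prisons"]}]}]
search_filters = with_required_tag(caller_filters)
```

The input list is left unmodified, so the same caller filters can be reused elsewhere without the tag condition accumulating.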
User Story
As a user of the data catalogue
I expect to not be confused/overwhelmed by duplicate entities
So I have a simple search experience
Context
Currently find-moj-data treats DataHub dbt models/sources/seeds as tables and also treats Athena datasets as tables. This means search results present what appear to be duplicate entities.
Proposal
We only return the dbt entity type (be that model, seed or source) from DataHub in any search queries and filter out the associated Athena entity from all results. This can remain as a table entity for now, but will likely need some input from UR around how we best classify dbt data sourced from create-a-derived-table.
Definition of done
Agree implementation approach
Should mean the duplicate tests created here pass