opendatadiscovery / odd-platform

First open-source data discovery and observability platform. We make a life for data practitioners easy so you can focus on your business.
https://opendatadiscovery.org
Apache License 2.0

Data Entities with DEG does not show correct lineages #1408

Open ghosalya opened 10 months ago

ghosalya commented 10 months ago

Describe the bug

A Data Entity that is both a DataSet and a DataEntityGroup loses its dataset lineage information and only has lineage from the DataEntityGroup.

Set up

ODD-Platform v0.15.0 (ghcr.io/opendatadiscovery/odd-platform:0.15.0)

Steps to Reproduce

Code to reproduce this behavior: https://gist.github.com/ghosalya/aa25b2903d3d5bf728a8b8aad9731cec. It uses odd-models-package to call the Ingestion API to create Data Entities.

Steps to reproduce the behavior:

  1. Have odd-platform running at http://localhost:8080 (I followed this section of README.md for docker)

  2. Go to http://localhost:8080 and create a collector (Management -> Collectors -> Add Collector). Export the token as the environment variable ODD_PLATFORM_TOKEN

  3. Install odd-models-package

  4. Run odd_widget_example.py from the gist. This will create a number of entities e.g. WIDGET_TABLE

  5. Go to http://localhost:8080, look for the WIDGET_TABLE dataset and check the Lineage tab. It should show the widget_job -> widget_table lineage

  6. Now run odd_widget_example_deg.py. This will modify WIDGET_TABLE to have a DataEntityGroup component

  7. Go to http://localhost:8080 and look for the WIDGET_TABLE dataset; it should now have a DEG component

  8. Go to WIDGET_TABLE's Lineage tab
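The gist itself is not reproduced here. As a rough sketch of the kind of payload the two scripts submit, the snippet below builds a data entity list as plain dictionaries. The ODDRN strings are made up for illustration, and the field names (`data_source_oddrn`, `items`, `dataset`, `data_entity_group`, `entities_list`, `data_transformer`) are assumptions based on the ODD specification and should be checked against the odd-models version in use. Nothing is sent to the platform.

```python
import json

# Hypothetical ODDRNs, invented for this illustration only.
SOURCE = "//example/widgets"
TABLE = f"{SOURCE}/tables/widget_table"
JOB = f"{SOURCE}/jobs/widget_job"
VERSIONS = [f"{TABLE}_v1", f"{TABLE}_v2"]


def build_payload(with_deg: bool) -> dict:
    """Build an entity list with a job that writes to WIDGET_TABLE.

    with_deg=False roughly mirrors step 4 (plain dataset);
    with_deg=True roughly mirrors step 6 (dataset plus DEG component).
    Field names are assumed from the ODD specification.
    """
    table = {
        "oddrn": TABLE,
        "name": "WIDGET_TABLE",
        "type": "TABLE",
        "dataset": {"field_list": []},
    }
    if with_deg:
        # The same entity now also carries a DataEntityGroup component.
        table["data_entity_group"] = {"entities_list": VERSIONS}
    job = {
        "oddrn": JOB,
        "name": "widget_job",
        "type": "JOB",
        # The job's output is what produces widget_job -> widget_table.
        "data_transformer": {"inputs": [], "outputs": [TABLE]},
    }
    return {"data_source_oddrn": SOURCE, "items": [job, table]}


if __name__ == "__main__":
    print(json.dumps(build_payload(with_deg=True), indent=2))
```

The only difference between the two runs is the extra `data_entity_group` component on the same entity, which is what appears to trigger the lineage change.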

Expected behavior

The Lineage tab should still show widget_job -> widget_table

Current behavior

The Lineage tab is overridden by the DEG component and only shows the DEG members, and we lose the original lineage.

Additional context

The code to submit data entity list uses odd-models==2.0.31

DementevNikita commented 10 months ago

Hey @ghosalya!

Firstly, thank you for opening this ticket and for the comprehensive description you've provided!

The issue you're encountering stems from the combination of a dataset and a DEG. In instances like these, the ODD Platform prioritizes the lineage of the DEG. Moreover, during metadata ingestion, ODD Platform doesn’t cross-check against these specific classes and permits the creation of such combinations.

For us to address this effectively, could you shed some light on the rationale behind designating an entity as both a dataset and a DEG simultaneously? It's essential for us to grasp the underlying intentions so we can determine the best path forward and ensure that creating a DEG and dataset within the same entity is indeed meaningful.

ghosalya commented 10 months ago

Hi @DementevNikita

> For us to address this effectively, could you shed some light on the rationale behind designating an entity as both a dataset and a DEG simultaneously? It's essential for us to grasp the underlying intentions so we can determine the best path forward and ensure that creating a DEG and dataset within the same entity is indeed meaningful

This is one of the workarounds we are trying for https://github.com/opendatadiscovery/odd-platform/issues/1407

Essentially, we want a DataEntity that is a DataSet (i.e. WIDGET_TABLE), but also has a component that lists the versions of this dataset (WIDGET_TABLE_V1, WIDGET_TABLE_V2). In this case, I would like the lineage of WIDGET_TABLE to derive from its DataSet lineage, since it is first and foremost a table.
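To make the difference between the reported and the requested behaviour concrete, here is a toy model. It is not the platform's implementation; the `Entity` class and both resolver functions are hypothetical and only capture the priority rule discussed in this thread.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    name: str
    upstream: list = field(default_factory=list)       # dataset lineage sources
    group_members: list = field(default_factory=list)  # DEG members


def lineage_deg_first(entity: Entity) -> list:
    """Reported behaviour: DEG members take over the Lineage tab."""
    if entity.group_members:
        return list(entity.group_members)
    return list(entity.upstream)


def lineage_dataset_first(entity: Entity) -> list:
    """Requested behaviour: dataset lineage wins when it exists."""
    if entity.upstream:
        return list(entity.upstream)
    return list(entity.group_members)


table = Entity(
    name="widget_table",
    upstream=["widget_job"],
    group_members=["widget_table_v1", "widget_table_v2"],
)

print(lineage_deg_first(table))      # ['widget_table_v1', 'widget_table_v2']
print(lineage_dataset_first(table))  # ['widget_job']
```

With the dataset-first rule, WIDGET_TABLE keeps the widget_job -> widget_table lineage while the version list stays available through the group component.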