Tableau Connector : Unify Data Models

open-metadata / OpenMetadata

OpenMetadata is a unified metadata platform for data discovery, data observability, and data governance powered by a central metadata repository, in-depth column level lineage, and seamless team collaboration.

https://open-metadata.org

Apache License 2.0


Tableau Connector : Unify Data Models #15218

Open jsampaiog opened 4 months ago

jsampaiog commented 4 months ago

Is your feature request related to a problem? Please describe.

When ingesting data models from Tableau, multiple data models are displayed for the same data source. This inflates the total number of data sources, even though the underlying source is unique, and makes discovery and lineage more complicated.

Describe the solution you'd like

Today OMD relies on the nodes segment of the Tableau metadata to create the data model:

  embeddedDatasourcesConnection(first: {first}, offset: {offset} ) {{
    nodes {{
      id
      name
      fields {{
        id
        name
        upstreamColumns{{
          id
          name
          remoteType
        }}

But perhaps a better way would be to create the data model based on the root data model, since these share the same ID across the models

chillerno1 commented 4 months ago

@harshach I wanted to chime in on this conversation. At my organisation, we're ingesting a large Tableau instance, and we've also noticed this behavior where there are multiple versions of the same datasource. What we found is that a workbook (dashboard) can have its own embedded datasource: a workbook-specific version of the upstream datasource it connects to (usually one that exists on Tableau Server). The reason seems to be that a workbook can connect to a datasource and then change field names, add calculated fields, and make various other changes, ending up with its own version of the connected datasource.

Ideally, what we would like (and @jsampaiog, please jump in if you disagree) is for published datasources to be ingested into OpenMetadata as well as the embedded datasources (it would be nice to have a new icon to differentiate the data models).

I think it's important to keep both the published and embedded datasources, because that way we can see what transformations have occurred at the workbook level and compare them to the published server model.

Here's a screenshot of what it might look like:

jsampaiog commented 4 months ago

Hi @chillerno1, thanks for chiming in. The behavior you describe would indeed be the best target scenario! But we also brainstormed internally, and to avoid complicating the OpenMetadata data model, if we were forced to choose between "Published datasources" and "Embedded datasources", we would stick with the former.

chillerno1 commented 4 months ago

Thanks @jsampaiog, I agree with that!

nicor88 commented 2 months ago

Chiming in too. My scenario is pretty much the same as the one @jsampaiog described:

A published data source that is then used in multiple places. We are planning to use this for more scenarios, so the number of data models can simply explode. What @jsampaiog suggested in the original issue seems the way to go:

But perhaps a better way would be to create the data model based on the root data model, since these share the same ID across the models

nicor88 commented 2 months ago

@pmbrull are you still planning to include this in the 1.4.0 release? I see it was removed :(

pmbrull commented 2 months ago

Hi, we had to reprioritize certain topics and ran out of time to handle this, so 1.4.1 - 1.5 would be the new ETA.

My 2 cents on the conversation above is to keep things simple. Aiming to keep the Published DataModel would, IMO, be the way to go to reduce complexity.

nicor88 commented 2 months ago

@pmbrull thanks for the context on the timelines.

I believe that "Published DataModel" should do the job even in the case of a "Dashboard" with embedded data models. We just need to be sure that we don't introduce a regression where data models are missing entirely.

17rahulsharma commented 2 months ago

Thanks, @OnkarVO7, for this thread.

We also have a similar problem: duplicate models rather than a single combined model for all downstream workbooks.

Currently, OMD uses this query:

  query {
    embeddedDatasourcesConnection(filter: {name: "Tech Data Model"}) {
      nodes {
        id
        name
        workbook {
          id
          name
        }
      }
      totalCount
    }
  }

We checked with the Tableau team (we spent a lot of time with Tableau support to get the right information), and they proposed using the query below:

  query {
    publishedDatasources(filter: {name: "Tech Data Model"}) {
      id
      name
      hasExtracts
      downstreamWorkbooks {
        id
        luid
        name
      }
    }
  }
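To illustrate how a response to the publishedDatasources query could be consumed, here is a minimal Python sketch that builds one data model entry per published datasource and attaches its downstream workbooks. The helper name `models_from_published` and the sample payload are hypothetical; only the field names come from the query above, and the `data` wrapper is assumed from the usual GraphQL response convention.

```python
from typing import Any


def models_from_published(response: dict[str, Any]) -> dict[str, dict[str, Any]]:
    """Return one entry per published datasource, keyed by its id."""
    models: dict[str, dict[str, Any]] = {}
    for ds in response.get("data", {}).get("publishedDatasources", []):
        # setdefault keeps a single entry even if the same id appears twice.
        entry = models.setdefault(ds["id"], {"name": ds["name"], "workbooks": []})
        for wb in ds.get("downstreamWorkbooks") or []:
            if wb["name"] not in entry["workbooks"]:
                entry["workbooks"].append(wb["name"])
    return models


# Hypothetical response, shaped like the query above.
sample = {
    "data": {
        "publishedDatasources": [
            {
                "id": "ds-1",
                "name": "Tech Data Model",
                "hasExtracts": True,
                "downstreamWorkbooks": [
                    {"id": "wb-1", "luid": "l-1", "name": "Sales"},
                    {"id": "wb-2", "luid": "l-2", "name": "Ops"},
                ],
            }
        ]
    }
}

models = models_from_published(sample)
print(models)
```

With this shape, the number of data models follows the number of published datasources rather than the number of workbooks that use them.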

triquinielas commented 2 months ago

Hi,

Sorry for commenting in this thread, but we are facing the same situation: the sources are duplicated for each workbook (dashboard in OM) that we ingest.

The dashboard datamodel exists only once on Tableau:

If the object existed only once, we could trace lineage to the workbooks and assign the owner once, instead of treating each copy as an independent object.

In addition, we have split the ingestion into different services, one per Tableau folder, since this allows us to assign an owner by folder. Perhaps if the Tableau folder were ingested the way the database service is, we could maintain that hierarchy, which would also allow us to filter by folder:

DB Ingestion -> Schema -> Table
Tableau Ingestion -> Folder -> Workbook & data models

thanks, Carlos

nicor88 commented 2 days ago

Here is a recap of the conversation that I had with @OnkarVO7.

The query must be changed to add this section:

    upstreamDatasources {
      id
      luid
      name
      description
      hasExtracts
      tags {
        id
      }
      fields {
        id
        name
        isHidden
      }
      upstreamTables {
        id
        luid
        name
        fullName
        schema
        referencedByQueries {
          id
          name
          query
        }
        columns {
          id
          name
        }
        database {
          id
          name
        }
      }
    }

The ingestion must have logic to use the new upstreamDatasources field. If upstreamDatasources is not empty (which is the case for published datasources), we need to publish a new data model node and link it to the underlying data source downstream and upstream to the related data model in Tableau.
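The logic above could look roughly like the following Python sketch. It is not the OpenMetadata implementation: `DataModel`, `build_models`, the "published"/"embedded" labels, and the sample nodes are all hypothetical, and the edge direction (tables, then published data model, then embedded data model) is only one reading of the linking described above. The input is assumed to be a list of embedded datasource nodes, each carrying the upstreamDatasources list from the query section.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class DataModel:
    id: str
    name: str
    kind: str  # "published" or "embedded" (hypothetical labels)


def build_models(embedded_nodes: list[dict[str, Any]]):
    """Create each published data model once and link it to embedded ones."""
    models: dict[str, DataModel] = {}
    edges: set[tuple[str, str]] = set()  # (upstream id, downstream id)
    for node in embedded_nodes:
        # Embedded datasources are always kept, so no data model goes missing.
        models[node["id"]] = DataModel(node["id"], node["name"], "embedded")
        for up in node.get("upstreamDatasources") or []:
            # The same published datasource seen from several workbooks is reused.
            models.setdefault(up["id"], DataModel(up["id"], up["name"], "published"))
            edges.add((up["id"], node["id"]))
            for table in up.get("upstreamTables") or []:
                edges.add((table["id"], up["id"]))
    return models, edges


# Hypothetical input: two workbooks embedding the same published datasource,
# plus one embedded datasource with no published upstream.
published = {"id": "p1", "name": "Tech Data Model", "upstreamTables": [{"id": "t1"}]}
nodes = [
    {"id": "e1", "name": "Tech Data Model", "upstreamDatasources": [published]},
    {"id": "e2", "name": "Tech Data Model", "upstreamDatasources": [published]},
    {"id": "e3", "name": "Local extract", "upstreamDatasources": []},
]
models, edges = build_models(nodes)
```

Keying on the upstream datasource id is what collapses the per-workbook duplicates into a single published node, while the empty-list case falls through to a plain embedded node.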