open-metadata / OpenMetadata

OpenMetadata is a unified metadata platform for data discovery, data observability, and data governance powered by a central metadata repository, in-depth column level lineage, and seamless team collaboration.
https://open-metadata.org
Apache License 2.0
5.58k stars 1.05k forks source link

problems with ingesting lineage #15784

Closed manuelateles closed 7 months ago

manuelateles commented 7 months ago

Affected module this is impacting the lineage

Describe the bug I'm working with SQL server. I've granted the user with the GRANT VIEW SERVER STATE. When i run the metadata ingestion I run into this problem:

[2024-03-20T11:15:37.230+0000] {sql_lineage.py:385} DEBUG - Running lineage with query: CREATE VIEW [dbo].[An_View] AS

SELECT DISTINCT
    ColumnA,
    ValueLiteral1 AS AliasIDType,
    'TypeLiteral1' AS AliasType,
    ColumnB,
    CASE
        WHEN ColumnCondition1 = 'Value1' THEN
            (SELECT SubColumn1 FROM SubTable1 WHERE IndustryGroupCondition = ColumnCondition2)
        WHEN ColumnCondition1 = 'Value2' THEN 'ValueLiteral2'
        ELSE 'ValueLiteral3'
    END AS AliasAssetClass,
    CASE
        WHEN RankCondition = 'Value3' AND CollatCondition = 'Value4' THEN
            (SELECT Description FROM SubTable2 WHERE NameCondition = RankCondition)
        ELSE
            (SELECT Description FROM SubTable2 WHERE NameCondition = CollatCondition)
    END AS AliasAssetRank,
    MainTable.AliasCountryCode
FROM MainTable
WHERE TimestampCondition = (
    SELECT MAX(InnerTimestamp) FROM MainTable AS InnerTable WHERE InnerTable.ColumnMatch = MainTable.ColumnMatch
)
AND MainTable.ColumnA NOT IN (
    SELECT ExclusionColumn FROM ExclusionTable WHERE MarketBookCondition AND CodeCondition <> 'ExclusionValue'
)
AND MainTable.ColumnA NOT IN (
    SELECT InnerColumnA FROM SubTable3 WHERE InnerColumnMatch = MainTable.ColumnMatch AND InnerTimestamp > MainTable.TimestampCondition
)
AND MainTable.ColumnA <> 'SpecificExclusion'

UNION

SELECT DISTINCT
    ColumnA,
    ValueLiteral4 AS AliasIDType,
    'TypeLiteral2' AS AliasType,
    ColumnC,
    CASE
        WHEN RankCondition = 'Value3' AND CollatCondition = 'Value6' THEN
            (SELECT Description FROM SubTable4 WHERE NameCondition = ColumnD)
        ELSE
            (SELECT Description FROM SubTable4 WHERE NameCondition = CollatCondition)
    END AS AliasAssetRank,
FROM SubTable5
WHERE RequestDateCondition = (
    SELECT MAX(InnerRequestDate) FROM SubTable5 AS InnerTable2 WHERE InnerTable2.ColumnMatch = SubTable5.ColumnMatch
)
AND SubTable5.ColumnA NOT IN (
    SELECT ExclusionColumn FROM ExclusionTable WHERE MarketBookCondition2
)

UNION

SELECT DISTINCT
    ExclusionTable.Code AS ColumnA,
    ValueLiteral7 AS AliasIDType,
    'TypeLiteral3' AS AliasType,
    ExclusionTable.BondsName,
    ExclusionTable.IssueAmount,
    ExclusionTable.CurrenciesShortname,
    ExclusionTable.CouponRate,
    Countries.AliasCountryShortName
FROM ExclusionTable
INNER JOIN BondsTable ON BondsCondition
WHERE ExclusionTable.MarketBook = LiteralValue8
AND ExclusionTable.Code <> 'ExclusionValue'
AND ExclusionTable.Code NOT IN (
    SELECT DISTINCT ColumnA FROM MainTable
    UNION
    SELECT DISTINCT ColumnA FROM SubTable5
);
11:15:37.560+0000] {parser.py:439} DEBUG - Lineage with SqlFluff failed for the [tsql] query: [CREATE   view dbo.AllCWDT as
(the same query again)

I also get the following error for some other queries: Lineage computed with SqlFluff did not perform as expected for the [tsql] query: [CREATE view Vw_table

These errors appear multiple times in the metadata logs. And only the lineage of view tables is present, and the lineage of regular tables is missing.

I've also observed that when a view's query employs a UNION, the lineage mapping (when it works) includes the entire query. However, if the query utilizes a JOIN that involves a subquery, the lineage is only mapped to that subquery. For example, with this query, we have the lineage for all the views that are mentioned in the query

CREATE VIEW [dbo].[vw_ARSV] AS

SELECT
    Column1, 
    Column2, 
    Column3,
    SUM(Column4),
    IIF(SUM(Column5) > 0, SUM(Column5), 0) AS POS,
    IIF(SUM(Column5) < 0, SUM(Column5), 0) AS FX
FROM Table1 AS t1
GROUP BY
    Column1,
    Column2,
    Column3

UNION ALL

SELECT
    Column1_2,
    Column2_2,
    Column3_2,
    SUM(Column4_2),
    IIF(SUM(Column5_2) > 0, SUM(Column5_2), 0) AS CMV,
    IIF(SUM(Column5_2) < 0, SUM(Column5_2), 0) AS CMVN
FROM Table2 AS t2
GROUP BY
    Column1_2,
    Column2_2,
    Column3_2

UNION ALL

SELECT
    Column1_3,
    Column2_3,
    Column3_3,
    SUM(Column4_3),
    IIF(SUM(Column5_3) > 0, SUM(Column5_3), 0) AS ABC,
    IIF(SUM(Column5_3) < 0, SUM(Column5_3), 0) AS CBA
FROM Table3 AS t3
GROUP BY
    Column1_3,
    Column2_3,
    Column3_3;

But for this query, the lineage only shows the table Table3 but not the Table1 table. It should show both. Is that related to the fact that we are using sub-queries?

CREATE VIEW [dbo].[AnFXRiskView] AS

SELECT 
    'S' AS Rubric,
    Column1,
    Column2,
    IIF(Column3 <= 90, 'Up to 3 months', 'More than 3 months') AS TT,
    NULL AS Placeholder1,
    'Long' AS PST,
    Column4,
    NULL AS Placeholder2
FROM [DatabaseSchema].[dbo].[Table1]

UNION ALL

SELECT 
    'T' AS Rubric,
    Column5,
    Column6,
    IIF(Column7 <= 90, 'Up to 3 months', 'More than 3 months') AS TT,
    Column8,
    Column9,
    CalculatedColumn10,
    CalculatedColumn11,
    CalculatedColumn12,
    CalculatedColumn13
FROM [DatabaseSchema].[dbo].[Table2] AS T
LEFT JOIN 
    (SELECT DISTINCT
        Column14,
        Column15,
        Column16,
        Column17,
        Column18,
        Column19,
        Column20,
        Column21,
        Column22
     FROM [DatabaseSchema].[dbo].[Table3]) AS P
ON 
    P.Column14 = T.Column14
    AND P.Column15 = T.Column15
    AND P.Column20 = T.TermType
    AND P.Column16 = T.Column16
    AND P.Column17 = T.Column17;

I don't understand why these errors are happening. I also leave the link to the thead in Slack for any more information you may need: https://openmetadata.slack.com/archives/C02B6955S4S/p1710935109949359

Version:

harshach commented 7 months ago

tracking here https://github.com/open-metadata/OpenMetadata/issues/7427#issuecomment-2032538352

manuelateles commented 7 months ago

I keep looking into the views and the lineage, trying to find anymore structures that aren't working and I found that if I also use this way to indicate my coluns: FROM [DBName].[SchemaName].[TableName1] ac, [DBName].[SchemaName].[TableName2] t, [DBName].[SchemaName].[TableName3] c, [DBName].[SchemaName].[TableName4] a

the lineage also doesn't work. I'll Keep posting any more finds I make