reata / sqllineage

SQL Lineage Analysis Tool powered by Python
MIT License
1.34k stars 242 forks source link

Reserve Identifier Case While Doing Case-Insensitive Lineage Analysis #505

Open laoqi111 opened 11 months ago

laoqi111 commented 11 months ago

The SQL statement is as follows: insert into test.dbo.Students(FirstName) select FirstName from test.dbo.Teachers

The parsed result: test.dbo.students.firstname <- test.dbo.teachers.firstname

reata commented 10 months ago

This is expected behavior because by default, table/column name are case-insensitive for most SQL dialect. A table named Students or students are essentially the same. We convert everything to lower case (unless it's escaped) so that it can be referred in the SQL file interchangeably. This is not perfect but solves most of the problems.

However, if case-sensitive parsing is preferred, we can provide a separate config to allow that.

May I ask what's the SQL dialect you're using? Is it case-sensitive by default or through database configs?

laoqi111 commented 10 months ago

I need to import the blood relationship parsed from SQL Lingage into the datahub. SQL Server is indeed not case sensitive when querying, and when the datahub ingests metadata from SQL Server, it is inevitable that the original tables and fields will appear in uppercase. At this point, the blood relationships I resolved using SQL cannot be imported correctly into the datahub, where Students and students are two separate tables. Thank you very much if you could provide a parameter to distinguish between uppercase and lowercase.

------------------ 原始邮件 ------------------ 发件人: "reata/sqllineage" @.>; 发送时间: 2023年12月23日(星期六) 中午1:58 @.>; @.>;"State @.>; 主题: Re: [reata/sqllineage] Configurable Case Sensitivity for Identifier Name (Issue #505)

This is expected behavior because by default, table/column name are case-insensitive for most SQL dialect. A table named Students or students are essentially the same. We convert everything to lower case (unless it's escaped) so that it can be referred in the SQL file interchangeably. This is not perfect but solves most of the problems.

However, if case-sensitive parsing is preferred, we can provide a separate config to allow that.

May I ask what's the SQL dialect you're using? Is it case-sensitive by default or through database configs?

— Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you modified the open/close state.Message ID: @.***>

reata commented 10 months ago

Given that SQL Server is also case insensitive, then it's not a config for case sensitivity.

Instead, if we can provide a config to convert all table/column name to uppercase. Does that solve your problem?

laoqi111 commented 10 months ago

Of course, it can solve my problem. Thank you very much. I am very optimistic about the prospects of sqlliineage and hope to do better in complex SQL parsing in the future.

------------------ 原始邮件 ------------------ 发件人: "reata/sqllineage" @.>; 发送时间: 2023年12月23日(星期六) 晚上8:56 @.>; @.>;"State @.>; 主题: Re: [reata/sqllineage] Configurable Case Sensitivity for Identifier Name (Issue #505)

Given that SQL Server is also case insensitive, then it's not a config for case sensitivity.

Instead, if we can provide a config to convert all table/column name to uppercase. Does that solve your problem?

— Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you modified the open/close state.Message ID: @.***>

delphisharp commented 10 months ago

我的意见如下,请参考:

  1. sqllineage 应该是有大小写敏感功能, 提供对特定数据库(Oracle)的支持。
  2. 最后的结果转为大写,我觉得应该是外部应用需要做的工作,而不应该是sqllineage的范围。

My opinions are as follows, please refer to:

  1. sqllineage should have case-sensitive functions and provide support for a specific database (Oracle).
  2. The final result is converted to uppercase. I think it should be the work that the external application needs to do, not the scope of sqllineage.
reata commented 10 months ago

我的意见如下,请参考:

  1. sqllineage 应该是有大小写敏感功能, 提供对特定数据库(Oracle)的支持。
  2. 最后的结果转为大写,我觉得应该是外部应用需要做的工作,而不应该是sqllineage的范围。

My opinions are as follows, please refer to:

  1. sqllineage should have case-sensitive functions and provide support for a specific database (Oracle).
  2. The final result is converted to uppercase. I think it should be the work that the external application needs to do, not the scope of sqllineage.

Partially agreed. We only support case-insensitive analysis by now. This is by design because ANSI SQL standard is case-insensitive. On the implementation side, the way we achieve this is by converting all identifier names to lower case. This approach is just chosen by random. We can achieve the same goal by converting identifier to upper case, too. That is why I think we can provide this feature. (Of course, users can do this on their own)

In the long term, we should provide a configuration to allow case-sensitive analysis. That requires more change.

delphisharp commented 10 months ago

没有很理解表达的意思。 现在有二个大小写问题

  1. 数据血缘解析结果是转为大写还是小写 我觉得可以在输出结果的代码中修改

2.当构建数据血缘时是否大小写敏感 我觉得可以在model的Schema Table Column 类里面修改

不知道 是说的哪种场景还是都有?

I don't quite understand the meaning of the expression. Now there are two capitalization issues

  1. Is the result of data lineage analysis converted to uppercase or lowercase? I think it can be modified in the code that outputs the results

  2. Is it case-sensitive when constructing data lineage? I think it can be modified in the Schema Table Column class of the model

I don’t know which scene you’re talking about or are there both?

laoqi111 commented 10 months ago

What I mean is that the generated blood relation field remains the same as before.

For example:

Insert into targetTable (Name, Age) select Name, Age from sourceTable.

Generate as follows:

"SourceTable. Name ->targetTable. Name".

Not all converted to uppercase or lowercase, but kept as is.

------------------ 原始邮件 ------------------ 发件人: "reata/sqllineage" @.>; 发送时间: 2024年1月12日(星期五) 下午5:14 @.>; @.>;"State @.>; 主题: Re: [reata/sqllineage] Add a Configuration to Convert Identifier Name to Uppercase (Issue #505)

没有很理解表达的意思。 现在有二个大小写问题

数据血缘解析结果是转为大写还是小写 我觉得可以在输出结果的代码中修改

2.当构建数据血缘时是否大小写敏感 我觉得可以在model的Schema Table Column 类里面修改 不知道 是说的哪种场景还是都有?

I don't quite understand the meaning of the expression. Now there are two capitalization issues

Is the result of data lineage analysis converted to uppercase or lowercase? I think it can be modified in the code that outputs the results

Is it case-sensitive when constructing data lineage? I think it can be modified in the Schema Table Column class of the model I don’t know which scene you’re talking about or are there both?

— Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you modified the open/close state.Message ID: @.***>

reata commented 10 months ago

First of all, I think case-sensitive lineage, where Students and students are two different tables, is not in the scope of this issue. I'd like to align on that so everyone is one the same page.

As for ANSI case-insensitive rule, it only says Students and students are the same table, user can refer it in whatever case they like, even sTUdentS is still the same table. It doesn't say how you can achieve that. I can think of a few implementations:

  1. store all identifiers in lowercase, which is what we do right now.
  2. store all identifiers in uppercase. We don't have to change the current logic. Instead, we can provide a configure, so user can choose if they want uppercase of lowercase. This can be very easily done.
  3. store all identifiers as it is, do case-insensitive comparing.

Previously I was thinking of option 2, as this is a quick win.

option 3 is more desirable. But that may require more change.

@laoqi111 Can you kindly confirm you only want option 3, and that option 2 is not helping much.

laoqi111 commented 10 months ago

Yes, option 3 is what I need。

You are able to listen to the opinions of others and constantly improve, which is very admirable. I can feel that you are a group of people who truly love the program.

------------------ 原始邮件 ------------------ 发件人: "reata/sqllineage" @.>; 发送时间: 2024年1月13日(星期六) 晚上10:32 @.>; @.**@.>; 主题: Re: [reata/sqllineage] Add a Configuration to Convert Identifier Name to Uppercase (Issue #505)

First of all, I think case-sensitive lineage, where Students and students are two different tables, is not in the scope of this issue. I'd like to align on that so everyone is one the same page.

As for ANSI case-insensitive rule, it only says Students and students are the same table, user can refer it in whatever case they like, even sTUdentS is still the same table. It doesn't say how you can achieve that. I can think of a few implementations:

store all identifiers in lowercase, which is what we do right now.

store all identifiers in uppercase. We don't have to change the current logic. Instead, we can provide a configure, so user can choose if they want uppercase of lowercase. This can be very easily done.

store all identifiers as it is, do case-insensitive comparing.

Previously I was thinking of option 2, as this is a quick win.

option 3 is more desirable, and seems like what @laoqi111 are actually requiring. Can you kindly confirm?

— Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you were mentioned.Message ID: @.***>

reata commented 10 months ago

Yes, option 3 is what I need。 You are able to listen to the opinions of others and constantly improve, which is very admirable. I can feel that you are a group of people who truly love the program.

Thanks for your confirmation. I just updated the issue title and we'll discuss internally on the plan.

piekill commented 6 months ago

Current implementation also results in incorrect column lineage in some cases, e.g.

sql = """
INSERT INTO res
SELECT t."C" FROM (
    SELECT a AS "C" FROM tab1
) AS t
"""
lr = LineageRunner(sql)
lr.print_column_lineage()
lr.draw()

will produce column lineage and figure: <default>.res.C <- t.c image

reata commented 6 months ago

Current implementation also results in incorrect column lineage in some cases, e.g.

sql = """
INSERT INTO res
SELECT t."C" FROM (
    SELECT a AS "C" FROM tab1
) AS t
"""
lr = LineageRunner(sql)
lr.print_column_lineage()
lr.draw()

will produce column lineage and figure: <default>.res.C <- t.c image

This is a definitely a bug we should fix. Currently it's case-insensitive, t.c and t.C should be the same column.