reata / sqllineage

SQL Lineage Analysis Tool powered by Python
MIT License

Column Lineage doesn't work properly for Spark StructType columns #411

Open keen85 opened 1 year ago

keen85 commented 1 year ago

Context

Spark / Spark SQL supports nested data types. So-called StructTypes are basically containers for other attributes.

So the schema of a Spark table/view is not necessarily a flat list but can be a tree:

root
 |-- some_int: integer (nullable = false)
 |-- some_string: string (nullable = false)
 |-- some_struct: struct (nullable = false)
 |    |-- int_in_struct: integer (nullable = false)
 |    |-- long_in_struct: long (nullable = false)

In this example, int_in_struct is a child attribute of some_struct.

Problem

I noticed that column level lineage does not work properly for these nested attributes.

Minimal example to demonstrate the issue

from sqllineage.runner import LineageRunner

sql = """
CREATE OR REPLACE TEMPORARY VIEW sqllineage_source AS
SELECT
  1 AS `some_int`,
  'foo' AS `some_string`,
  named_struct('int_in_struct', 2, 'long_in_struct', CAST(3 AS LONG)) AS `some_struct`
;

CREATE OR REPLACE TEMPORARY VIEW sqllineage_target AS
SELECT
  `some_int`,
  `some_struct`.`int_in_struct` + `some_struct`.`long_in_struct` AS `calculated_sum`
FROM
  sqllineage_source
;
"""
result = LineageRunner(sql, dialect='sparksql')
result.print_column_lineage()
"""
result = LineageRunner(sql, dialect='sparksql')
result.print_column_lineage()

actual output

<default>.sqllineage_target.calculated_sum <- <default>.some_struct.int_in_struct
<default>.sqllineage_target.calculated_sum <- <default>.some_struct.long_in_struct
<default>.sqllineage_target.some_int <- <default>.sqllineage_source.some_int

Note that the name of the source view (sqllineage_source) is not listed in the lineage of column calculated_sum.

expected output

<default>.sqllineage_target.calculated_sum <- <default>.sqllineage_source.some_struct.int_in_struct
<default>.sqllineage_target.calculated_sum <- <default>.sqllineage_source.some_struct.long_in_struct
<default>.sqllineage_target.some_int <- <default>.sqllineage_source.some_int
reata commented 1 year ago

This is a known issue: we cannot handle complex types.

Merely looking at the SQL as text:

SELECT
  `some_int`,
  `some_struct`.`int_in_struct` + `some_struct`.`long_in_struct` AS `calculated_sum`
FROM
  sqllineage_source

We don't know whether some_struct is a struct column, or whether this is semantically incorrect SQL that refers to another table called some_struct. In your case we could infer it from the preceding temporary view statement, but that approach is not universal enough.
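The ambiguity can be illustrated with a toy resolver (hypothetical sketch, not sqllineage code): only with table metadata at hand can a dotted reference like `a.b` be classified as a table-qualified column versus a struct field access.

```python
# Hypothetical sketch: resolving the `a.b` ambiguity requires metadata.
# Given known tables and their columns, classify a dotted reference.

def classify_reference(ref: str, tables: dict[str, set[str]]) -> str:
    """tables maps table/view name -> set of its column names."""
    qualifier, _, attr = ref.partition(".")
    if qualifier in tables and attr in tables[qualifier]:
        return "table-qualified column"   # e.g. sqllineage_source.some_int
    for columns in tables.values():
        if qualifier in columns:
            return "struct field access"  # e.g. some_struct.int_in_struct
    return "unresolved"

# With metadata for sqllineage_source, the reference resolves correctly:
meta = {"sqllineage_source": {"some_int", "some_string", "some_struct"}}
print(classify_reference("some_struct.int_in_struct", meta))
# -> struct field access
print(classify_reference("sqllineage_source.some_int", meta))
# -> table-qualified column
```

Without the `meta` mapping, both readings of `some_struct.int_in_struct` are plausible, which is exactly why the text alone is insufficient.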

I would postpone dealing with this until we bring in metadata integration.

Update: this is somewhat related to #303

keen85 commented 1 year ago

Hi @reata, thanks for clarifying.

.. until we bring in metadata integration.

Is that planned?

reata commented 1 year ago

Hi @reata, thanks for clarifying.

.. until we bring in metadata integration.

Is that planned?

Yes, that's the major feature we're trying to bring in v1.5.x.

keen85 commented 5 months ago

Hi @reata, version 1.5.1 was released recently. Will it be possible to cover my use case from above now? 😅

reata commented 5 months ago

We'll start considering it. It might not be prioritized yet, as there are still more metadata-related issues to be handled.

zw620 commented 3 months ago

Hi @reata, may I ask when you will release this feature (sqllineage support for StructType columns)? Thanks a lot.

reata commented 3 months ago

Hi @reata, may I ask when you will release this feature (sqllineage support for StructType columns)? Thanks a lot.

With metadata provider, it's theoretically possible to support this.

For a struct defined in DDL, it's declared this way:

some_struct struct<int_in_struct int, long_in_struct long>

For a struct defined in DML, it's built with the named_struct function (or implicitly with the struct function):

named_struct('int_in_struct', 2, 'long_in_struct', CAST(3 AS LONG)) AS `some_struct`

For DDL we might need a separate parser for this grammar; for DML it's easier, since we can treat named_struct as a regular function.
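To show what such a DDL parser might involve, here is a rough sketch (a hypothetical helper, not part of sqllineage) that splits the `struct<...>` grammar into fields while respecting nested angle brackets:

```python
# Rough sketch of parsing Spark's struct<...> DDL grammar into nested
# field lists. Hypothetical helper, not sqllineage API.

def parse_struct(ddl: str):
    """Parse 'struct<a int, b struct<c long>>' into [(name, type), ...];
    a nested struct type is itself a parsed list of fields."""
    inner = ddl[len("struct<"):-1]  # strip the outer struct< ... >
    # split on commas at nesting depth 0 only
    parts, depth, start = [], 0, 0
    for i, ch in enumerate(inner):
        if ch == "<":
            depth += 1
        elif ch == ">":
            depth -= 1
        elif ch == "," and depth == 0:
            parts.append(inner[start:i])
            start = i + 1
    parts.append(inner[start:])
    fields = []
    for part in parts:
        name, type_str = part.strip().split(" ", 1)
        type_str = type_str.strip()
        if type_str.startswith("struct<"):
            fields.append((name, parse_struct(type_str)))  # recurse into nested struct
        else:
            fields.append((name, type_str))
    return fields

print(parse_struct("struct<int_in_struct int, long_in_struct long>"))
# -> [('int_in_struct', 'int'), ('long_in_struct', 'long')]
```

A production parser would also need to handle maps, arrays, backquoted field names, and the colon-separated variant used by simpleString(), which is where a dedicated grammar starts to pay off.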

We also need to think about complex data type support in other SQL dialects, which may not work exactly the same way as SparkSQL; we don't want the solution to be very specific to SparkSQL.

So to summarize, we don't yet have a solid design to get this started.

keen85 commented 3 months ago

Hi @reata , not sure if this helps, but in (Py)Spark there are several representations of a DataFrame schema. Maybe one is more suitable than the DDL struct representation?

setup

spark.sql("""
CREATE OR REPLACE TEMPORARY VIEW sqllineage_source AS
SELECT
  1 AS `some_int`,
  'foo' AS `some_string`,
  named_struct('int_in_struct', 2, 'long_in_struct', CAST(3 AS LONG)) AS `some_struct`
;
""")

schema representations

StructType object

DataFrame.schema

spark.table("sqllineage_source").schema
>>> StructType([StructField('some_int', IntegerType(), False), StructField('some_string', StringType(), False), StructField('some_struct', StructType([StructField('int_in_struct', IntegerType(), False), StructField('long_in_struct', LongType(), False)]), False)])

DDL-similar short string

StructType.simpleString()

spark.table("sqllineage_source").schema.simpleString()
>>> 'struct<some_int:int,some_string:string,some_struct:struct<int_in_struct:int,long_in_struct:bigint>>'

JSON-string

StructType.json()

spark.table("sqllineage_source").schema.json()
>>> '{"fields":[{"metadata":{},"name":"some_int","nullable":false,"type":"integer"},{"metadata":{},"name":"some_string","nullable":false,"type":"string"},{"metadata":{},"name":"some_struct","nullable":false,"type":{"fields":[{"metadata":{},"name":"int_in_struct","nullable":false,"type":"integer"},{"metadata":{},"name":"long_in_struct","nullable":false,"type":"long"}],"type":"struct"}}],"type":"struct"}'

dict

StructType.jsonValue()

spark.table("sqllineage_source").schema.jsonValue()
>>> {'type': 'struct',
 'fields': [{'name': 'some_int',
   'type': 'integer',
   'nullable': False,
   'metadata': {}},
  {'name': 'some_string', 'type': 'string', 'nullable': False, 'metadata': {}},
  {'name': 'some_struct',
   'type': {'type': 'struct',
    'fields': [{'name': 'int_in_struct',
      'type': 'integer',
      'nullable': False,
      'metadata': {}},
     {'name': 'long_in_struct',
      'type': 'long',
      'nullable': False,
      'metadata': {}}]},
   'nullable': False,
   'metadata': {}}]}

SparkSQL DDL string

spark.sparkContext._jvm.org.apache.spark.sql.types.DataType.fromJson(spark.table("sqllineage_source").schema.json()).toDDL()
>>> 'some_int INT NOT NULL,some_string STRING NOT NULL,some_struct STRUCT<int_in_struct: INT, long_in_struct: BIGINT> NOT NULL'
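Whichever representation is chosen, the consumer would eventually need the nested schema flattened into dotted field paths, which is roughly what column lineage for nested fields would emit. A minimal sketch (hypothetical, using the jsonValue() dict shown above):

```python
# Sketch: flattening a StructType.jsonValue() dict into dotted leaf paths,
# e.g. 'some_struct.int_in_struct'. Hypothetical helper, not sqllineage API.

def flatten_fields(schema: dict, prefix: str = "") -> list[str]:
    paths = []
    for field in schema["fields"]:
        path = f"{prefix}{field['name']}"
        # a nested struct appears as a dict with type == 'struct'
        if isinstance(field["type"], dict) and field["type"].get("type") == "struct":
            paths.extend(flatten_fields(field["type"], prefix=f"{path}."))
        else:
            paths.append(path)
    return paths

# the jsonValue() dict for sqllineage_source from above:
schema = {
    "type": "struct",
    "fields": [
        {"name": "some_int", "type": "integer", "nullable": False, "metadata": {}},
        {"name": "some_string", "type": "string", "nullable": False, "metadata": {}},
        {"name": "some_struct", "nullable": False, "metadata": {},
         "type": {"type": "struct", "fields": [
             {"name": "int_in_struct", "type": "integer", "nullable": False, "metadata": {}},
             {"name": "long_in_struct", "type": "long", "nullable": False, "metadata": {}},
         ]}},
    ],
}
print(flatten_fields(schema))
# -> ['some_int', 'some_string', 'some_struct.int_in_struct', 'some_struct.long_in_struct']
```

This only emits leaf fields; a lineage tool might also want to track the struct column itself as an intermediate node.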
reata commented 2 months ago

I consider this issue as the starting point for supporting complex data type column lineage. Going for DDL is the more universal way. I don't want the implementation to be coupled with Spark despite the fact that we might start small with limited support for SparkSQL at the very beginning.

ANSI SQL didn't include a JSON data type until SQL:2023, and that is still a semi-structured data type. Structured data types are not mentioned, so implementations differ among dialects, for example:

BigQuery shares the exact same pattern as SparkSQL: https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types#declaring_a_struct_type

Snowflake names it Object instead of Struct, and the grammar is slightly different: https://docs.snowflake.com/en/sql-reference/data-types-structured

PostgreSQL and Oracle require users to define types separately, so our solution would likely be very different (if we ever want to support them): https://www.postgresql.org/docs/current/rowtypes.html https://docs.oracle.com/en/database/oracle/oracle-database/21/adobj/key-features-object-relational-model.html#GUID-9256ECD9-596B-4ABF-AA79-1890BE28F4F7

MySQL and T-SQL only support JSON; structured data types are not supported.

Hope this clarifies why this issue has been put on hold.