Open keen85 opened 1 year ago
This is a known issue: we cannot handle complex types.
By merely looking at the SQL as text:
SELECT
`some_int`,
`some_struct`.`int_in_struct` + `some_struct`.`long_in_struct` AS `calculated_sum`
FROM
sqllineage_source
we can't tell whether some_struct is a struct column, or whether this is just semantically incorrect SQL that refers to another table called some_struct. For your case, we can infer from the previous temporary view statement, but that is not universal enough.
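To make the ambiguity concrete, here is a rough sketch (all names are hypothetical, this is not sqllineage code) of how schema metadata could disambiguate a dotted reference: with column types available, we can check whether the left part names a table or a struct-typed column.

```python
def resolve_qualified_name(qualified: str, tables: dict) -> str:
    """Decide whether a dotted reference like 'some_struct.int_in_struct'
    means table.column or struct_column.field, given schema metadata.
    Hypothetical helper for illustration only.

    tables maps table name -> {column_name: type_string}.
    """
    left, _, right = qualified.partition(".")
    # Case 1: left part is a known table with a matching column.
    if left in tables and right in tables[left]:
        return "table.column"
    # Case 2: left part is a struct-typed column in some table whose
    # declaration mentions the field (crude substring check for a sketch).
    for cols in tables.values():
        typ = cols.get(left, "")
        if typ.startswith("struct<") and right in typ:
            return "struct_column.field"
    return "unresolved"
```

Without the `tables` metadata, both interpretations are plausible, which is exactly why a text-only parse cannot decide.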
I would postpone dealing with this until we bring in metadata integration.
Update: this is somewhat related to #303
Hi @reata, thanks for clarifying.
.. until we bring in metadata integration.
Is that planned?
Yes, that's the major feature we're trying to bring in v1.5.x.
Hi @reata, version 1.5.1 was released recently. Will it be possible to cover my use case from above now? 😅
We'll start considering it. It might not be prioritized just yet, as there are still more metadata-related issues to be handled.
Hi @reata, may I ask that when will you release this feature: sqllineage support StructType columns? Thanks a lot.
With a metadata provider, it's theoretically possible to support this.
For a struct defined in DDL, it looks like this:
some_struct struct<int_in_struct int, long_in_struct long>
For a struct defined in DML, it's created with the named_struct function (or implicitly with the struct function):
named_struct('int_in_struct', 2, 'long_in_struct', CAST(3 AS LONG)) AS `some_struct`
For DDL we might need a separate parser for this grammar; for DML it's easier since we can treat it just as a regular function.
We also need to think about complex data type support in other SQL dialects, which may not be exactly the same as SparkSQL's, and we don't want this solution to be tied to SparkSQL.
So to summarize, we don't have a very solid design to get this started yet.
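As a rough illustration of what such a separate DDL grammar parser might do, here is a minimal sketch (a hypothetical helper, not part of sqllineage) that splits a SparkSQL struct<...> declaration into field names and types, tracking bracket depth so that nested structs stay intact:

```python
def parse_struct_fields(decl: str) -> dict:
    """Split a SparkSQL struct<name type, ...> declaration into a
    {field_name: type_string} mapping.  Hypothetical sketch, not
    sqllineage code; nesting is preserved by tracking bracket depth."""
    inner = decl.strip()
    if not (inner.lower().startswith("struct<") and inner.endswith(">")):
        raise ValueError(f"not a struct declaration: {decl!r}")
    inner = inner[len("struct<"):-1]  # drop the outer struct< ... >
    fields, depth, current = {}, 0, ""
    for ch in inner:
        if ch == "," and depth == 0:
            # Top-level comma: the accumulated text is one "name type" pair.
            name, _, typ = current.strip().partition(" ")
            fields[name] = typ.strip()
            current = ""
            continue
        if ch == "<":
            depth += 1
        elif ch == ">":
            depth -= 1
        current += ch
    if current.strip():
        name, _, typ = current.strip().partition(" ")
        fields[name] = typ.strip()
    return fields
```

A real implementation would of course need the full type grammar (maps, arrays, decimal precision, etc.), but the shape of the problem is the same.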
Hi @reata, not sure if this helps, but in (Py)Spark there are several representations of a DataFrame schema. Maybe one of them is more suitable than the DDL struct representation?
spark.sql("""
CREATE OR REPLACE TEMPORARY VIEW sqllineage_source AS
SELECT
1 AS `some_int`,
'foo' AS `some_string`,
named_struct('int_in_struct', 2, 'long_in_struct', CAST(3 AS LONG)) AS `some_struct`
;
""")
StructType object
spark.table("sqllineage_source").schema
>>> StructType([StructField('some_int', IntegerType(), False), StructField('some_string', StringType(), False), StructField('some_struct', StructType([StructField('int_in_struct', IntegerType(), False), StructField('long_in_struct', LongType(), False)]), False)])
StructType.simpleString()
spark.table("sqllineage_source").schema.simpleString()
>>> 'struct<some_int:int,some_string:string,some_struct:struct<int_in_struct:int,long_in_struct:bigint>>'
StructType.json()
spark.table("sqllineage_source").schema.json()
>>> '{"fields":[{"metadata":{},"name":"some_int","nullable":false,"type":"integer"},{"metadata":{},"name":"some_string","nullable":false,"type":"string"},{"metadata":{},"name":"some_struct","nullable":false,"type":{"fields":[{"metadata":{},"name":"int_in_struct","nullable":false,"type":"integer"},{"metadata":{},"name":"long_in_struct","nullable":false,"type":"long"}],"type":"struct"}}],"type":"struct"}'
StructType.jsonValue()
spark.table("sqllineage_source").schema.jsonValue()
>>> {'type': 'struct',
'fields': [{'name': 'some_int',
'type': 'integer',
'nullable': False,
'metadata': {}},
{'name': 'some_string', 'type': 'string', 'nullable': False, 'metadata': {}},
{'name': 'some_struct',
'type': {'type': 'struct',
'fields': [{'name': 'int_in_struct',
'type': 'integer',
'nullable': False,
'metadata': {}},
{'name': 'long_in_struct',
'type': 'long',
'nullable': False,
'metadata': {}}]},
'nullable': False,
'metadata': {}}]}
spark.sparkContext._jvm.org.apache.spark.sql.types.DataType.fromJson(spark.table("sqllineage_source").schema.json()).toDDL()
>>> 'some_int INT NOT NULL,some_string STRING NOT NULL,some_struct STRUCT<int_in_struct: INT, long_in_struct: BIGINT> NOT NULL'
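The same conversion can also be sketched without reaching into the JVM. Here is a rough pure-Python version (the helper names are made up and the type-name mapping is deliberately partial) that walks the schema.json() output above and emits the equivalent DDL string:

```python
import json

# Partial mapping from Spark's JSON type names to DDL type names.
# Hypothetical sketch; a real version would cover all Spark types.
_DDL_NAMES = {"integer": "INT", "long": "BIGINT", "string": "STRING"}

def type_to_ddl(t) -> str:
    """Render one type node: either a plain type name or a nested struct."""
    if isinstance(t, dict) and t.get("type") == "struct":
        inner = ", ".join(
            f"{f['name']}: {type_to_ddl(f['type'])}" for f in t["fields"]
        )
        return f"STRUCT<{inner}>"
    return _DDL_NAMES.get(t, str(t).upper())

def schema_to_ddl(schema_json: str) -> str:
    """Convert the output of StructType.json() to a DDL column list."""
    schema = json.loads(schema_json)
    parts = []
    for f in schema["fields"]:
        null = "" if f.get("nullable", True) else " NOT NULL"
        parts.append(f"{f['name']} {type_to_ddl(f['type'])}{null}")
    return ",".join(parts)
```

Fed the schema.json() string from above, this reproduces the same DDL string as the JVM round-trip.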
I consider this issue the starting point for supporting complex data type column lineage. Going for DDL is the more universal way. I don't want the implementation to be coupled with Spark, even though we might start small with limited support for SparkSQL at the very beginning.
ANSI SQL didn't include a JSON data type until SQL:2023, and even that is a semi-structured data type. Structured data types are not mentioned at all, so implementations differ among dialects, for example:
BigQuery shares the exact same pattern as SparkSQL: https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types#declaring_a_struct_type
Snowflake names it Object instead of Struct, and its grammar is slightly different: https://docs.snowflake.com/en/sql-reference/data-types-structured
PostgreSQL and Oracle require users to define the type separately, so our solution would likely be very different (if we ever want to support that): https://www.postgresql.org/docs/current/rowtypes.html https://docs.oracle.com/en/database/oracle/oracle-database/21/adobj/key-features-object-relational-model.html#GUID-9256ECD9-596B-4ABF-AA79-1890BE28F4F7
MySQL and T-SQL only support JSON; structured data types are not supported.
Hope this clarifies why this issue has been put on hold.
Context
Spark / Spark SQL supports nested data types. So-called StructTypes are basically containers for other attributes, so the schema of a Spark table/view is not necessarily a flat list but can be a tree:
In this example, int_in_struct is a child attribute of some_struct.
Problem
I noticed that column level lineage does not work properly for these nested attributes.
Minimal example to demonstrate the issue
actual output
Note that the name of the source view (sqllineage_source) is not listed in the lineage of column calculated_sum.
expected output