google / zetasql

ZetaSQL - Analyzer Framework for SQL
Apache License 2.0
2.28k stars 214 forks source link

No way to determine alias of column in CTE #104

Closed tha23rd closed 2 years ago

tha23rd commented 2 years ago

Hi all, we are trying to determine how to connect the dots between output columns and columns they reference in a CTE that are aliased using the Java version of ZetaSQL

Here's an example query:

WITH
  test_cte AS (
  SELECT
    user_id AS uid
  FROM
    my_table)
SELECT
  uid
FROM
  test_cte

The parsed AST can be seen in this gist: https://gist.github.com/tha23rd/7ac2737cfca9a2580a3ce4726e88c6b0

My question is how we connect the output column "uid" to the column "user_id" in the CTE? Looking at the AST, there's no field for the alias like there is in the high level "outputColumnsList" field.

Thanks!

matthewcbrown commented 2 years ago

Short answer is, there is no direct representation of the alias name in the resolved AST.

Follow up question, what is your actual use case? Why do you want this alias?

Long answer:

Googlesql doesn't guarantee that it preserves intermediate names during resolution. This will usually be preserved in a referencing column (if it is referenced) as part of a ResolvedColumn.getName(). Howevever, this is considered debugging information, and is not guaranteed to be preserved/stable over time. On the c++ side, this field is labeled to have no guaranteed value or meaning, but usually contains some useful information. Only the 'id' is guaranteed to be meaningful, but less so in the specific case of a ResolvedWithRefScan it relies on aligning the column_list of the refercing and referenced scans:

bazel run //zetasql/tools/execute_query -- --mode=resolve  '
WITH 
  my_table as (select 1 as user_id),
  test_cte AS (SELECT user_id AS uid, 5 as unused FROM my_table)
SELECT uid FROM test_cte;

Prints out:

QueryStmt
+-output_column_list=
| +-test_cte.uid#4 AS uid [INT64]
+-query=
  +-WithScan
    +-column_list=[test_cte.uid#4]
    +-with_entry_list=
    | +-WithEntry
    | | +-with_query_name="my_table"
    | | +-with_subquery=
    | |   +-ProjectScan
    | |     +-column_list=[my_table.user_id#1]
    | |     +-expr_list=
    | |     | +-user_id#1 := Literal(type=INT64, value=1)
    | |     +-input_scan=
    | |       +-SingleRowScan
    | +-WithEntry
    |   +-with_query_name="test_cte"
    |   +-with_subquery=
    |     +-ProjectScan
    |       +-column_list=[my_table.user_id#2, test_cte.unused#3]
    |       +-expr_list=
    |       | +-unused#3 := Literal(type=INT64, value=5)
    |       +-input_scan=
    |         +-WithRefScan(column_list=[my_table.user_id#2], with_query_name="my_table")
    +-query=
      +-ProjectScan
        +-column_list=[test_cte.uid#4]
        +-input_scan=
          +-WithRefScan(column_list=test_cte.[uid#4, unused#5], with_query_name="test_cte")

So, in WithRefScan we align the column_list with test_cte's (output) column_list we get: WithEntry.ProjectScan: column_list=[my_table.user_id#2, test_cte.unused#3] WithRefScan: column_list=test_cte.[uid#4, unused#5]

Thus, we can link together uid#4 -> my_table.user_id#2 because they are both at offset 0 in the two arrays. Normally ids can be used for such linkage (such as if you rewrite this to use sub-queries rather than WITH).

While you can extract the mapping of the alias name here, it isn't recommended, and cannot be relied on in general.

Documentation of ResolvedWithRefScan Documentation of c++ ResolvedColumn::name

tha23rd commented 2 years ago

Hey Matthew!

Thanks so much for the detailed response. To answer your initial question, our goal is to create column level lineage for our tables in BigQuery.

The solution you outlined is exactly what we ended up doing, someone on my team just happened to notice that the indices in the column list were the same :)

With this, we more or less have a working column level lineage solution by traversing the tree and creating an adjacency list for the main set of columns and for each CTE that's present. We then merge these adjacency lists together using the index mentioned above.

The only thing we aren't 100% sure about is how to map our UDFs to Functions in zetaSQL, but I'll open another issue for that. Thanks again for taking the time to respond!!!