google / zetasql

ZetaSQL - Analyzer Framework for SQL
Apache License 2.0
2.32k stars 219 forks source link

Missing SqlBuilder Java Classes #8

Open christophstockhusen opened 5 years ago

christophstockhusen commented 5 years ago

Currently, the Java classes provide the possibility to analyze a SQL statement and get a resolved AST. However, the it is not possible to create a SQL statement from the resolved AST, because the required Java classes are missing (even though the corresponding C++ classes are available), i.e. SqlBuilder.java and its dependencies.

matthewcbrown commented 5 years ago

This was improved a bit in c610a21ffdc110293c1c7bd255a2674ebc7ec7a8 - you can call Analyzer.BuildExpression, but there still isn't a great way to actually modify a ResolvedExpr (if that is your goal). Can you talk about your usecase?

christophstockhusen commented 5 years ago

We are maintaining a data warehouse of dozens of BigQuery datasets with sensitive personal data of our customers. By GDPR, many of collegues do not have the permission to actually see some sensitive personal parts of the raw data (this includes parts like customer numbers and even pseudonyms), but they have the permission to analyze and work with this sensitive data to some extend. Therefore, we build our collegues a web interface (GAE ftw.) to submit their queries, which does essentially two things:

  1. It computes a data lineage of the resulting schema of the query. Based on this lineage, we are able to annotate the columns of the resulting table as sensitive or not sensitive. For instance, if you have a table customer_data that contains a sensitive column name and you submit the query SELECT CONCAT(name, "foo") AS funny_name FROM customer_data, then funny_name should be classified as sensitive.
  2. It augments the input query in order to generate some metadata for the computed cells of the resulting table. For example (very, very simplified), we allow our users to "see" a data cell if it is an aggregate of at least 20 different values. To achieve this, we compute, based on an input query
    SELECT shop_id, AVG(customer_age) FROM customer_data GROUP BY shop_id

    the new query

    SELECT shop_id, AVG(customer_age), COUNT(DISTINCT customer_age) AS hidden_count FROM customer_data GROUP BY shop_id

    and show a resulting column only if hidden_count is larger than 20. (Again, this is extremely simplified, because this is obviously by far not enough to ensure data protection.)

For both of these tasks we are currently using Apache Calcite, but the ZetaSQL dialect contains some parts (e.g. the non-standard EXCEPT in SELECT * EXCEPT(foo)) that are not supported by Calcite and are really hard or even impossible to implement using Calcite. Thus, we were very happy to see that the ZetaSQL analyzer was published here. While we nearly managed to fully implement the lineage computation using ZetaSQL (despite the missing docs), we are not able to modify the resolved AST and unparse it to an SQL statement that we can submit to BigQuery, because SQLBuilder.java and its surrounding classes are not provided.

matthewcbrown commented 4 years ago

Forgot to update this. ResolveAST Nodes now have associated builder classes. Documentation is here: ResolvedNode.java

It's still a clunky for doing whole tree transforms. Something like the DeepCopyVisitor in c++ would probably be needed.

cvonredapt commented 4 years ago

@christophstockhusen, this looks very similar to an effort we are working on. Have you have any success in this regards, yet?

janhicken commented 4 years ago

@cvonredapt I took over working on that issue from @christophstockhusen and came up with #29 to solve this issue.

Actually, the gRPC interfaces are already in place, you just have to extend the Analyzer interface to use it.

Edit: I just noticed, #15 does the same thing.

matthewcbrown commented 3 years ago

Latest release contains a rewriting visitor which can help with the read-modify-write use case. See examples in RewritingVisitorTest.java (sorry, not really any useful documentation yet)