google / zetasql

ZetaSQL - Analyzer Framework for SQL
Apache License 2.0
2.29k stars 216 forks source link

How to use? #4

Open gradient-zero opened 5 years ago

gradient-zero commented 5 years ago

Could you easily explain how to use? How to connect the parser to my Command line interface Is there any sample? Thanks

MartinSahlen commented 5 years ago

Hey @gradient-zero , its quite lacking of docs, but this project has implemented a formatter API, maybe this (https://github.com/apstndb/zetasql-format-server) can guide you along the way (I'm still wrapping my head around it myself but got it working eventually). I think you might need to downgrade to bazel 0.25

ShreyaGupta08 commented 5 years ago

After 'building' zetasql in my local machine, how do I set up the environment to create tables and start running queries?

matthewcbrown commented 5 years ago

We are still in the process of releasing this code. It doesn't (yet) include the 'reference implementation' which would allow testing simple queries.

ShreyaGupta08 commented 5 years ago

Please correct me if I'm wrong but what I am inferring from this is that zetaSQL in its current form has all the documentation on the format to create tables, write queries etc. but no way to implement them?

MartinSahlen commented 5 years ago

@ShreyaGupta08 when you say "create tables, write queries", etc - what do you mean? Zetasql in its current form is an framework for parsing and analyzing SQL statements. So you need to provide it with a catalog of tables, columns and functions to make it work for your case. A good place to start is here: https://github.com/google/zetasql/blob/master/zetasql/analyzer/resolver_test.cc

apilloud commented 5 years ago

Currently all that is in ZetaSQL is a parser, it can't execute the SQL queries. One of my coworkers is working on hooking it up to Apache Beam SQL which is able to execute queries. You can take a look here for an example: https://github.com/apache/beam/pull/9210

vicknite commented 5 years ago

Can i use the parser inside zetasql as library ?

matthewcbrown commented 5 years ago

The parser directory is considered private (as is the parser-based AST). However you can use the methods found in public/parse_helpers.h to do some parsing operations. Running the analyzer also runs the parser and provides an abstract syntax tree you can do more interesting stuff with. See public/analyzer.h. In particular zetasql::AnalyzerStatement can return an AST.

MartinSahlen commented 5 years ago

You can also fork it and change the package visibility if you want to play around with the internals of zetasql.

apstndb commented 5 years ago

I am playing with zetasql in apstndb/zetasql-format-server and I have recently updated to use tag/2019.07.1. SQLFormatter is now implemented as public visibility so zetasql-format-server is simpler than before and forking zetasql(apstndb/zetasql) is not needed anymore.

If more functions become accessible through public interface and examples are added, it is nice for me.

matthewcbrown commented 5 years ago

You can issue (very) simple queries via: bazel run //zetasql/experimental/execute_query -- "SELECT 1"

It's very limited, in particular it doesn't support any data sources (read or write), and most of the functions are lacking, but it's an example of how to tie together the (recently released) evaluator to do some work.

yingfeng commented 4 years ago

Could some example be provided as a reference ? For example, how a Key-value store such as LevelDB could support ZetaSQL dialect ?

fstrati commented 4 years ago

I'm interested in ZetaSQL and me too I think the documentation might be expanded a bit to cover interesting use cases. i.e. what is the rationale behind ZetaSQL ? does it aim to cover some SQL standard ? What query engines might be used with it right now ? What is the canonical way to "glue" an existing query engine ? What examples are available ?

unusualmutant commented 4 years ago

I've tried cloning and building on Ubuntu 18.04 using "bazel build ...", but it errors for me. Are there som instructions or requirements missing?

The error I get is:

ERROR: /home/bjorn_velliv_dk/zetasql/java/com/google/zetasql/BUILD:216:1: Building java/com/google/zetasql/libclient.jar (29 source files) and running annotation processors (AutoAnnotationProcessor, AutoOneOfProcessor, AutoValueProcessor) failed (Exit 1) bazel-out/k8-fastbuild/bin/java/com/google/zetasql/_javac/client/libclient_sourcegenfiles/com/google/zetasql/AutoValue_ParseLocationRange.java:7: error: [strict] Using type javax.annotation.Generated from an indirect dependency (TOOL_INFO: "@javax_annotation_javax_annotation_api//:javax_annotation_javax_annotation_api"). See command below ** @Generated("com.google.auto.value.processor.AutoValueProcessor") ^ ** Please add the following dependencies: @javax_annotation_javax_annotation_api//:javax_annotation_javax_annotation_api to //java/com/google/zetasql:client ** You can use the following buildozer command: buildozer 'add deps @javax_annotation_javax_annotation_api//:javax_annotation_javax_annotation_api' //java/com/google/zetasql:client

fstrati commented 4 years ago

I've successfully built on ubuntu 18.04 with stock g++-7, java and python. I have met the very same error as @unusualmutant, the patch has been to modify the file: java/com/google/zetasql/BUILD adding the missing dependency at line 244, as in what follows:

    216 java_library(
    217     name = "client",
    218     srcs = CLIENT_SRCS,
    219     deps = [
    220         ":channel_provider",
    221         ":types",
    222         "//zetasql/local_service:local_service_java_grpc",
           [snip....]
    241         "@com_google_guava_guava//jar",
    242         "@com_google_protobuf//:protobuf_java",
    243         "@io_grpc_grpc_core//jar",
    244         "@javax_annotation_javax_annotation_api//:javax_annotation_javax_annotation_api"
    245     ],
    246 )

in other words, with git diff you have:

fstrati@fstrati-HP-Laptop-15-db0xxx:~/eclipse-workspace/zetasql$ git diff java/com/google/zetasql/BUILD
diff --git a/java/com/google/zetasql/BUILD b/java/com/google/zetasql/BUILD
index 99c0697..78f0d42 100644
--- a/java/com/google/zetasql/BUILD
+++ b/java/com/google/zetasql/BUILD
@@ -241,6 +241,7 @@ java_library(
         "@com_google_guava_guava//jar",
         "@com_google_protobuf//:protobuf_java",
         "@io_grpc_grpc_core//jar",
+        "@javax_annotation_javax_annotation_api//:javax_annotation_javax_annotation_api"
     ],
 )

Hope that helps :-)

BTW, on my machine the compilation took quite some time and bazel was eager of resources leading to the stall of the window manager, if it seems to get stuck... leave it alone for a coffee break and come back after 20 minutes...

unusualmutant commented 4 years ago

You're a star @fstrati, that worked, thanks a lot!

unusualmutant commented 4 years ago

I apologize in advance for my ignorance. I'm looking for something which is able to take a standard sql statement (currently used in BigQuery), and give me an AST, so that I can work out what table/field inputs are used and what output (names) they map to. I was hoping that zetasql would do the job, but while I'm able to run the test statement:

bazel run //zetasql/experimental:execute_query -- "select 1 + 1;"

I'm unsure how to move on from there. Is there anything in the current implementation that will allow me to do what I described, or do I need to build it myself using zetasql? Or have I completely misunderstood the purpose/functionality of zetasql?

matthewcbrown commented 4 years ago

execute_query has a mode to dump the ast as text (--mode=resolve), you can also try --help to see more info. bazel run //zetasql/experimental:execute_query -- --mode=resolve "select 1 + 1;"

However, zetasql's reference implementation (which back 'execute_query') doesn't have the same understanding of tables as BigQuery - i.e. there is no way to actually teach it which tables exist, or what their schema is - other than, perhaps creating dummy tables using 'With' statements: 'WITH Table AS (SELECT CAST(NULL AS STRING) x) SELECT x FROM Table;'

unusualmutant commented 4 years ago

Thank you very much Matthew. Again I must flaunt my lack of comprehension, but it sounds to me like zetasql is not something I can use for the purpose I described? To be of any use, one must integrate zetasql with some sql engine, is that correctly understood? Otherwise I fail to understand what the use is since I can't get the AST for a real-life query like SELECT t1.a from table1 t1.

MartinSahlen commented 4 years ago

@unusualmutant If you just want this to work with BigQuery, a dry run (and / or the query statistics if the query has not been retrieved from cache) will give you information about what tables are being referenced.

This is also easily done using the extractTableNamesFromStatement https://github.com/google/zetasql/blob/aac17f8352ab620afbf6124d7193ccedc79ec786/java/com/google/zetasql/Analyzer.java#L111 as well.

If you also want to extract column names, that gets a bit trickier, as you need to keep track of derived columns from CTEs etc etc. My suggestion here (if you have a big catalog of tables and columns) would probably be to first extract the table names in use, and then look those up in the database catalog. Then, you would map those tables into a SimpleCatalog (https://github.com/google/zetasql/blob/master/java/com/google/zetasql/SimpleCatalog.java). After that, run a analyzeStatement https://github.com/google/zetasql/blob/aac17f8352ab620afbf6124d7193ccedc79ec786/java/com/google/zetasql/Analyzer.java#L49 with the catalog.

Then, after that, implement a Visitor (https://github.com/google/zetasql/blob/109c7cdb95d7e16784e05a9293a87b129b5312fd/javatests/com/google/zetasql/VisitorTest.java#L62) to visit the nodes and do something interesting with them.

Maybe a naive approach, but should give you what you want with some trial and error?

MartinSahlen commented 4 years ago

Also will add that using zetasql in java (only works on linux in my experience but maybe it's fixed now) is a lot smoother than building from source, you just add it as a maven /gradle dependency https://mvnrepository.com/artifact/com.google.zetasql. Happy to share my setup if anyone is interested.

unusualmutant commented 4 years ago

Thank you very much for that explanation. appreciate it. I would very much like to have your setup if possible.

MartinSahlen commented 4 years ago

Thank you very much for that explanation. appreciate it. I would very much like to have your setup if possible.

I will try to get a repo running by the end of this week. In the meantime, its easy enough to get started like this (gradle or maven):

  1. Add all of these to your gradle or pom: https://mvnrepository.com/artifact/com.google.zetasql
  2. Create a class file that contains a public static void main as per the java convention:
import com.google.zetasql.Analyzer

public class MainClass {
    public static void main( String args[] ){
       //this will be a list of tables, where each sublist is the components of the table path, i.e
      // [[projectID1,datasetId1,tableId1],[projectID2,datasetId2,tableI2]] in the case of a bigquery query
        List<List<String>> tables = Analyzer.extractTableNamesFromStatement("SELECT x from y");

  // now we can use these to create our catalog and feed that to an analyzer instance
// or we could construct and cache the entire catalog beforehand as a singleton.
    }
}

Run the file (easy enough in intelliJ or Eclipse). The only caveat is you need to get this into docker if the linux-only issue is not fixed. But the steps are quite simple here too.

A lot of "go to definition" is encouraged, but the code is quite well organized and written so it shouldnt be a huge hassle. Most of the time will be spent on "glueing" and mapping data types, functions etc into zetasql conventions, which might not be straightforward.

If you are only targeting bigquery, I would use zetasql. If you are targeting multiple DBs, my tip is to look into their own specific parsers, query planners etc to get the best and correct results. They all have subtle differences that make it extremely difficult to have a "master parser". You will be able parse queries that conform to ANSI correctly with zetaSQL, but for any extension or difference, it will fail, which is quite annoying (at least in my experience so far with zetasql).

I'm not sure if Google is using zetasql for anyhting other than bigquery, but I don't think so. that might just my speculation though, so happy if we can get any information from the googlers around here.

unusualmutant commented 4 years ago

Thanks again, you're a huge help. I probably won't have time to try it out before the weekend, so maybe you'll beat me to it :)

I'm only targeting BigQuery, so that is a blessing. I'll try intelliJ on windows, and if it doesn't work, I'll boot up in Ubuntu.

apilloud commented 4 years ago

ZetaSQL Java only runs on Linux at the moment.

ZetaSQL is the standard for writing SQL engines inside Google. Externally available products include BigQuery, Spanner, and Dataflow SQL. The current focus of the ZetaSQL effort is to enable these products to release open source tools. Beam SQL (which powers Dataflow SQL) is the first open source use of ZetaSQL, the integration can be found here: https://github.com/apache/beam/tree/master/sdks/java/extensions/sql/zetasql

unusualmutant commented 4 years ago

I have now got it working as a Maven project, using intelliJ on Ubuntu 19.10. I can extract table names and columns running Analyzer.analyzeStatement(...), using Visitors, as suggested by @MartinSahlen. However, I'm having trouble getting information concerning the alias used for columns. In this test query:

select ff_nr as myalias from kfir

I am only able to extract ff_nr as column name. I've tried a few of the many Visitors without finding any alias information. I expected them to be in the ResolvedOutputColumns, but that is empty for me.

Any pointers?

MartinSahlen commented 4 years ago

@unusualmutant my go-to is always just printing everything when I try to figure out what is going on. when I have used zetaSQL in C++ the all nodes have a sophisticated debug logging that prints a visual structure with the node types and data in them. If I remember correctly there should be something similar in Java, where the abstract class ResolvedNode implements a debugString() method to get a visual structure you can inspect. So you would just take the result of the analyzeStatement, which would be a concrete implementation of ResolvedNode and it should print the full parse tree for you.

unusualmutant commented 4 years ago

@MartinSahlen Thanks a lot. I'll look into the possibilities, but I did try calling debugString(), and it threw a "not implemented" exception somewhere in the chain if I remember correctly. So maybe it is not fully implemented yet in the Java client. I'll have a look again.

MartinSahlen commented 4 years ago

@unusualmutant Yes that is a problem, the GRPC client / the java code does not yet implement/expose all the functionality present in the c++ code unfortunately :(

apstndb commented 4 years ago

Because debugString() is called by many function(for example toString()), ZetaSQL will be used more easier when debugString() is implemented.

unusualmutant commented 4 years ago

Ok, I found the column alias information. It's in ResolvedOutputColumn.getName(), and not ResolvedOutputColumn.getColumn().getName().

But now I'm trying to use the standard sql function CONCAT, and I get an error saying "com.google.zetasql.io.grpc.StatusRuntimeException: INVALID_ARGUMENT: Function not found: CONCAT".

Here is the select:

select table_alias.FF_NR as alias1, table_alias.individ_type, CONCAT(table_alias.FF_NR, table_alias.individtype) AS ConcatAlias from kfir_d table_alias

It's such a standard function that it must be supported, but I probably need to do something more to get it to work. Do I need to add it to the catalog?

matthewcbrown commented 4 years ago

CONCAT Is actually not supported in the current release, but will be in the next one. I'll try to get that out this week. You can see (sort of indirectly) what is currently supported by looking in: zetasql/reference_impl/function.h The FunctionKind enum lists all the functions suppported (although, they use internal names, rather than SQL names, so you have a squint a little). Sorry for the trouble on this.

MartinSahlen commented 4 years ago

@matthewcbrown could you not add the CONCAT (or any other missing) function manually to the catalog as a workaround for missing "expected standard library" functions? To me it seemed that you can build the catalog that way, but I may be wrong here.

Essentially, calling this first:

https://github.com/google/zetasql/blob/79adcd0fe227173e68ed7aa88f580a691ebe82c2/java/com/google/zetasql/SimpleCatalog.java#L356

And then this:

https://github.com/google/zetasql/blob/79adcd0fe227173e68ed7aa88f580a691ebe82c2/java/com/google/zetasql/SimpleCatalog.java#L405

to add whatever function you want to add.

unusualmutant commented 4 years ago

Thank you for the information @matthewcbrown. And yes @MartinSahlen, I can do a workaround like this:

ZetaSQLBuiltinFunctionOptions zetaSQLBuiltinFunctionOptions = new ZetaSQLBuiltinFunctionOptions(); zetaSQLBuiltinFunctionOptions.includeFunctionSignatureId(ZetaSQLFunction.FunctionSignatureId.FN_CONCAT_STRING); catalog.addZetaSQLFunctions(zetaSQLBuiltinFunctionOptions);

unusualmutant commented 4 years ago

I have most of it working now, I am able to get a list of input to output mappings using zetasql. Except for one thing. When my SQL statement contains this:

ROW_NUMBER() OVER (PARTITION BY ...)

I get this error:

Analytic functions not supported

Are they not supported or is is just something I need to add manually to the catalog like the other functions?

Also, re. the other functions, @matthewcbrown wrote that CONCAT was not supported yet, but I found that no functions are supported at all until I add them to the catalog, like "AND", "OR", "<" and ">". Is that expected behavior?

MartinSahlen commented 4 years ago

@unusualmutant did you add this to the language features for the analyzer options? Analytic functions are a part of the syntax I think so needs to be added there

(https://github.com/google/zetasql/blob/78defaf98f1255d6e5f4c3eef8cc3d66ca963dfb/javatests/com/google/zetasql/LanguageOptionsTest.java#L51)

In my understanding, yes, you must add ALL language features, options and functions yourself. I guess that makes sense since you then have maximum flexibility but a bit painful to get started. That is why they have the builtinfunction things I assume, to have what is considered to be standard sql as a drop in option.

unusualmutant commented 4 years ago

That did the trick! Thank you. So annoying, I actually had already played with the "enablemaximumfeatures" setting when I couldn't get the functions working, but had then commented it out again :)

muratozcandwh commented 4 years ago

@unusualmutant hello; can you share the code you conduct for parsing the sql? I also need looked at zetasql for the same purpose with you actually.

apstndb commented 4 years ago

FYI, I'm also playing with ZetaSQL Java wrapper using Kotlin https://github.com/apstndb/zetasql-sandbox

unusualmutant commented 4 years ago

@unusualmutant hello; can you share the code you conduct for parsing the sql? I also need looked at zetasql for the same purpose with you actually.

@muratozcandwh I don't have the code in a state where I'm ready to create a github repo, but if you give me your email address I can send you a zip file with my code. I've created a springboot endpoint for deploying on app engine, and currently it seems to work fine. There is a bug concerning WITH statements, but that'll be fixed soon.

muratozcandwh commented 4 years ago

@unusualmutant my email: muratozcanbilkent@gmail.com thank you in advance. If I make any modification I'll also inform you.

padod commented 4 years ago

@unusualmutant , did I understand correctly, that you're working on native Linux, not in docker? Which release do you use? I tried to build 3 most recent releases with docker on macOS with 16 GB memory, and I'm consistently running into OOM (gcc internal compiler error: Killed)

MartinSahlen commented 4 years ago

@padod If you see the discussion previously - he is using the pre-built java version from maven: https://github.com/google/zetasql/issues/4#issuecomment-571957837

padod commented 4 years ago

Sorry, missed that. In this case, can you please spare me some research by telling which ZetaSQL deps will block me from building project on macOS (via maven, for example)? May be a stupid question, I'm not a pro. Thanks anyway

apstndb commented 4 years ago

I believe ZetaSQL doesn't support MacOS yet. Jar has only JNI binary for Linux. See

padod commented 4 years ago

@apstndb, thanks, checked it out, works like a charm! Didn't know about jib. I only think that I'll refactor it to java :)

mx2323 commented 4 years ago

what is the state of this library and whether or not in can parse bigquery sql? i read through this thread and do not understand if a query that is valid and runnable in bigquery can be parsed with table names retrieved from the query string.

janhicken commented 4 years ago

@mx2323 As long as you are not using BigQuery ML in your queries, everything can be parsed just fine. We have not encountered any incompatibilities yet.

mingen-pan commented 4 years ago

A beginner question: How do we run other functions besides //zetasql/experimental:execute_query?

hidarapaneni commented 4 years ago

Hi Team.. am trying to use extracttablenamesfromstatement with java version.. but i keep getting client noot found.. Analyzer.extractTableNamesFromStatement("SELECT x from y") java.lang.IllegalStateException: No ZetaSQL ClientChannelProvider loaded. not sure if i am missing something.. do i need to instantiate client before and i don't see anything related to client part