Open gradient-zero opened 5 years ago
Hey @gradient-zero , its quite lacking of docs, but this project has implemented a formatter API, maybe this (https://github.com/apstndb/zetasql-format-server) can guide you along the way (I'm still wrapping my head around it myself but got it working eventually). I think you might need to downgrade to bazel 0.25
After 'building' zetasql in my local machine, how do I set up the environment to create tables and start running queries?
We are still in the process of releasing this code. It doesn't (yet) include the 'reference implementation' which would allow testing simple queries.
Please correct me if I'm wrong but what I am inferring from this is that zetaSQL in its current form has all the documentation on the format to create tables, write queries etc. but no way to implement them?
@ShreyaGupta08 when you say "create tables, write queries", etc - what do you mean? Zetasql in its current form is an framework for parsing and analyzing SQL statements. So you need to provide it with a catalog of tables, columns and functions to make it work for your case. A good place to start is here: https://github.com/google/zetasql/blob/master/zetasql/analyzer/resolver_test.cc
Currently all that is in ZetaSQL is a parser, it can't execute the SQL queries. One of my coworkers is working on hooking it up to Apache Beam SQL which is able to execute queries. You can take a look here for an example: https://github.com/apache/beam/pull/9210
Can i use the parser inside zetasql as library ?
The parser directory is considered private (as is the parser-based AST). However you can use the methods found in public/parse_helpers.h to do some parsing operations. Running the analyzer also runs the parser and provides an abstract syntax tree you can do more interesting stuff with. See public/analyzer.h. In particular zetasql::AnalyzerStatement can return an AST.
You can also fork it and change the package visibility if you want to play around with the internals of zetasql.
I am playing with zetasql in apstndb/zetasql-format-server and I have recently updated to use tag/2019.07.1. SQLFormatter is now implemented as public visibility so zetasql-format-server is simpler than before and forking zetasql(apstndb/zetasql) is not needed anymore.
If more functions become accessible through public interface and examples are added, it is nice for me.
You can issue (very) simple queries via: bazel run //zetasql/experimental/execute_query -- "SELECT 1"
It's very limited, in particular it doesn't support any data sources (read or write), and most of the functions are lacking, but it's an example of how to tie together the (recently released) evaluator to do some work.
Could some example be provided as a reference ? For example, how a Key-value store such as LevelDB could support ZetaSQL dialect ?
I'm interested in ZetaSQL and me too I think the documentation might be expanded a bit to cover interesting use cases. i.e. what is the rationale behind ZetaSQL ? does it aim to cover some SQL standard ? What query engines might be used with it right now ? What is the canonical way to "glue" an existing query engine ? What examples are available ?
I've tried cloning and building on Ubuntu 18.04 using "bazel build ...", but it errors for me. Are there som instructions or requirements missing?
The error I get is:
ERROR: /home/bjorn_velliv_dk/zetasql/java/com/google/zetasql/BUILD:216:1: Building java/com/google/zetasql/libclient.jar (29 source files) and running annotation processors (AutoAnnotationProcessor, AutoOneOfProcessor, AutoValueProcessor) failed (Exit 1) bazel-out/k8-fastbuild/bin/java/com/google/zetasql/_javac/client/libclient_sourcegenfiles/com/google/zetasql/AutoValue_ParseLocationRange.java:7: error: [strict] Using type javax.annotation.Generated from an indirect dependency (TOOL_INFO: "@javax_annotation_javax_annotation_api//:javax_annotation_javax_annotation_api"). See command below ** @Generated("com.google.auto.value.processor.AutoValueProcessor") ^ ** Please add the following dependencies: @javax_annotation_javax_annotation_api//:javax_annotation_javax_annotation_api to //java/com/google/zetasql:client ** You can use the following buildozer command: buildozer 'add deps @javax_annotation_javax_annotation_api//:javax_annotation_javax_annotation_api' //java/com/google/zetasql:client
I've successfully built on ubuntu 18.04 with stock g++-7, java and python. I have met the very same error as @unusualmutant, the patch has been to modify the file: java/com/google/zetasql/BUILD adding the missing dependency at line 244, as in what follows:
216 java_library(
217 name = "client",
218 srcs = CLIENT_SRCS,
219 deps = [
220 ":channel_provider",
221 ":types",
222 "//zetasql/local_service:local_service_java_grpc",
[snip....]
241 "@com_google_guava_guava//jar",
242 "@com_google_protobuf//:protobuf_java",
243 "@io_grpc_grpc_core//jar",
244 "@javax_annotation_javax_annotation_api//:javax_annotation_javax_annotation_api"
245 ],
246 )
in other words, with git diff you have:
fstrati@fstrati-HP-Laptop-15-db0xxx:~/eclipse-workspace/zetasql$ git diff java/com/google/zetasql/BUILD
diff --git a/java/com/google/zetasql/BUILD b/java/com/google/zetasql/BUILD
index 99c0697..78f0d42 100644
--- a/java/com/google/zetasql/BUILD
+++ b/java/com/google/zetasql/BUILD
@@ -241,6 +241,7 @@ java_library(
"@com_google_guava_guava//jar",
"@com_google_protobuf//:protobuf_java",
"@io_grpc_grpc_core//jar",
+ "@javax_annotation_javax_annotation_api//:javax_annotation_javax_annotation_api"
],
)
Hope that helps :-)
BTW, on my machine the compilation took quite some time and bazel was eager of resources leading to the stall of the window manager, if it seems to get stuck... leave it alone for a coffee break and come back after 20 minutes...
You're a star @fstrati, that worked, thanks a lot!
I apologize in advance for my ignorance. I'm looking for something which is able to take a standard sql statement (currently used in BigQuery), and give me an AST, so that I can work out what table/field inputs are used and what output (names) they map to. I was hoping that zetasql would do the job, but while I'm able to run the test statement:
bazel run //zetasql/experimental:execute_query -- "select 1 + 1;"
I'm unsure how to move on from there. Is there anything in the current implementation that will allow me to do what I described, or do I need to build it myself using zetasql? Or have I completely misunderstood the purpose/functionality of zetasql?
execute_query has a mode to dump the ast as text (--mode=resolve), you can also try --help to see more info. bazel run //zetasql/experimental:execute_query -- --mode=resolve "select 1 + 1;"
However, zetasql's reference implementation (which back 'execute_query') doesn't have the same understanding of tables as BigQuery - i.e. there is no way to actually teach it which tables exist, or what their schema is - other than, perhaps creating dummy tables using 'With' statements: 'WITH Table AS (SELECT CAST(NULL AS STRING) x) SELECT x FROM Table;'
Thank you very much Matthew. Again I must flaunt my lack of comprehension, but it sounds to me like zetasql is not something I can use for the purpose I described? To be of any use, one must integrate zetasql with some sql engine, is that correctly understood? Otherwise I fail to understand what the use is since I can't get the AST for a real-life query like SELECT t1.a from table1 t1.
@unusualmutant If you just want this to work with BigQuery, a dry run (and / or the query statistics if the query has not been retrieved from cache) will give you information about what tables are being referenced.
This is also easily done using the extractTableNamesFromStatement
https://github.com/google/zetasql/blob/aac17f8352ab620afbf6124d7193ccedc79ec786/java/com/google/zetasql/Analyzer.java#L111 as well.
If you also want to extract column names, that gets a bit trickier, as you need to keep track of derived columns from CTEs etc etc. My suggestion here (if you have a big catalog of tables and columns) would probably be to first extract the table names in use, and then look those up in the database catalog. Then, you would map those tables into a SimpleCatalog
(https://github.com/google/zetasql/blob/master/java/com/google/zetasql/SimpleCatalog.java). After that, run a analyzeStatement
https://github.com/google/zetasql/blob/aac17f8352ab620afbf6124d7193ccedc79ec786/java/com/google/zetasql/Analyzer.java#L49 with the catalog.
Then, after that, implement a Visitor (https://github.com/google/zetasql/blob/109c7cdb95d7e16784e05a9293a87b129b5312fd/javatests/com/google/zetasql/VisitorTest.java#L62) to visit the nodes and do something interesting with them.
Maybe a naive approach, but should give you what you want with some trial and error?
Also will add that using zetasql in java (only works on linux in my experience but maybe it's fixed now) is a lot smoother than building from source, you just add it as a maven /gradle dependency https://mvnrepository.com/artifact/com.google.zetasql. Happy to share my setup if anyone is interested.
Thank you very much for that explanation. appreciate it. I would very much like to have your setup if possible.
Thank you very much for that explanation. appreciate it. I would very much like to have your setup if possible.
I will try to get a repo running by the end of this week. In the meantime, its easy enough to get started like this (gradle or maven):
public static void main
as per the java convention:import com.google.zetasql.Analyzer
public class MainClass {
public static void main( String args[] ){
//this will be a list of tables, where each sublist is the components of the table path, i.e
// [[projectID1,datasetId1,tableId1],[projectID2,datasetId2,tableI2]] in the case of a bigquery query
List<List<String>> tables = Analyzer.extractTableNamesFromStatement("SELECT x from y");
// now we can use these to create our catalog and feed that to an analyzer instance
// or we could construct and cache the entire catalog beforehand as a singleton.
}
}
Run the file (easy enough in intelliJ or Eclipse). The only caveat is you need to get this into docker if the linux-only issue is not fixed. But the steps are quite simple here too.
A lot of "go to definition" is encouraged, but the code is quite well organized and written so it shouldnt be a huge hassle. Most of the time will be spent on "glueing" and mapping data types, functions etc into zetasql conventions, which might not be straightforward.
If you are only targeting bigquery, I would use zetasql. If you are targeting multiple DBs, my tip is to look into their own specific parsers, query planners etc to get the best and correct results. They all have subtle differences that make it extremely difficult to have a "master parser". You will be able parse queries that conform to ANSI correctly with zetaSQL, but for any extension or difference, it will fail, which is quite annoying (at least in my experience so far with zetasql).
I'm not sure if Google is using zetasql for anyhting other than bigquery, but I don't think so. that might just my speculation though, so happy if we can get any information from the googlers around here.
Thanks again, you're a huge help. I probably won't have time to try it out before the weekend, so maybe you'll beat me to it :)
I'm only targeting BigQuery, so that is a blessing. I'll try intelliJ on windows, and if it doesn't work, I'll boot up in Ubuntu.
ZetaSQL Java only runs on Linux at the moment.
ZetaSQL is the standard for writing SQL engines inside Google. Externally available products include BigQuery, Spanner, and Dataflow SQL. The current focus of the ZetaSQL effort is to enable these products to release open source tools. Beam SQL (which powers Dataflow SQL) is the first open source use of ZetaSQL, the integration can be found here: https://github.com/apache/beam/tree/master/sdks/java/extensions/sql/zetasql
I have now got it working as a Maven project, using intelliJ on Ubuntu 19.10. I can extract table names and columns running Analyzer.analyzeStatement(...), using Visitors, as suggested by @MartinSahlen. However, I'm having trouble getting information concerning the alias used for columns. In this test query:
select ff_nr as myalias from kfir
I am only able to extract ff_nr as column name. I've tried a few of the many Visitors without finding any alias information. I expected them to be in the ResolvedOutputColumns, but that is empty for me.
Any pointers?
@unusualmutant my go-to is always just printing everything when I try to figure out what is going on. when I have used zetaSQL in C++ the all nodes have a sophisticated debug logging that prints a visual structure with the node types and data in them. If I remember correctly there should be something similar in Java, where the abstract class ResolvedNode
implements a debugString()
method to get a visual structure you can inspect. So you would just take the result of the analyzeStatement
, which would be a concrete implementation of ResolvedNode
and it should print the full parse tree for you.
@MartinSahlen Thanks a lot. I'll look into the possibilities, but I did try calling debugString(), and it threw a "not implemented" exception somewhere in the chain if I remember correctly. So maybe it is not fully implemented yet in the Java client. I'll have a look again.
@unusualmutant Yes that is a problem, the GRPC client / the java code does not yet implement/expose all the functionality present in the c++ code unfortunately :(
Because debugString()
is called by many function(for example toString()
), ZetaSQL will be used more easier when debugString()
is implemented.
Ok, I found the column alias information. It's in ResolvedOutputColumn.getName(), and not ResolvedOutputColumn.getColumn().getName().
But now I'm trying to use the standard sql function CONCAT, and I get an error saying "com.google.zetasql.io.grpc.StatusRuntimeException: INVALID_ARGUMENT: Function not found: CONCAT".
Here is the select:
select table_alias.FF_NR as alias1, table_alias.individ_type, CONCAT(table_alias.FF_NR, table_alias.individtype) AS ConcatAlias from kfir_d table_alias
It's such a standard function that it must be supported, but I probably need to do something more to get it to work. Do I need to add it to the catalog?
CONCAT Is actually not supported in the current release, but will be in the next one. I'll try to get that out this week. You can see (sort of indirectly) what is currently supported by looking in: zetasql/reference_impl/function.h The FunctionKind enum lists all the functions suppported (although, they use internal names, rather than SQL names, so you have a squint a little). Sorry for the trouble on this.
@matthewcbrown could you not add the CONCAT (or any other missing) function manually to the catalog as a workaround for missing "expected standard library" functions? To me it seemed that you can build the catalog that way, but I may be wrong here.
Essentially, calling this first:
And then this:
to add whatever function you want to add.
Thank you for the information @matthewcbrown. And yes @MartinSahlen, I can do a workaround like this:
ZetaSQLBuiltinFunctionOptions zetaSQLBuiltinFunctionOptions = new ZetaSQLBuiltinFunctionOptions(); zetaSQLBuiltinFunctionOptions.includeFunctionSignatureId(ZetaSQLFunction.FunctionSignatureId.FN_CONCAT_STRING); catalog.addZetaSQLFunctions(zetaSQLBuiltinFunctionOptions);
I have most of it working now, I am able to get a list of input to output mappings using zetasql. Except for one thing. When my SQL statement contains this:
ROW_NUMBER() OVER (PARTITION BY ...)
I get this error:
Analytic functions not supported
Are they not supported or is is just something I need to add manually to the catalog like the other functions?
Also, re. the other functions, @matthewcbrown wrote that CONCAT was not supported yet, but I found that no functions are supported at all until I add them to the catalog, like "AND", "OR", "<" and ">". Is that expected behavior?
@unusualmutant did you add this to the language features for the analyzer options? Analytic functions are a part of the syntax I think so needs to be added there
In my understanding, yes, you must add ALL language features, options and functions yourself. I guess that makes sense since you then have maximum flexibility but a bit painful to get started. That is why they have the builtinfunction things I assume, to have what is considered to be standard sql as a drop in option.
That did the trick! Thank you. So annoying, I actually had already played with the "enablemaximumfeatures" setting when I couldn't get the functions working, but had then commented it out again :)
@unusualmutant hello; can you share the code you conduct for parsing the sql? I also need looked at zetasql for the same purpose with you actually.
FYI, I'm also playing with ZetaSQL Java wrapper using Kotlin https://github.com/apstndb/zetasql-sandbox
@unusualmutant hello; can you share the code you conduct for parsing the sql? I also need looked at zetasql for the same purpose with you actually.
@muratozcandwh I don't have the code in a state where I'm ready to create a github repo, but if you give me your email address I can send you a zip file with my code. I've created a springboot endpoint for deploying on app engine, and currently it seems to work fine. There is a bug concerning WITH statements, but that'll be fixed soon.
@unusualmutant my email: muratozcanbilkent@gmail.com thank you in advance. If I make any modification I'll also inform you.
@unusualmutant , did I understand correctly, that you're working on native Linux, not in docker? Which release do you use? I tried to build 3 most recent releases with docker on macOS with 16 GB memory, and I'm consistently running into OOM (gcc internal compiler error: Killed)
@padod If you see the discussion previously - he is using the pre-built java version from maven: https://github.com/google/zetasql/issues/4#issuecomment-571957837
Sorry, missed that. In this case, can you please spare me some research by telling which ZetaSQL deps will block me from building project on macOS (via maven, for example)? May be a stupid question, I'm not a pro. Thanks anyway
I believe ZetaSQL doesn't support MacOS yet. Jar has only JNI binary for Linux. See
/zetasql/local_service
in zetasql-jni-channel-2020.01.1.jar
@apstndb, thanks, checked it out, works like a charm! Didn't know about jib. I only think that I'll refactor it to java :)
what is the state of this library and whether or not in can parse bigquery sql? i read through this thread and do not understand if a query that is valid and runnable in bigquery can be parsed with table names retrieved from the query string.
@mx2323 As long as you are not using BigQuery ML in your queries, everything can be parsed just fine. We have not encountered any incompatibilities yet.
A beginner question: How do we run other functions besides //zetasql/experimental:execute_query
?
Hi Team.. am trying to use extracttablenamesfromstatement with java version.. but i keep getting client noot found..
Analyzer.extractTableNamesFromStatement("SELECT x from y") java.lang.IllegalStateException: No ZetaSQL ClientChannelProvider loaded.
not sure if i am missing something.. do i need to instantiate client before and i don't see anything related to client part
Could you easily explain how to use? How to connect the parser to my Command line interface Is there any sample? Thanks