google / zetasql

ZetaSQL - Analyzer Framework for SQL
Apache License 2.0
2.28k stars 214 forks source link

How to map UDFs to Functions in ZetaSQL #106

Closed tha23rd closed 2 years ago

tha23rd commented 2 years ago

Hi!

When creating the Catalog in zetaSQL we are trying to add our UDFs so that they can be parsed correctly. However, we aren't fully sure about how to map to the various fields exposed in the Function constructor.

Say we have some UDF defined as: stringLength(String input) -> int, in a dataset called functions

How would we create this function in zetaSQL? Specifically I'm confused about what the group and mode are. Also, why can the namePath be a list?

We are using the Java version

Thanks

matthewcbrown commented 2 years ago

FYI, here is the definition of the builtin functions in our c++ code: https://github.com/google/zetasql/blob/1933acefe8eb10e293309a9fa7d7c7f1cb6bf8c4/zetasql/common/builtin_function_internal_3.cc#L121

It's going to be pretty useless without doing a lot of C++ tracing and weird initializer list mapping, but wanted to link it in case it was useful.

Question: What is 'group' for:
This isn't really used for anything in the googlesql codebase, other than identifying which functions are builtin functions. This is meant for engine-builtin functions, in case having some tag is useful. So, you can set it to anything.

What is mode: oneof SCALAR, AGGREGATE or ANALYTIC. SCALAR functions are "normal" functions that accept some arguments, and produce a single output, so, stingLength would be scalar AGGREGATE are things like 'SUM', they conceptually take a set of arguments (i.e. one per row) and produce single output. ANALYTIC functions give me a headache, but you can see the bigquery docs about them here: https://cloud.google.com/bigquery/docs/reference/standard-sql/analytic-function-concepts, but are like crazy aggregate funtions.

why is namePath a list. This is basically to support 'namespaces' similar to java or python package path. So if you wanted discord.functions.stringlength(...) or something like that, this would be a list of ["discord","functions","stringlength"].

(I recommend putting functions in a 'namespace' to avoid naming conflicts with the builtin functions).

You could also create the builtin function catalog and try to debug the structure of an analogous function like 'length' (we have some problems with our debugString functions crashing, but it might be possible to print a function without problems, or maybe an ide would have better luck, I haven't tried it)

tha23rd commented 2 years ago

@matthewcbrown Thanks for the response!

That all makes sense. I guess my broader question is how to we map our user created routines in BigQuery to Functions in zetaSQL so that we can parse queries containing these functions.

The biggest issue I see is that many of our routines are pure SQL and so BigQuery does type inference at runtime for both parameters and the returned value. It seems to me (unless I'm missing something) that zetaSQL requires explicit types for the full signature of the Function.

For our use case, we don't actually care about UDFs to parse our column lineage, so wondering if there's either:

A. A way we can bypass this and have zetaSQL ignore parsing UDFs B. A way to properly build Functions using the information that the BigQuery API gives us for routines

Thanks a ton!