google / zetasql

ZetaSQL - Analyzer Framework for SQL
Apache License 2.0
2.28k stars 214 forks source link

Java version: How to support Function namepath with size > 1 #108

Open tha23rd opened 2 years ago

tha23rd commented 2 years ago

Hi all -

We are trying to add our UDFs as functions to a zetaSQL catalog. While we are able to successfully add them, when we attempt to parse SQL that uses them, we always get "function not found". However, this seems to only be the case if the namepath size is > 1. I.e "udf.stringLength" vs "stringLength"

I have two test cases to show this:

Doesn't work:

    @Test
    fun canParseQueryWithUDF() {
        val func = Function(
                ImmutableList.of("proj","dataset", "test_func"),
                "udf",
                ZetaSQLFunctions.FunctionEnums.Mode.SCALAR,
                listOf(
                    FunctionSignature(
                        FunctionArgumentType(TypeFactory.createSimpleType(ZetaSQLType.TypeKind.TYPE_INT64)),
                        ImmutableList.of(),
                        -1
                    )
                )
        )
        val catalog = SimpleCatalog("test_catalog")
        catalog.addFunction(func)
        val testSql = "select `proj.dataset.test_func`()"
        val analyzer = Analyzer(SQLAnalyzer.buildOptions(), catalog)
        analyzer.analyzeStatement(testSql)
    }

com.google.zetasql.SqlException: Function not found: 'proj.dataset.test_func' [at 1:8]

Doesn't work:

    @Test
    fun canParseQueryWithUDF() {
        val func = Function(
                ImmutableList.of("proj.dataset.test_func"),
                "udf",
                ZetaSQLFunctions.FunctionEnums.Mode.SCALAR,
                listOf(
                    FunctionSignature(
                        FunctionArgumentType(TypeFactory.createSimpleType(ZetaSQLType.TypeKind.TYPE_INT64)),
                        ImmutableList.of(),
                        -1
                    )
                )
        )
        val catalog = SimpleCatalog("test_catalog")
        catalog.addFunction(func)
        val testSql = "select `proj`.dataset.test_func()"
        val analyzer = Analyzer(SQLAnalyzer.buildOptions(), catalog)
        analyzer.analyzeStatement(testSql)
    }

com.google.zetasql.SqlException: Function not found: 'proj.dataset.test_func' [at 1:8] Works:

    @Test
    fun canParseQueryWithUDF() {
        val func = Function(
                ImmutableList.of("test_func"),
                "udf",
                ZetaSQLFunctions.FunctionEnums.Mode.SCALAR,
                listOf(
                    FunctionSignature(
                        FunctionArgumentType(TypeFactory.createSimpleType(ZetaSQLType.TypeKind.TYPE_INT64)),
                        ImmutableList.of(),
                        -1
                    )
                )
        )
        val catalog = SimpleCatalog("test_catalog")
        catalog.addFunction(func)
        val testSql = "select `test_func`()"
        val analyzer = Analyzer(SQLAnalyzer.buildOptions(), catalog)
        analyzer.analyzeStatement(testSql)
    }

As a follow up, do you all do any additional processing of queries before parsing them with zetaSQL? It seems to get consistent error free parsing we need to ensure that table refs are quoted in a very specific way

`my-proj.dataset.my_table` vs `my-proj`.dataset.my_table vs dataset.my_table

For the above, only the first seems to consistently work (thus we process each query to ensure its table refs look like that)

matthewcbrown commented 2 years ago

Yeah, this is pretty terrible, here is what's going on.

First, the language point of view: When you have a dotted identifier like "a.b.Fn()", the language see this as a path [a,b,fn]. However, each catalog can interpret this path however it wants. This could be a function called "a.b.fn" a path "a.b" with a function called "fn", [a.b, fn], a path "a" with a function call "b.fn" [a, b.fn], etc.

The user can distinguish between these options by adding identifier quoting (backtick): thus a.b.fn is a function called "a.b.fn" and a.b.fn is a path with three literals.

Second, the situation in code The c++ zetasql::Catalog interface is pretty generic, you give it a name path, and it gives you a function. But SimpleCatalog (which implements zetasql::Catalog) in c++ and java has a 'convenient' sub-catalog mechanism that interprets name paths as lookups in a subcatalogs.

Now, confusingly, Function objects name paths usually need to know their full paths in the catalogs, but this isn't verified to be consistent, so you can end up with really weird behavior. So, hopefully that explains why all that weirdness is happening. I think the solution for you is to use sub-catalogs.

I tested out some changes to our AnalyzerTest.java to hopefully demonstrate all this behavior (including how to construct sub-catalogs and add a function to it).

  @Test
  public void testFunctionWithDottedName() {
    FunctionArgumentType stringType =
        new FunctionArgumentType(TypeFactory.createSimpleType(TypeKind.TYPE_STRING));
    Function func =
        new Function(
            "p1.p2.fn",
            "TestGoogleSQL",
            Mode.SCALAR,
            ImmutableList.of(new FunctionSignature(stringType, ImmutableList.of(stringType), /* contextId= */ -1)));

    SimpleCatalog catalog = new SimpleCatalog("");
    catalog.addFunction(func);

    LanguageOptions options = new LanguageOptions();
    AnalyzerOptions analyzerOptions = new AnalyzerOptions();
    analyzerOptions.setLanguageOptions(options);
    Analyzer analyzer = new Analyzer(analyzerOptions, catalog);

    ResolvedStatement resolvedStatement;

    resolvedStatement = analyzer.analyzeStatement("SELECT `p1.p2.fn`('')");
    assertThat(resolvedStatement).isNotNull();

    assertThrows(
        SqlException.class,
        () -> analyzer.analyzeStatement("SELECT p1.p2.fn('')"));
  }

  @Test
  public void testFunctionWithNamePath() {
    FunctionArgumentType stringType =
        new FunctionArgumentType(TypeFactory.createSimpleType(TypeKind.TYPE_STRING));
    Function func =
        new Function(
            ImmutableList.of("p1", "p2", "fn"),
            "TestGoogleSQL",
            Mode.SCALAR,
            ImmutableList.of(new FunctionSignature(stringType, ImmutableList.of(stringType), /* contextId= */ -1)));

    SimpleCatalog root = new SimpleCatalog("");
    SimpleCatalog p1 = root.addNewSimpleCatalog("p1");
    SimpleCatalog p2 = p1.addNewSimpleCatalog("p2");
    p2.addFunction(func);

    LanguageOptions options = new LanguageOptions();
    AnalyzerOptions analyzerOptions = new AnalyzerOptions();
    analyzerOptions.setLanguageOptions(options);
    Analyzer analyzer = new Analyzer(analyzerOptions, root);

    ResolvedStatement resolvedStatement;

    resolvedStatement = analyzer.analyzeStatement("SELECT p1.p2.fn('')");
    assertThat(resolvedStatement).isNotNull();

    resolvedStatement = analyzer.analyzeStatement("SELECT `p1`.`p2`.`fn`('')");
    assertThat(resolvedStatement).isNotNull();

    assertThrows(
        SqlException.class,
        () -> analyzer.analyzeStatement("SELECT `p1.p2.fn`('')"));
  }
tha23rd commented 2 years ago

@matthewcbrown

Thank you so much for this amazing explanation. I spent a good bit of time essentially reformatting queries using regex to correctly "backtick" our table names and function names, but this is SO much better

So it sounds like to correctly handle all of the various ways our BQ users can reference functions/tables, we'd essentially want to do:

Given the fully qualified table/func name: project.dataset.table Which can be referenced in the following ways:

We'd want to add catalogs in the following manner:

val root = SimpleCatalog("root")
val proj_catalog = root.addNewSimpleCatalog("proj")
val dataset_catalog = proj_catalog.addNewSimpleCatalog("dataset")
// but also add this catalog at the root level to handle the "dataset.table" use case
root.addSimpleCatalog(dataset_catalog)

EDIT: But the above won't actually handle the "`proj.dataset.table`" use case.

After some short testing this looks like it fits the bill. Let me know if I'm missing anything :D

tha23rd commented 2 years ago

@matthewcbrown Sorry for the ping, was wondering if you saw the edit above

"EDIT: But the above won't actually handle the "proj.dataset.table" use case."

Any guidance you could provide here?

matthewcbrown commented 2 years ago

ah, sorry, I didn't see this - for some reason I don't get emails for edits.

Anyway, I don't have a great solution for you, as far as googlesql::SimpleCatalog is concerned the backtick quoted string is different from the non-backtick quoted string - one is a path with 3 elements, the other is a path with one element.

In c++ this is pretty straightforward to handle, you can just override the FindTable to return the same Table object in either case, but from Java, which requires a SimpleCatalog the best i can think of is to add the Table in all places, although, these would look like separate tables to the resolver. If you are running smallish queries that only one use one style, this is probably okay, but if you actually need them to all resolve to the same table, it's going to muck things up.