marcboeker / go-duckdb

go-duckdb provides a database/sql driver for the DuckDB database engine.
MIT License
583 stars 96 forks source link

[Feature] Scalar UDF support #222

Open taniabogatsch opened 1 month ago

taniabogatsch commented 1 month ago

This is a draft PR. I built with duckdb's feature branch, as scalar UDFs are not yet part of the released C API. The test failures are related to that.

Because of that, and because this PR depends on #219, there are a lot of file changes in this PR, as well as separate libraries. They can all be ignored. The relevant files are listed below.

@JAicewizard, I solved passing types differently than you did in #201. What do you think? Is passing SQL type names a sensitive idea? I also tried to solely use the DataChunk API draft's functions.

In later PRs, we can extend this with ExecuteColumn.

Here is the example included in this PR.

type scalarUDF struct {
    err error
}

func (udf *scalarUDF) Config() ScalarFunctionConfig {
    return ScalarFunctionConfig{
        InputTypes: []string{"INT", "INT"},
        ResultType: "INT",
    }
}

func (udf *scalarUDF) ExecuteRow(args []driver.Value) (any, error) {
    if len(args) != 2 {
        return nil, errors.New("expected two values")
    }
    val := args[0].(int32) + args[1].(int32)
    return val, nil
}

func (udf *scalarUDF) SetError(err error) {
    udf.err = err
}

func TestScalarUDFPrimitive(t *testing.T) {
    db, err := sql.Open("duckdb", "")
    require.NoError(t, err)

    c, err := db.Conn(context.Background())
    require.NoError(t, err)

    var udf scalarUDF
    err = RegisterScalarUDF(c, "my_sum", &udf)

    var msg int
    row := db.QueryRow(`SELECT my_sum(10, 42) AS msg`)
    require.NoError(t, row.Scan(&msg))
    require.Equal(t, 52, msg)
    require.NoError(t, db.Close())
}
JAicewizard commented 1 month ago

@JAicewizard, I solved passing types differently than you did in https://github.com/marcboeker/go-duckdb/pull/201. What do you think? Is passing SQL type names a sensitive idea?

I like the idea, however I think it might be difficult to handle composite types. Handling these would require a parser for to go from "struct{x:INT, y:DOUBLE}" to a duckdb type. Personally I also prefer to have any errors be returned when "creating" the types, since then creation of the error is very close to where it problem lies. Another thing to think about is what to do with aliases that duckdb uses, do we want to mirror those, and if we do how will we keep the list updated? (and do we change the name when duckdb does?)

JAicewizard commented 1 month ago

On a seperate note, I like how few methods you need to implement, I think the tablefunctions are very complex to implement. This is kind of inherent to the type of function, but I still dislike it

taniabogatsch commented 1 month ago

On a seperate note, I like how few methods you need to implement, I think the tablefunctions are very complex to implement. This is kind of inherent to the type of function, but I still dislike it

Yes, your PR was a great help to write this! And I noticed the same, there is significantly more logic around the table UDFs.

I like the idea, however I think it might be difficult to handle composite types. Handling these would require a parser for to go from "struct{x:INT, y:DOUBLE}" to a duckdb type.

Another thing to think about is what to do with aliases that duckdb uses, do we want to mirror those, and if we do how will we keep the list updated? (and do we change the name when duckdb does?)

The parser is a fair argument against this. We currently have func logicalTypeName(lt C.duckdb_logical_type) string {...}, which does a bit of that, but it would have to be much improved. Additionally, the second argument is a strong one for me. If we stick to the go types and go-duckdb types, then they match the casts from any in the execution function. Maybe we can draft a Type interface in a separate PR, though? As it is a general concept that we want to introduce for UDFs... And it should integrate well with the existing exposed types... 🤔

Personally I also prefer to have any errors be returned when "creating" the types, since then creation of the error is very close to where it problem lies.

This is independent of which strategy we decide on for the types, no? We can return an error when registering the UDF, if we detect invalid types.

JAicewizard commented 1 month ago

This is independent of which strategy we decide on for the types, no? We can return an error when registering the UDF, if we detect invalid types.

true, but for, for example, table UDFs, the types returned types are only determined at bind, it would turn into a generic sql error. But this is definitely debatable, its just my opinion. A seperate PR could be appropriate, I don't have much time due to exams next week. Feel free to copy my implementation if you want, also keep in mind we can add new "constructors" whenever we want, if some duckdb API comes out to parse a string into a type we can do this. (we can even make type an interface)