Feature: query with Apache Arrow result

levakin commented 11 months ago

This PR implements Connection method QueryArrowContext to get query result in Apache Arrow format. Connection type is now exported and can be used in similar way as pgx driver. See example here.

Partly implements #75 , some methods are still missing.

marcboeker commented 11 months ago

Thanks @levakin for the PR. Could you please have a look at https://github.com/marcboeker/go-duckdb/actions/runs/7060722450/job/19233036222?pr=134 Also I need some time to review the PR since I'm currently overwhelmed with other tasks.

levakin commented 11 months ago

@marcboeker thanks, I fixed the issue. It was happening due to different memory allocation behaviour on linux and macos. There actually 2 ways to fix it.

First option is to check how many rows were retrieved already and stop if there are no more.
The second option is to use calloc instead of malloc here cdArr := (*cdata.CArrowArray)(unsafe.Pointer(C.malloc(C.sizeof_struct_ArrowArray))).

Before the fix malloc returned non-zero memory chunk for ArrowArray on second call. The ArrowArray.length is checked and it's not 0 on linux (not even the length of previous chunk, just some large number). Then it tries to import this ArrowArray and obviously fails.

marcboeker commented 10 months ago

@levakin I would prefer to separate the standard driver from additional features like Arrow support or the DuckDB appender API. The DuckDB Appender implementation has a method called NewAppenderFromConn(driverConn driver.Conn, schema string, table string) (*Appender, error) that is used to initialize a new appender. Is it possible to do the same for the Apache Arrow interface to keep the default driver free of extensions?

With dbConn, ok := driverConn.(*conn) you are able to obtain a raw DuckDB database connection.

Have you seen this part of the PR, which does exactly what I'm proposing.

levakin commented 10 months ago

@marcboeker Yes, I can create such interface. I thought it would be a better idea to follow the same library design as https://github.com/jackc/pgx does. Postgres native interface (without any additional interface initializations like Appender) + stdlib wrapper provided as separate package

marcboeker commented 10 months ago

@levakin As we already have the Appender separated from the main driver, I would prefer a separated Arrow implementation as well.

Maybe it would make sense to get in contact with @phillipleblanc as he has already built an Arrow integration. The goal should be to have one implementation shipped with go-duckdb that satisfies all parties.

phillipleblanc commented 10 months ago

@catkins mentioned in #75 that it might make sense to have the Arrow interface as a separate module - so people that didn't need the Arrow interface wouldn't need to pull in the Arrow dependencies (since it can be quite heavyweight). That also meshes with @marcboeker's desire to have the Arrow interface separated from the main driver - but I'll defer to Marc on what he wants to do there.

The approach @levakin took with using prepared statements is better than my implementation in https://github.com/spicehq/go-duckdb/blob/master/connection.go#L36 - which has the drawback of the context not being respected and needing to execute the query synchronously.

What I can do is work on a separate module that wraps go-duckdb and provides an Arrow interface using the prepared statement approach that @levakin implemented. @marcboeker can then decide to merge the code into this repo (similar to the Appender pattern), or create a new repo (i.e. go-duckdb-arrow) and import the code there, and tell people that need to Arrow interface to use that module.

EDIT: Actually I'm not sure if the separate module approach will work - both implementations need access to internal functions/state to work properly.

catkins commented 10 months ago

Actually I'm not sure if the separate module approach will work - both implementations need access to internal functions/state to work properly.

if they're in the same tree, could internal packages be used to share code between them without exposing it to consumers?

levakin commented 9 months ago

I'll be back soon from the holidays and implement proposed changes. I don't think a separate repo would be convenient to use, but a separate module is a good idea

marcboeker commented 9 months ago

@phillipleblanc I'm fine with including it in the go-duckdb module, but maybe not tightly integrated into the core driver but instead as an add-on like the Appender. That keeps the main driver interface clean but also does not require a second module/dependency. I'd be happy to merge such a variant.

levakin commented 9 months ago

@marcboeker please check again. I've separated most of the new logic in arrow.go. Changelog is in the latest commit message.

Separating Arrow in a different module would at least require exposing Statements methods. If we can tolerate arrow dependency in one module, it would be easier to maintain IMHO

marcboeker commented 9 months ago

Hey @levakin, sorry for the delay and thank you for the extensive refactoring. I've also refactored the code to keep driver.Conn in the function signature instead of any for the conn. I've moved the C code to the header of the arrow.go to have everything in one place. I also moved the Arrow specific functions from the statement.go to the arrow.go. Please review.

Next, we should validate the Arrow specific code for any memory leaks (which can easily occur in a CGO environment) and check for correctness/optimizations. As I have no experience with Apache Arrow, I need to understand it first. Additionally, there were some refactorings that were not related to the Arrow feature. I've tried to keep most of them. If anything was lost, I suggest fixing that in a separate PR.

levakin commented 9 months ago

@marcboeker Thanks for the review. Keeping C code in arrow.go is ok, since there is not much code in there. Initial idea was to follow the approach from the adbc driver which stores these structures in adbc.h.

Let's discuss the approach with driver.Conn in the function signature instead of any for the conn. Why do you think it's a necessary step to make type assertion conn, ok := driverConn.(driver.Conn), when we can just skip it and pass driverConn from the conn.Raw(func(driverConn any) error { instead?

Also I think users should be encouraged to use conn.Raw which returns any, rather than connector.Connect returning Conn, because in this case sql package maintains the pool of connections. That's why I also changed README and test files.

Are you sure sure we need to provide both Query and QueryContext methods in new interfaces?

// Deprecated: Use QueryContext instead.
func (a *Arrow) Query(query string, args ...any) (array.RecordReader, error) {
    return a.QueryContext(context.Background(), query, args)
}

AFAIK QueryContext method in the standard library was added after context package was developed, and it's the only reason why it was not included in Query method to follow backwards compatibility. So it will be safe to provide Query method accepting context argument since the beginning.

levakin commented 9 months ago

@marcboeker Do you think having working examples in the README is a bad idea or should it be addressed in a separate PR? The same question is for Connector refactoring, which is stateful but cannot be closed.

marcboeker commented 9 months ago

Let's discuss the approach with driver.Conn in the function signature instead of any for the conn. Why do you think it's a necessary step to make type assertion conn, ok := driverConn.(driver.Conn), when we can just skip it and pass driverConn from the conn.Raw(func(driverConn any) error { instead?

Changing a function signature, even if it does not break things, is unusual in Go without bumping the major release number. Normally you initialize a new Arrow connection with

connector, err := duckdb.NewConnector("", nil)
if err != nil {
  ...
}
conn, err := connector.Connect(context.Background())
if err != nil {
  ...
}
defer conn.Close()

// Retrieve Arrow from connection.
ar, err := duckdb.NewArrowFromConn(conn)

which works perfectly fine. In most cases you would not need a conn.Raw(func(driverConn any) {}) when not using the raw DB conn beforehand.

Also I think users should be encouraged to use conn.Raw which returns any, rather than connector.Connect returning Conn, because in this case sql package maintains the pool of connections. That's why I also changed README and test files.

The idea behind the Connector was to have a convenient way of running an init func, that for example installs necessary plugins.

Are you sure sure we need to provide both Query and QueryContext methods in new interfaces?

I'm fine with removing it. My idea was to keep the interface in sync with the SQL driver and remove Query() once the SQL driver does it first.

marcboeker commented 9 months ago

@marcboeker Do you think having working examples in the README is a bad idea or should it be addressed in a separate PR? The same question is for Connector refactoring, which is stateful but cannot be closed.

What do you mean by "working examples" in the README? Are they broken?

Regarding the Connector: why should it not be possible to close it? https://github.com/marcboeker/go-duckdb/blob/495b2cf3dbf7d4716a975f895b560f54886e9813/duckdb.go#L107

Could you please elaborate a bit. Thanks!

levakin commented 9 months ago

@marcboeker ok, then let's keep the same interface. However, changing to the 'any' type won't break backwards compatibility, so a major version won't be needed.

The idea behind the Connector was to have a convenient way of running an init func, that for example installs necessary plugins.

I aggree with you, it's very convenient to have init for every new connection.

I'm fine with removing it. My idea was to keep the interface in sync with the SQL driver and remove Query() once the SQL driver does it first.

I guess sql package won't remove Query until GO 2, which is unlikely. So I suggest we leave only Query(ctx, ...) or QueryContext(ctx, ...). Which option do you prefer?

Regarding the Connector: why should it not be possible to close it?

Well, it's kinda possible, but too hacky. connector type is hidden behind driver.Connector.

func (Driver) OpenConnector(dataSourceName string) (driver.Connector, error) {
    return createConnector(dataSourceName, func(execerContext driver.ExecerContext) error { return nil })
}

And to get access to Close method it's needed to make type assertion.

connector, err := duckdb.NewConnector("", nil)
if err != nil {
  ...
}
connector.(io.Closer).Close() // Hidden behind driver.Connector returned by NewConnector

What do you mean by "working examples" in the README? Are they broken?

I meant, that you can't just copy them to your IDE and expect to be run as is. With the changes I proposed, I had advantages like syntax highlighting, autoformatting, etc. like with normal code. If you prefer examples like they are now, it's fine by me.

levakin commented 9 months ago

Let me give more concrete example regarding connector.Close. Let's take README example.

connector, err := NewConnector("test.db", nil)
if err != {
  ...
}
conn, err := connector.Connect(context.Background())
if err != {
  ...
}
defer conn.Close()

// Retrieve appender from connection (note that you have to create the table 'test' beforehand).
appender, err := NewAppenderFromConn(conn, "", "test")
if err != {
  ...
}
defer appender.Close()

Here connector is opened and Database is opened internally. The bug here is that connector is never closed. Unless it's passed to db := sql.OpenDB() and then closed with db.Close().

Here's part of sql package code:

func (db *DB) Close() error {
....
if c, ok := db.connector.(io.Closer); ok {
        err1 := c.Close()
        if err1 != nil {
            err = err1
        }
    }

We end up with 2 options, leave it as is and work with database only through standard sql package or expose connector Close method to the outer world. It can be done by returning ConnectorCloser interface. Also in this case we should ensure Connector is fine with multiple close calls.

marcboeker commented 9 months ago

I guess sql package won't remove Query until GO 2, which is unlikely. So I suggest we leave only Query(ctx, ...) or QueryContext(ctx, ...). Which option do you prefer?

Okay, then let's remove Query as QueryContext is also used in the regular driver and is the new standard.

Regarding the Connector: Thanks for the detailed explanation. You probably refer to this issue? https://github.com/golang/go/issues/41790

I would suggest to first merge the Arrow feature and then take care of implementing the io.Closer interface in a second PR.

I meant, that you can't just copy them to your IDE and expect to be run as is. With the changes I proposed, I had advantages like syntax highlighting, autoformatting, etc. like with normal code. If you prefer examples like they are now, it's fine by me.

Okay, got it. What do you think of keeping the samples in the README short and concise and adding two new examples as Go files for the Arrow and Appender to examples/? We can then also link from the README to the examples.

levakin commented 9 months ago

You are right about the the issue. Here's this check https://go-review.googlesource.com/c/go/+/258360/6/src/database/sql/sql.go from https://go-review.googlesource.com/c/go/+/258360

@marcboeker Yeah, sure. I can file a PR and let's resolve it separately.

Next, we should validate the Arrow specific code for any memory leaks (which can easily occur in a CGO environment) and check for correctness/optimizations. As I have no experience with Apache Arrow, I need to understand it first.

I would appreciate some help with it :)

marcboeker commented 9 months ago

@marcboeker Yeah, sure. I can file a PR and let's resolve it separately.

Great, thanks!

I would appreciate some help with it :)

I've done some simplifications as C.calloc already returns an unsafe.Pointer. Do you see any impacts?

And one test case didn't return anything as there were no rows in the DB. I've added some test data.

If you don't see any issues, I would suggest merging it.

levakin commented 9 months ago

I don't see any impact. However type casting C pointers is cumbersome and we can improve later if needed.

Thanks! Let's merge it @marcboeker 👍🏻

marcboeker commented 9 months ago

@levakin Thanks for your hard work on this 🙏 It's merged. Will release a new version when #143 is merged.

marcboeker / go-duckdb

Feature: query with Apache Arrow result #134