Closed levakin closed 9 months ago
Thanks @levakin for the PR. Could you please have a look at https://github.com/marcboeker/go-duckdb/actions/runs/7060722450/job/19233036222?pr=134 Also I need some time to review the PR since I'm currently overwhelmed with other tasks.
@marcboeker thanks, I fixed the issue. It was happening due to different memory allocation behaviour on linux and macos. There actually 2 ways to fix it.
calloc
instead of malloc
here cdArr := (*cdata.CArrowArray)(unsafe.Pointer(C.malloc(C.sizeof_struct_ArrowArray)))
.Before the fix malloc
returned non-zero memory chunk for ArrowArray on second call. The ArrowArray.length is checked and it's not 0 on linux (not even the length of previous chunk, just some large number). Then it tries to import this ArrowArray and obviously fails.
@levakin I would prefer to separate the standard driver from additional features like Arrow support or the DuckDB appender API. The DuckDB Appender implementation has a method called NewAppenderFromConn(driverConn driver.Conn, schema string, table string) (*Appender, error)
that is used to initialize a new appender.
Is it possible to do the same for the Apache Arrow interface to keep the default driver free of extensions?
With dbConn, ok := driverConn.(*conn)
you are able to obtain a raw DuckDB database connection.
Have you seen this part of the PR, which does exactly what I'm proposing.
@marcboeker Yes, I can create such interface. I thought it would be a better idea to follow the same library design as https://github.com/jackc/pgx does. Postgres native interface (without any additional interface initializations like Appender) + stdlib wrapper provided as separate package
@levakin As we already have the Appender separated from the main driver, I would prefer a separated Arrow implementation as well.
Maybe it would make sense to get in contact with @phillipleblanc as he has already built an Arrow integration. The goal should be to have one implementation shipped with go-duckdb that satisfies all parties.
@catkins mentioned in #75 that it might make sense to have the Arrow interface as a separate module - so people that didn't need the Arrow interface wouldn't need to pull in the Arrow dependencies (since it can be quite heavyweight). That also meshes with @marcboeker's desire to have the Arrow interface separated from the main driver - but I'll defer to Marc on what he wants to do there.
The approach @levakin took with using prepared statements is better than my implementation in https://github.com/spicehq/go-duckdb/blob/master/connection.go#L36 - which has the drawback of the context not being respected and needing to execute the query synchronously.
What I can do is work on a separate module that wraps go-duckdb
and provides an Arrow interface using the prepared statement approach that @levakin implemented. @marcboeker can then decide to merge the code into this repo (similar to the Appender pattern), or create a new repo (i.e. go-duckdb-arrow
) and import the code there, and tell people that need to Arrow interface to use that module.
EDIT: Actually I'm not sure if the separate module approach will work - both implementations need access to internal functions/state to work properly.
Actually I'm not sure if the separate module approach will work - both implementations need access to internal functions/state to work properly.
if they're in the same tree, could internal
packages be used to share code between them without exposing it to consumers?
I'll be back soon from the holidays and implement proposed changes. I don't think a separate repo would be convenient to use, but a separate module is a good idea
@phillipleblanc I'm fine with including it in the go-duckdb module, but maybe not tightly integrated into the core driver but instead as an add-on like the Appender. That keeps the main driver interface clean but also does not require a second module/dependency. I'd be happy to merge such a variant.
@marcboeker please check again. I've separated most of the new logic in arrow.go
. Changelog is in the latest commit message.
Separating Arrow in a different module would at least require exposing Statements methods. If we can tolerate arrow dependency in one module, it would be easier to maintain IMHO
Hey @levakin, sorry for the delay and thank you for the extensive refactoring. I've also refactored the code to keep driver.Conn
in the function signature instead of any
for the conn. I've moved the C code to the header of the arrow.go to have everything in one place. I also moved the Arrow specific functions from the statement.go to the arrow.go
. Please review.
Next, we should validate the Arrow specific code for any memory leaks (which can easily occur in a CGO environment) and check for correctness/optimizations. As I have no experience with Apache Arrow, I need to understand it first. Additionally, there were some refactorings that were not related to the Arrow feature. I've tried to keep most of them. If anything was lost, I suggest fixing that in a separate PR.
@marcboeker Thanks for the review. Keeping C code in arrow.go
is ok, since there is not much code in there. Initial idea was to follow the approach from the adbc driver which stores these structures in adbc.h
.
Let's discuss the approach with driver.Conn
in the function signature instead of any for the conn. Why do you think it's a necessary step to make type assertion conn, ok := driverConn.(driver.Conn)
, when we can just skip it and pass driverConn from the conn.Raw(func(driverConn any) error {
instead?
Also I think users should be encouraged to use conn.Raw
which returns any
, rather than connector.Connect
returning Conn
, because in this case sql package maintains the pool of connections. That's why I also changed README and test files.
Are you sure sure we need to provide both Query and QueryContext methods in new interfaces?
// Deprecated: Use QueryContext instead.
func (a *Arrow) Query(query string, args ...any) (array.RecordReader, error) {
return a.QueryContext(context.Background(), query, args)
}
AFAIK QueryContext method in the standard library was added after context package was developed, and it's the only reason why it was not included in Query method to follow backwards compatibility. So it will be safe to provide Query method accepting context argument since the beginning.
@marcboeker Do you think having working examples in the README is a bad idea or should it be addressed in a separate PR? The same question is for Connector refactoring, which is stateful but cannot be closed.
Let's discuss the approach with
driver.Conn
in the function signature instead of any for the conn. Why do you think it's a necessary step to make type assertionconn, ok := driverConn.(driver.Conn)
, when we can just skip it and pass driverConn from theconn.Raw(func(driverConn any) error {
instead?
Changing a function signature, even if it does not break things, is unusual in Go without bumping the major release number. Normally you initialize a new Arrow connection with
connector, err := duckdb.NewConnector("", nil)
if err != nil {
...
}
conn, err := connector.Connect(context.Background())
if err != nil {
...
}
defer conn.Close()
// Retrieve Arrow from connection.
ar, err := duckdb.NewArrowFromConn(conn)
which works perfectly fine. In most cases you would not need a conn.Raw(func(driverConn any) {})
when not using the raw DB conn beforehand.
Also I think users should be encouraged to use
conn.Raw
which returnsany
, rather thanconnector.Connect
returningConn
, because in this case sql package maintains the pool of connections. That's why I also changed README and test files.
The idea behind the Connector was to have a convenient way of running an init func, that for example installs necessary plugins.
Are you sure sure we need to provide both Query and QueryContext methods in new interfaces?
I'm fine with removing it. My idea was to keep the interface in sync with the SQL driver and remove Query()
once the SQL driver does it first.
@marcboeker Do you think having working examples in the README is a bad idea or should it be addressed in a separate PR? The same question is for Connector refactoring, which is stateful but cannot be closed.
What do you mean by "working examples" in the README? Are they broken?
Regarding the Connector: why should it not be possible to close it? https://github.com/marcboeker/go-duckdb/blob/495b2cf3dbf7d4716a975f895b560f54886e9813/duckdb.go#L107
Could you please elaborate a bit. Thanks!
@marcboeker ok, then let's keep the same interface. However, changing to the 'any' type won't break backwards compatibility, so a major version won't be needed.
The idea behind the Connector was to have a convenient way of running an init func, that for example installs necessary plugins.
I aggree with you, it's very convenient to have init for every new connection.
I'm fine with removing it. My idea was to keep the interface in sync with the SQL driver and remove Query() once the SQL driver does it first.
I guess sql package won't remove Query until GO 2, which is unlikely. So I suggest we leave only Query(ctx, ...) or QueryContext(ctx, ...). Which option do you prefer?
Regarding the Connector: why should it not be possible to close it?
Well, it's kinda possible, but too hacky. connector
type is hidden behind driver.Connector.
func (Driver) OpenConnector(dataSourceName string) (driver.Connector, error) {
return createConnector(dataSourceName, func(execerContext driver.ExecerContext) error { return nil })
}
And to get access to Close method it's needed to make type assertion.
connector, err := duckdb.NewConnector("", nil)
if err != nil {
...
}
connector.(io.Closer).Close() // Hidden behind driver.Connector returned by NewConnector
What do you mean by "working examples" in the README? Are they broken?
I meant, that you can't just copy them to your IDE and expect to be run as is. With the changes I proposed, I had advantages like syntax highlighting, autoformatting, etc. like with normal code. If you prefer examples like they are now, it's fine by me.
Let me give more concrete example regarding connector.Close. Let's take README example.
connector, err := NewConnector("test.db", nil)
if err != {
...
}
conn, err := connector.Connect(context.Background())
if err != {
...
}
defer conn.Close()
// Retrieve appender from connection (note that you have to create the table 'test' beforehand).
appender, err := NewAppenderFromConn(conn, "", "test")
if err != {
...
}
defer appender.Close()
Here connector is opened and Database is opened internally. The bug here is that connector is never closed. Unless it's passed to db := sql.OpenDB()
and then closed with db.Close()
.
Here's part of sql package code:
func (db *DB) Close() error {
....
if c, ok := db.connector.(io.Closer); ok {
err1 := c.Close()
if err1 != nil {
err = err1
}
}
We end up with 2 options, leave it as is and work with database only through standard sql package or expose connector Close method to the outer world. It can be done by returning ConnectorCloser interface. Also in this case we should ensure Connector is fine with multiple close calls.
I guess sql package won't remove Query until GO 2, which is unlikely. So I suggest we leave only Query(ctx, ...) or QueryContext(ctx, ...). Which option do you prefer?
Okay, then let's remove Query
as QueryContext
is also used in the regular driver and is the new standard.
Regarding the Connector: Thanks for the detailed explanation. You probably refer to this issue? https://github.com/golang/go/issues/41790
I would suggest to first merge the Arrow feature and then take care of implementing the io.Closer
interface in a second PR.
I meant, that you can't just copy them to your IDE and expect to be run as is. With the changes I proposed, I had advantages like syntax highlighting, autoformatting, etc. like with normal code. If you prefer examples like they are now, it's fine by me.
Okay, got it. What do you think of keeping the samples in the README short and concise and adding two new examples as Go files for the Arrow and Appender to examples/
? We can then also link from the README to the examples.
You are right about the the issue. Here's this check https://go-review.googlesource.com/c/go/+/258360/6/src/database/sql/sql.go from https://go-review.googlesource.com/c/go/+/258360
@marcboeker Yeah, sure. I can file a PR and let's resolve it separately.
Next, we should validate the Arrow specific code for any memory leaks (which can easily occur in a CGO environment) and check for correctness/optimizations. As I have no experience with Apache Arrow, I need to understand it first.
I would appreciate some help with it :)
@marcboeker Yeah, sure. I can file a PR and let's resolve it separately.
Great, thanks!
I would appreciate some help with it :)
I've done some simplifications as C.calloc
already returns an unsafe.Pointer
. Do you see any impacts?
And one test case didn't return anything as there were no rows in the DB. I've added some test data.
If you don't see any issues, I would suggest merging it.
I don't see any impact. However type casting C pointers is cumbersome and we can improve later if needed.
Thanks! Let's merge it @marcboeker 👍🏻
@levakin Thanks for your hard work on this 🙏 It's merged. Will release a new version when #143 is merged.
This PR implements Connection method QueryArrowContext to get query result in Apache Arrow format. Connection type is now exported and can be used in similar way as pgx driver. See example here.
Partly implements #75 , some methods are still missing.