marcboeker / go-duckdb

go-duckdb provides a database/sql driver for the DuckDB database engine.
MIT License
623 stars 97 forks source link

refactor Connector to expose Close method #151

Closed levakin closed 6 months ago

levakin commented 7 months ago

At the moment Connector is stateful but cannot be closed explicitly. connector type is hidden behind driver.Connector.

func (Driver) OpenConnector(dataSourceName string) (driver.Connector, error) {
    return createConnector(dataSourceName, func(execerContext driver.ExecerContext) error { return nil })
}

And to get access to Close method it's needed to make type assertion.

connector, err := duckdb.NewConnector("", nil)
if err != nil {
  ...
}
connector.(io.Closer).Close() // Hidden behind driver.Connector returned by NewConnector

Let me give more concrete example regarding connector.Close. Let's take README example.

connector, err := NewConnector("test.db", nil)
if err != {
  ...
}
conn, err := connector.Connect(context.Background())
if err != {
  ...
}
defer conn.Close()

// Retrieve appender from connection (note that you have to create the table 'test' beforehand).
appender, err := NewAppenderFromConn(conn, "", "test")
if err != {
  ...
}
defer appender.Close()

Here connector is opened and Database is opened internally. The bug here is that connector is never closed. Unless it's passed to db := sql.OpenDB() and then closed with db.Close().

Here's part of sql package code:

func (db *DB) Close() error {
....
if c, ok := db.connector.(io.Closer); ok {
        err1 := c.Close()
        if err1 != nil {
            err = err1
        }
    }

We end up with 2 options, leave it as is and work with database only through standard sql package or expose connector Close method to the outer world. It can be done by returning *Connector. Also in this case we should ensure Connector is fine with multiple close calls.

https://github.com/golang/go/issues/41790 Here's this check https://go-review.googlesource.com/c/go/+/258360/6/src/database/sql/sql.go from https://go-review.googlesource.com/c/go/+/258360

Returning concrete type *Connector is suggested by CodeReviewComments https://go.dev/wiki/CodeReviewComments#interfaces

marcboeker commented 7 months ago

@levakin Thanks for the PR. I've dug a little deeper under the hood and am quite sure, that we do not need to return *Connector to make sure it's closed. The db.Close() method does a cast to io.Closer automatically.

When debugging theTestArrow -> "select series" test in the main branch and adding a breakpoint in the sql.go after the cast to io.Closer is made, the breakpoint is visited.

Screenshot 2024-01-23 at 00 28 17

And directly after the breakpoint in the connector.Close() method is triggered as well.

Screenshot 2024-01-23 at 00 28 25

So to my understanding, the connector is closed correctly.

What is missing, though, is the call to duckdb_destroy_config, which you have added. We should definitely merge this piece.

What do you think?

marcboeker commented 7 months ago

I've added the duckdb_destroy_config to the PR, where the config preparation was refactored.

levakin commented 7 months ago

@marcboeker I agree that there is an option for the developer to guess about type casting. But I'm still convinced that it shouldn't be hidden. It should be easy to do the right thing and then package would have good interface to work with. The change also follows guidelines encouraged by the community aka code review comments

taniabogatsch commented 7 months ago

I favor the changes proposed in this PR - DuckDB's C API does not mind if a nullptr is passed to duckdb_close during multiple close calls. As highlighted in @levakin's PR message, DuckDB requires close calls in multiple situations to clean up memory properly. The proposed changes make that pattern more explicit to the user.

On another note, you (cc @marcboeker) added some of the changes in this PR to #149, which are now on main. When running a leak profiler on the main branch, TestArrow still produces two leaks related to duckdb_open_ext. I think we are not de-allocating the configuration correctly yet. Running the same profiler on this PR shows that the leaks are fixed.

Do you want me to open a separate PR to fix these leaks, or should we merge them as part of this PR?

Main branch. Screenshot 2024-01-24 at 16 48 52

This PR. Screenshot 2024-01-24 at 16 47 17

On another note, OS_xpc_date can be ignored, and I am still looking into the leak caused by runtime.asmcgocall.abi0 as part of _cgo_4daf9482456f_Cfunc__Cmalloc, which repeats in many go-duckdb tests.

taniabogatsch commented 7 months ago

TLDR: I looked more into this, also w.r.t. #153. We do not need to keep the configuration alive after opening a DuckDB database. However, it is crucial to explicitly close the connector, as we leak memory otherwise. This is possible with defer c.(io.Closer).Close(), but preferably, we expose a Close function, as addressed in the PR.

I am currently finishing up a PR that (1) uses the changes of this PR to Close connectors correctly, (2) destroys the configuration object correctly, as addressed in #153 and #151, and (3) addresses a few other leaks.

Now, my more in-depth analysis of this.

@marcboeker, here, you say that the breakpoint in TestArrow enters Close, which closes the DuckDB database.

func (c *connector) Close() error {
    C.duckdb_close(c.db)
    c.db = nil
    return nil
}

However, we are only closing the in-memory database that we opened with db := openDB(t) followed by defer db.Close(). This is not the same as the in-memory databases that we connect to in the respective "select series" and "select long series" tests. These are separate in-memory databases created with their own DSN and var db C.duckdb_database pointers. And we never close them properly.

Given the following extracted test, we never hit func (c *connector) Close() error when debugging.

func TestConnectorLeaks(t *testing.T) {
    t.Parallel()

    t.Run("select series", func(t *testing.T) {
        c, err := NewConnector("", nil)
        require.NoError(t, err)
        // defer c.(io.Closer).Close()

        conn, err := c.Connect(context.Background())
        require.NoError(t, err)
        defer conn.Close()

        ar, err := NewArrowFromConn(conn)
        require.NoError(t, err)

        rdr, err := ar.QueryContext(context.Background(), "SELECT * FROM generate_series(1, 10)")
        require.NoError(t, err)
        defer rdr.Release()

        for rdr.Next() {
            rec := rdr.Record()
            require.Equal(t, int64(10), rec.NumRows())
            require.NoError(t, err)
        }

        require.NoError(t, rdr.Err())
    })

    t.Run("select long series", func(t *testing.T) {
        c, err := NewConnector("", nil)
        require.NoError(t, err)
        // defer c.(io.Closer).Close()

        conn, err := c.Connect(context.Background())
        require.NoError(t, err)
        defer conn.Close()

        ar, err := NewArrowFromConn(conn)
        require.NoError(t, err)

        rdr, err := ar.QueryContext(context.Background(), "SELECT * FROM generate_series(1, 10000)")
        require.NoError(t, err)
        defer rdr.Release()

        var totalRows int64
        for rdr.Next() {
            rec := rdr.Record()
            totalRows += rec.NumRows()
        }

        require.Equal(t, int64(10000), totalRows)

        require.NoError(t, rdr.Err())
    })
}

We can even see this in the leak profiler, which runs the test multiple times - we leak in duckdb_open_ext. In the above example, I commented out defer c.(io.Closer).Close(). If we don't comment defer c.(io.Closer).Close() out, i.e., if we explicitly call Close on the connector, we don't leak.

Without defer c.(io.Closer).Close().

Screenshot 2024-01-25 at 13 09 24

With defer c.(io.Closer).Close().

Screenshot 2024-01-25 at 13 10 08
marcboeker commented 6 months ago

However, we are only closing the in-memory database that we opened with db := openDB(t) followed by defer db.Close(). This is not the same as the in-memory databases that we connect to in the respective "select series" and "select long series" tests. These are separate in-memory databases created with their own DSN and var db C.duckdb_database pointers. And we never close them properly.

I've used the "select series" test just to check if the Close method of the connector is called, which is true. I have not verified that all opened connections in the test were closed properly. However, since the Close method is now exposed, we can control the closing manually.