marcboeker / go-duckdb

go-duckdb provides a database/sql driver for the DuckDB database engine.
MIT License

Appender performance hangs (> 10 minutes) when flushing #246

Closed dxh9845 closed 4 months ago

dxh9845 commented 4 months ago

Hello, thanks for this library! I am seeing very slow performance when inserting rows with the Appender API. When appending pairs (about 995,247 individual rows), I see an extreme slowdown when calling .Flush().

Version: github.com/marcboeker/go-duckdb v1.7.0
DuckDB library version: v1.0.0

type (
    RowPairDTO struct {
        IntID      int64
        OtherIntID int64
    }
)

func (dbr *duckDbRepository) AddLargeAmountOfRows(ctx context.Context, mappingRunID int32, dtos []RowPairDTO) error {
    const operation = "AddLargeAmountOfRows"
    connector, err := dbr.connAdapter.GetDriverConnector()
    if err != nil {
        return RepositoryError{
            operation: operation,
            err:       fmt.Errorf("failed to get driver connector: %w", err),
        }
    }

    driverConn, err := connector.Connect(ctx)
    if err != nil {
        return RepositoryError{
            operation: operation,
            err:       fmt.Errorf("failed to get driver conn: %w", err),
        }
    }

    defer driverConn.Close()

    appender, err := duckdb.NewAppenderFromConn(driverConn, "", "row_pairs")
    if err != nil {
        return RepositoryError{
            operation: operation,
            err:       fmt.Errorf("failed to create new appender: %w", err),
        }
    }

    defer appender.Close()

    // Append to the table
    for idx, mapping := range dtos {
        err = appender.AppendRow(mapping.IntID, mapping.OtherIntID, mappingRunID)
        if err != nil {
            return RepositoryError{
                operation: operation,
                err:       fmt.Errorf("failed to insert mapping at idx %d into row_pairs table: %w", idx, err),
            }
        }
    }

    // Explicitly flush the appender before closing
    // THIS is the part that hangs.
    err = appender.Flush()
    if err != nil {
        return RepositoryError{
            operation: operation,
            err:       fmt.Errorf("failed to flush appender: %w", err),
        }
    }

    return nil
}

In my debugging, I see the code is slow specifically around these lines. By my rough timing, it takes around 13 minutes on a MacBook Pro M1, and I've also experienced this in a production Kubernetes cluster with more processing power.

Please let me know how I can help debug or improve performance.

dxh9845 commented 4 months ago

One other note: though the Godoc comments suggest calling .Close on an appender rather than .Flush, the performance appears roughly the same to me either way.

taniabogatsch commented 4 months ago

Hi @dxh9845, thanks for the bug report. I could not reproduce this with random data (see below). Please provide a complete, self-contained example, including the table schema and row data, or a self-contained reproduction with dummy data.

This is my attempt at reproducing your issue.

package main

import (
    "context"
    "database/sql"
    "fmt"
    "log"
    "math/rand/v2"
    "time"

    "github.com/marcboeker/go-duckdb"
)

func main() {
    AddLargeAmountOfRows(context.Background())
}

func AddLargeAmountOfRows(ctx context.Context) error {
    start := time.Now()

    connector, err := duckdb.NewConnector("", nil)
    if err != nil {
        return err
    }
    _, err = sql.OpenDB(connector).Exec(`CREATE TABLE row_pairs (c1 BIGINT, c2 BIGINT, c3 BIGINT)`)
    if err != nil {
        return err
    }
    driverConn, _ := connector.Connect(ctx)
    defer driverConn.Close()
    appender, _ := duckdb.NewAppenderFromConn(driverConn, "", "row_pairs")
    defer appender.Close()

    // Append random rows to the table.
    for i := 0; i < 996000; i++ {
        err = appender.AppendRow(int64(rand.IntN(996000)), int64(rand.IntN(996000)), int64(rand.IntN(996000)))
        if err != nil {
            fmt.Println(err)
        }
    }

    // Explicitly flush the appender before closing.
    // THIS is the part that hangs.
    err = appender.Flush()
    if err != nil {
        fmt.Println(err)
    }

    elapsed := time.Since(start)
    log.Printf("finished, took %s", elapsed)
    return nil
}

Output:

2024/07/14 17:03:40 finished, took 352.199916ms

dxh9845 commented 4 months ago

Thanks for the response. I figured it out: my existing table had ART indexes that were slowing performance significantly. My solution was to drop the indexes before the bulk insert and rebuild them after the data was loaded.
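
For anyone who hits the same thing, here is a minimal sketch of that workaround (it assumes the same imports as the repro above; the index name row_pairs_idx and indexed column c1 are hypothetical placeholders):

// indexAwareBulkLoad sketches the workaround: drop the ART index, bulk-load
// through the appender, then rebuild the index once at the end.
func indexAwareBulkLoad(ctx context.Context, connector *duckdb.Connector, rows [][3]int64) error {
    db := sql.OpenDB(connector)

    // Drop the index so the appender avoids per-row index maintenance.
    if _, err := db.ExecContext(ctx, `DROP INDEX IF EXISTS row_pairs_idx`); err != nil {
        return err
    }

    driverConn, err := connector.Connect(ctx)
    if err != nil {
        return err
    }
    defer driverConn.Close()

    appender, err := duckdb.NewAppenderFromConn(driverConn, "", "row_pairs")
    if err != nil {
        return err
    }
    defer appender.Close()

    for _, r := range rows {
        if err := appender.AppendRow(r[0], r[1], r[2]); err != nil {
            return err
        }
    }
    if err := appender.Flush(); err != nil {
        return err
    }

    // Rebuild the index in one bulk pass after all rows are in place.
    _, err = db.ExecContext(ctx, `CREATE INDEX row_pairs_idx ON row_pairs (c1)`)
    return err
}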

taniabogatsch commented 4 months ago

Great to hear. That's usually our recommendation for indexes, as row-based index key insertions are slow. Any insertions happening after index creation are currently implemented as row-based index key insertions, even when using the appender. For FKs, we're improving their performance (see here). However, the fastest way to build an index will always be bulk creation after the data has been loaded.