golang / go

The Go programming language
https://go.dev
BSD 3-Clause "New" or "Revised" License

proposal: database/sql: support a way to perform bulk actions #5171

Open the42 opened 11 years ago

the42 commented 11 years ago
The current implementation of database/sql doesn't provide a way to insert multiple rows
at once without going to the wire on every call to

db.Exec

There are APIs elsewhere which either provide a general way to perform bulk actions

cf.  SQLBulkCopy of ADO.NET 2.0 [1]

or let you specify up to how many statements should be grouped together

cf. Batch Update of ADO.NET 2.0 DataAdapter Object [1]

or simply support binding an array to Exec, over which Exec iterates internally,
preventing excessive wire communication. [2]

[1] Codeproject, "Multiple Ways to do Multiple Inserts"
http://www.codeproject.com/Articles/25457/Multiple-Ways-to-do-Multiple-Inserts
[2] Python PEP 249 -- Python Database API Specification v2.0
http://www.python.org/dev/peps/pep-0249/#executemany
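For readers unfamiliar with PEP 249's executemany, a rough Go equivalent of its call shape, built on top of the existing Exec signature, might look like the sketch below. ExecMany, execer, and fakeExecer are hypothetical names invented for this sketch, not database/sql API, and the loop still makes one round trip per row; that internal round trip is exactly what this proposal asks the driver to eliminate.

```go
package main

import (
	"database/sql"
	"fmt"
)

// execer matches the Exec method shared by *sql.DB and *sql.Tx.
// It is a hypothetical interface used only for this sketch.
type execer interface {
	Exec(query string, args ...interface{}) (sql.Result, error)
}

// ExecMany mirrors the call shape of Python's executemany: one statement,
// many argument tuples. Note that this loop still makes one round trip per
// row; the point of the proposal is to let the driver batch these internally.
func ExecMany(db execer, query string, batch [][]interface{}) error {
	for _, args := range batch {
		if _, err := db.Exec(query, args...); err != nil {
			return err
		}
	}
	return nil
}

// fakeExecer counts calls so the sketch can run without a real database.
type fakeExecer struct{ calls int }

func (f *fakeExecer) Exec(query string, args ...interface{}) (sql.Result, error) {
	f.calls++
	return nil, nil
}

func main() {
	f := &fakeExecer{}
	_ = ExecMany(f, "INSERT INTO films (code, title) VALUES (?, ?)", [][]interface{}{
		{"B6717", "Tampopo"},
		{"HG120", "The Dinner Game"},
	})
	fmt.Println(f.calls) // 2: one Exec per row
}
```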
bradfitz commented 11 years ago

Comment 1:

Labels changed: added priority-later, suggested, removed priority-triage.

Status changed to Accepted.

rsc commented 10 years ago

Comment 2:

Labels changed: added go1.3maybe.

rsc commented 10 years ago

Comment 3:

Labels changed: added release-none, removed go1.3maybe.

rsc commented 10 years ago

Comment 4:

Labels changed: added repo-main.

kardianos commented 10 years ago

Comment 5:

Of note, doing this well would require new database/sql API for both the driver and the
front-end code.

Having implemented several protocols that include a bulk copy method, this needs to be
called out as different, as much of the control that is available in an INSERT statement
is exposed differently than in SQL.

I'm wary of any suggestions to bind to an array, as arrays are legitimate data types in
several RDBMSs. I'll be implementing a general interface shortly in the rdb front end.
Send me a line if you'd like to discuss.
bmharper commented 9 years ago

For what it's worth - I've just tried using the lib/pq driver's Copy functionality to do bulk loading, and although I can't comment on whether this would work for other DB drivers, it seems like a reasonable API.

perillo commented 8 years ago

The standard syntax for multi-value INSERT is (from PostgreSQL documentation):

INSERT INTO films (code, title, did, date_prod, kind) VALUES
    ('B6717', 'Tampopo', 110, '1985-02-10', 'Comedy'),
    ('HG120', 'The Dinner Game', 140, DEFAULT, 'Comedy');

What about adding a Value type that can be passed as the args argument to Exec or Query?

As an example:

type Value []interface{} // defined in the sql package

var batch []Value
for i := 0; i < N; i++ {
    batch = append(batch, Value{1, 1.3, "x"})
}
db.Exec("INSERT INTO films (code, title, did, date_prod, kind) VALUES ?", batch)

This will not require any changes to the existing interface.

kostya-sh commented 8 years ago

This would require holding the data for the whole batch in memory. Also, this proposal doesn't allow batching updates where the database supports them.

FYI, there is an implementation of batch insert in github.com/lib/pq based on COPY.

perillo commented 8 years ago

On Mon, Mar 7, 2016 at 4:17 PM, kostya-sh notifications@github.com wrote:

This will require holding data for the whole batch in memory.

Of course; this is required by the standard INSERT statement.

Also this proposal doesn't allow to batch updates if supported by database.

By batch, do you mean multiple SQL statements in the same query, separated by semicolon?

I will be happy with just the support for multi-value INSERT, since it is directly supported by the SQL standard (using the VALUES construct):

type Tuple []interface{}

type Values []Tuple

FYI, there is an implementation of batch insert in github.com/lib/pq based on COPY.

COPY is not standard, and github.com/lib/pq seems (just looking at the API) to store the whole batch in memory.

kostya-sh commented 8 years ago

By batch, do you mean multiple SQL statements in the same query, separated by semicolon?

Yes, though this would require support for multiple return values.

COPY is not standard, and the github.com/lib/pq seems (just looking at the API) to store the whole batch in memory.

COPY is indeed not standard, but it is fast, and the driver doesn't hold the whole batch in memory. Have a look at the implementation at https://github.com/lib/pq/blob/master/copy.go

I agree it would be nice to have a generic batch API, but it is quite difficult to design a single API that allows drivers to choose the optimal method for implementing batched operations. I think using the driver library directly is a quite good compromise.

BTW, in PostgreSQL it is also possible to use the following SQL for bulk insert:

INSERT INTO mytable (col1, col2, col3) VALUES (unnest(?), unnest(?), unnest(?))

I haven't used it though, and I don't know if the Go driver supports arrays.

perillo commented 8 years ago

On Mon, Mar 7, 2016 at 5:47 PM, kostya-sh notifications@github.com wrote:

[...]

BTW, in PostgreSQL it is also possible to use the following SQL for bulk insert:

INSERT INTO mytable (col1, col2, col3) VALUES (unnest(?), unnest(?), unnest(?))

I haven't used it though, and I don't know if the Go driver supports arrays.

This is what I was talking about. And it is not PostgreSQL-specific, but SQL standard. It does not use an array, but the VALUES list: http://www.postgresql.org/docs/9.5/static/sql-values.html

In Go, it can be defined, e.g.:

type Tuple []interface{} // Since Row and Value are already defined

type Values []Tuple

This has the advantage that a Values value can be passed as a parameter to the Query or Exec functions without changing the sql package API.

natemurthy commented 6 years ago

@perillo I tried your method:

type Value []interface{} // defined in the sql package

var batch []Value
for i := 0; i < N; i++ {
    batch = append(batch, Value{1, 1.3, "x"})
}
db.Exec("INSERT INTO films (code, title, did, date_prod, kind) VALUES ?", batch)

but results in the error:

Error: Expected nil, but got: &errors.errorString{s:"sql: converting argument $1 type: unsupported type []Value, a slice of slice"}
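A common workaround today, absent any bulk API in database/sql, is to expand the placeholder groups yourself and pass a flattened argument list to a single Exec. The sketch below uses a hypothetical helper, buildMultiInsert, to build the multi-value INSERT statement:

```go
package main

import (
	"fmt"
	"strings"
)

// buildMultiInsert (hypothetical helper) expands a multi-row INSERT into a
// single statement with one placeholder group per row, plus the flattened
// argument slice to pass to a single db.Exec call.
func buildMultiInsert(table string, cols []string, rows [][]interface{}) (string, []interface{}) {
	// One "(?,?,...)" group per row.
	group := "(" + strings.TrimRight(strings.Repeat("?,", len(cols)), ",") + ")"
	groups := make([]string, 0, len(rows))
	args := make([]interface{}, 0, len(rows)*len(cols))
	for _, row := range rows {
		groups = append(groups, group)
		args = append(args, row...)
	}
	query := fmt.Sprintf("INSERT INTO %s (%s) VALUES %s",
		table, strings.Join(cols, ", "), strings.Join(groups, ", "))
	return query, args
}

func main() {
	q, args := buildMultiInsert("films", []string{"code", "title"},
		[][]interface{}{{"B6717", "Tampopo"}, {"HG120", "The Dinner Game"}})
	fmt.Println(q)         // INSERT INTO films (code, title) VALUES (?,?), (?,?)
	fmt.Println(len(args)) // 4
}
```

Caveats: placeholder syntax varies by driver (? for MySQL, $1, $2, ... for PostgreSQL), and most servers cap the number of bound parameters per statement, so large batches still need to be chunked.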
pimbrouwers commented 6 years ago

Totally beyond the scope of Go's standard library. You could always manually open a transaction, process all your inserts individually (don't worry, there's connection pooling), and commit the transaction. Avoiding the overhead of the implicit transaction on each iteration will be a huge win.

For accessing the bulk features of the various RDBMSs (like bcp in SQL Server, for example), you can always save a CSV to disk and use exec to run the batch.

kardianos commented 6 years ago

@pimbrouwers I disagree. I think it would be great to create a standard bulk data interface. Yes, opening a transaction will increase the speed of many inserts. But bulk interfaces are also useful. I personally hate relying on native commands like bcp to be present on the host system.

pimbrouwers commented 6 years ago

Unless the standard interface has platform-specific adapters, it won't work, because there is no adhered-to standard for bulk actions; for example, SQL Server uses BULK INSERT... and MySQL uses LOAD DATA INFILE....

As far as relying on the presence of bulk tools, I see nothing wrong with that. You physically cannot install SQL Server without bcp, which SQL Server relies upon internally for so many things. So the presence of SQL Server means bcp is also present.

kardianos commented 6 years ago

Yes, like query protocols, each would need a driver.

I run my app on Linux and MS SQL Server on Windows. The app host won't have bcp installed. Often they are different boxes, or the app runs in a bare container, or the database is hosted as a service.

thda commented 4 years ago

I think a way to batch values before sending them is needed. It would save network roundtrips and allow the driver to parse the statements and transparently switch to bulk inserts.

For reference, JDBC provides three batching methods: addBatch, clearBatch, and executeBatch.

Statements could implement 2 methods to mirror this functionality:

// add the values to a batch of statements
AddValues(args []driver.NamedValue) error
// initialize/clear the batch of values
ClearValues() error

Then calling an Exec/ExecContext would send the batched statements to the server. I don't know if a context would be needed for AddValues here, as there probably should not be round trips to the server when enqueuing Values.

muhlemmer commented 3 years ago

I crafted a proposal in my own fork. I'm open to comments.

If we get to something definitive, I'll write the necessary unit tests and submit a CL. I'm also thinking of preparing a reference implementation on a fork of lib/pq, to be PR-ed after this ever gets into a release.

First let's talk if the proposed API is useful.

I would like to note I'm a PostgreSQL / lib/pq user. The below proposal is based on what I know about those. Perhaps naming of types is biased towards the PostgreSQL terminology. We can change it to anything more suitable if required.

Driver interface

In order to truly benefit from increased performance and fewer (or single) round trips, a bulk action should happen asynchronously. If that is not the case, a prepared statement would suffice anyway.

Currently lib/pq uses a prepared statement, but switches to asynchronous streaming in the implementation code. Callers need to call stmt.Exec(args) for every row of data. Exec may return an error that is related to this or an earlier call, due to buffering. In order to flush the buffer and terminate the COPY, one call to Exec() without args must be made. Although this is workable, it is a confusing interface. For instance, one can't reliably determine the amount of data sent or written to the database.

In order to properly describe and work with the asynchronous relationship of data, result and error, I would like to propose the use of channels. The driver shall return a write-only data and a read-only result channel.

To be discussed:

// CopyResult signals termination of a CopyIn.
// Both Res and Err may be populated in case of a partial write.
type CopyResult struct {
    Res Result
    Err error
}

// Copier is an optional interface that may be implemented by a Conn.
type Copier interface {

    // CopyIn sends a batch of data to the server in a single execution.
    // The driver may do this asynchronously.
    //
    // The data channel may be buffered.
    // The sender closes the data channel when all data is sent.
    //
    // Exactly one CopyResult must always be sent on the result channel,
    // either after all data is flushed, or after encountering an error.
    CopyIn(ctx context.Context, table string, columns ...string) (data chan<- []NamedValue, result <-chan CopyResult)
}

sql package interface

Here I went with an interface type, passed to CopyIn(). With batches, the data set might come from another source, like a file, decoder, or network. This removes the need for the data set to be loaded into memory. Conveniently, io.EOF is used to signal the end of a batch.

If the Batch interface is accepted, I intend to provide a helper function that can turn [][]interface{} into a Batch.

// Batch is input to a CopyIn command.
type Batch interface {
    Table() string
    Columns() []string
    // Next returns a row of data in a batch.
    // It should return io.EOF when no more data is available.
    Next() ([]interface{}, error)
}

// ErrNotCopier is returned when the driver does not support batch operations.
var ErrNotCopier = errors.New("sql: driver does not support CopyIn")

// CopyIn sends the provided batch in a single, asynchronous operation.
//
// In case of an error, partial data may have been written to the database.
// If supported, a driver may return both a Result and an error to indicate the number of rows written.
func (tx *Tx) CopyIn(ctx context.Context, batch Batch) (Result, error) {
    dc, release, err := tx.grabConn(ctx)
    if err != nil {
        return nil, err
    }
    return tx.db.copyDC(ctx, dc, release, batch)
}

The rest of the copyDC implementation can be found here. It compiles, but I haven't tested anything yet, pending some feedback here.

odeke-em commented 3 years ago

Cool, thank you @muhlemmer for the proposal ideas. This issue is very old and I didn’t want to edit the original issue, but given your elaborate post in https://github.com/golang/go/issues/5171#issuecomment-710774842, perhaps let’s reuse this issue as is and am kindly putting it on the radar of the @golang/proposal-review team to skip to the linked comment.