lib / pq

Pure Go Postgres driver for database/sql
https://pkg.go.dev/github.com/lib/pq
MIT License

Swap driver.Value for driver.NamedValue in internal APIs #1067

Open kevinburke opened 2 years ago

kevinburke commented 2 years ago

The new QueryContext and ExecContext APIs both take a []driver.NamedValue instead of a []driver.Value. Because pq uses driver.Value internally, the first thing both methods do is copy the arguments into a new slice:

```go
// Implement the "StmtExecContext" interface
func (st *stmt) ExecContext(ctx context.Context, args []driver.NamedValue) (driver.Result, error) {
    list := make([]driver.Value, len(args))
    for i, nv := range args {
        list[i] = nv.Value
    }
    // ... (rest of the method elided)
}
```

This means that every call to these methods with arguments allocates. Note also that database/sql uses QueryContext whenever the driver implements it, so every query issued through database/sql now goes through this code path:

```go
// queryDC executes a query on the given connection.
// The connection gets released by the releaseConn function.
// The ctx context is from a query method and the txctx context is from an
// optional transaction context.
func (db *DB) queryDC(ctx, txctx context.Context, dc *driverConn, releaseConn func(error), query string, args []interface{}) (*Rows, error) {
    queryerCtx, ok := dc.ci.(driver.QueryerContext)
    var queryer driver.Queryer
    if !ok {
        queryer, ok = dc.ci.(driver.Queryer)
    }
    if ok {
        var nvdargs []driver.NamedValue
        var rowsi driver.Rows
        var err error
        withLock(dc, func() {
            nvdargs, err = driverArgsConnLocked(dc.ci, nil, args)
            if err != nil {
                return
            }
            rowsi, err = ctxDriverQuery(ctx, queryerCtx, queryer, query, nvdargs)
        })
        // ... (rest of the function elided)
    }
    // ...
}
```
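The interface checks at the top of queryDC can be sketched in isolation. This is a minimal illustration, not database/sql's actual code: `ctxConn`, `legacyConn` and `pickPath` are hypothetical names. It shows that a connection implementing `driver.QueryerContext` is always the one chosen, so it always receives `[]driver.NamedValue`, and a driver that immediately converts back to `[]driver.Value` pays for that copy on every query.

```go
package main

import (
	"context"
	"database/sql/driver"
	"errors"
	"fmt"
)

// ctxConn is a hypothetical connection implementing both the legacy
// driver.Queryer and the context-aware driver.QueryerContext.
type ctxConn struct{}

func (ctxConn) Query(query string, args []driver.Value) (driver.Rows, error) {
	return nil, errors.New("not implemented")
}

func (ctxConn) QueryContext(ctx context.Context, query string, args []driver.NamedValue) (driver.Rows, error) {
	return nil, errors.New("not implemented")
}

// legacyConn is a hypothetical connection implementing only driver.Queryer.
type legacyConn struct{}

func (legacyConn) Query(query string, args []driver.Value) (driver.Rows, error) {
	return nil, errors.New("not implemented")
}

// pickPath mirrors the interface checks in queryDC: QueryerContext wins
// whenever it is implemented, Queryer is the fallback, and otherwise
// database/sql prepares a statement instead.
func pickPath(ci interface{}) string {
	if _, ok := ci.(driver.QueryerContext); ok {
		return "QueryContext"
	}
	if _, ok := ci.(driver.Queryer); ok {
		return "Query"
	}
	return "Prepare"
}

func main() {
	fmt.Println(pickPath(ctxConn{}))    // QueryContext
	fmt.Println(pickPath(legacyConn{})) // Query
	fmt.Println(pickPath(struct{}{}))   // Prepare
}
```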

If pq's internal APIs used driver.NamedValue instead of driver.Value, this copy, and its allocation, could be skipped in the most common case.

The patch at https://github.com/kevinburke/pq/compare/named-value?expand=1 improves the PreparedSelectString benchmark by about 4% (and PreparedSelectSeries by under 1%) on my Mac; the rest of the results appear to be noise.

name                                  old time/op    new time/op    delta
BoolArrayScanBytes-10                    530ns ± 1%     530ns ± 1%    ~     (p=0.548 n=5+5)
BoolArrayValue-10                       66.3ns ± 1%    66.7ns ± 0%    ~     (p=0.095 n=5+5)
ByteaArrayScanBytes-10                   980ns ± 2%     976ns ± 1%    ~     (p=0.690 n=5+5)
ByteaArrayValue-10                       279ns ± 2%     281ns ± 2%    ~     (p=0.310 n=5+5)
Float64ArrayScanBytes-10                 960ns ± 1%     956ns ± 4%    ~     (p=0.310 n=5+5)
Float64ArrayValue-10                     969ns ± 2%     966ns ± 1%    ~     (p=0.421 n=5+5)
Int64ArrayScanBytes-10                   626ns ± 1%     624ns ± 1%    ~     (p=0.421 n=5+5)
Int64ArrayValue-10                       483ns ± 2%     485ns ± 2%    ~     (p=0.841 n=5+5)
Float32ArrayScanBytes-10                 941ns ± 1%     950ns ± 2%    ~     (p=0.246 n=5+5)
Float32ArrayValue-10                     654ns ± 0%     663ns ± 2%  +1.36%  (p=0.016 n=5+5)
Int32ArrayScanBytes-10                   623ns ± 1%     619ns ± 1%    ~     (p=0.246 n=5+5)
Int32ArrayValue-10                       328ns ± 1%     333ns ± 1%  +1.53%  (p=0.008 n=5+5)
StringArrayScanBytes-10                 1.36µs ±10%    1.33µs ± 1%    ~     (p=0.579 n=5+5)
StringArrayValue-10                     2.51µs ± 2%    2.58µs ± 9%    ~     (p=0.421 n=5+5)
GenericArrayScanScannerSliceBytes-10    2.62µs ± 1%    2.66µs ± 3%    ~     (p=0.095 n=5+5)
GenericArrayValueBools-10                642ns ± 1%     647ns ± 1%    ~     (p=0.151 n=5+5)
GenericArrayValueFloat64s-10            1.91µs ± 1%    1.89µs ± 1%    ~     (p=0.151 n=5+5)
GenericArrayValueInt64s-10              1.09µs ± 0%    1.11µs ± 1%  +1.26%  (p=0.024 n=5+5)
GenericArrayValueByteSlices-10          2.69µs ± 1%    2.71µs ± 2%    ~     (p=0.690 n=5+5)
GenericArrayValueStrings-10             2.88µs ± 1%    2.89µs ± 0%    ~     (p=0.206 n=5+5)
SelectString-10                         28.8µs ± 3%    28.6µs ± 1%    ~     (p=0.690 n=5+5)
SelectSeries-10                         52.1µs ± 2%    52.0µs ± 1%    ~     (p=1.000 n=5+5)
MockSelectString-10                      663ns ± 1%     665ns ± 2%    ~     (p=1.000 n=5+5)
MockSelectSeries-10                     7.12µs ± 1%    7.12µs ± 0%    ~     (p=0.690 n=5+5)
PreparedSelectString-10                 28.2µs ± 5%    27.0µs ± 1%  -4.47%  (p=0.008 n=5+5)
PreparedSelectSeries-10                 45.1µs ± 1%    44.7µs ± 1%  -0.86%  (p=0.032 n=5+5)
MockPreparedSelectString-10              336ns ± 1%     342ns ± 3%  +1.68%  (p=0.016 n=5+5)
MockPreparedSelectSeries-10             6.77µs ± 1%    6.79µs ± 0%    ~     (p=0.310 n=5+5)
EncodeInt64-10                          22.6ns ± 0%    22.7ns ± 1%    ~     (p=0.579 n=5+5)
EncodeFloat64-10                        65.0ns ± 2%    64.6ns ± 1%    ~     (p=0.548 n=5+5)
EncodeByteaHex-10                       78.5ns ± 2%    80.9ns ± 4%  +3.07%  (p=0.016 n=5+5)
EncodeByteaEscape-10                     125ns ± 1%     125ns ± 1%    ~     (p=0.508 n=5+5)
EncodeBool-10                           14.8ns ± 0%    14.9ns ± 1%    ~     (p=0.143 n=4+5)
EncodeTimestamptz-10                     264ns ± 1%     264ns ± 1%    ~     (p=0.690 n=5+5)
DecodeInt64-10                          32.5ns ± 1%    32.6ns ± 1%    ~     (p=0.548 n=5+5)
DecodeFloat64-10                        43.7ns ± 1%    44.0ns ± 2%    ~     (p=0.595 n=5+5)
DecodeBool-10                           2.56ns ± 0%    2.55ns ± 1%  -0.62%  (p=0.032 n=5+5)
DecodeTimestamptz-10                     146ns ± 0%     147ns ± 1%    ~     (p=0.063 n=5+5)
DecodeTimestamptzMultiThread-10          176ns ± 2%     179ns ± 3%    ~     (p=0.222 n=5+5)
LocationCache-10                        37.1ns ± 1%    36.7ns ± 1%  -1.18%  (p=0.008 n=5+5)
LocationCacheMultiThread-10              162ns ± 1%     160ns ± 1%  -1.11%  (p=0.008 n=5+5)
ResultParsing-10                        3.81ms ± 0%    3.81ms ± 0%    ~     (p=0.730 n=4+5)
_writeBuf_string-10                     1.58ns ± 1%    1.56ns ± 0%  -1.70%  (p=0.008 n=5+5)
CopyIn-10                                309ns ± 4%     308ns ± 2%    ~     (p=0.889 n=5+5)
AppendEscapedText-10                    2.29µs ± 1%    2.30µs ± 1%    ~     (p=0.548 n=5+5)
AppendEscapedTextNoEscape-10            1.00µs ± 0%    1.01µs ± 0%  +0.54%  (p=0.024 n=5+5)
DecodeUUIDBinary-10                     37.8ns ± 1%    38.0ns ± 1%    ~     (p=0.095 n=5+5)

The patch also appears to improve my rickover dequeue benchmark (github.com/kevinburke/rickover), which measures how fast I can get rows out of the database, although none of the differences below are statistically significant yet. I can try to get statistically significant results, but the mean number of allocations per operation drops from 160 to 155, and it's reasonable to assume that performance improves as well.

$ benchstat /tmp/old /tmp/new
name                 old time/op    new time/op    delta
Dequeue/Dequeue1-10    8.00ms ±10%    7.74ms ± 4%   ~     (p=0.421 n=5+5)

name                 old speed      new speed      delta
Dequeue/Dequeue1-10   0.00B/s        0.00B/s        ~     (all equal)

name                 old alloc/op   new alloc/op   delta
Dequeue/Dequeue1-10    12.3kB ±13%    12.1kB ± 2%   ~     (p=0.690 n=5+5)

name                 old allocs/op  new allocs/op  delta
Dequeue/Dequeue1-10       160 ±13%       155 ± 1%   ~     (p=1.000 n=5+5)