jschaf / pggen

Generate type-safe Go for any Postgres query. If Postgres can run the query, pggen can generate code for it.
MIT License
282 stars 26 forks

Question: Insert statement with multiple value rows. #40

Closed hebime closed 2 years ago

hebime commented 2 years ago

Hello,

Is it possible to generate an insert func that can insert multiple rows at once? Ex:

INSERT INTO user 
    (username, password, created_at) 
VALUES
    ('a', '123', NOW()),
    ('b', 'abc', NOW()),
    ('c', '456', NOW()),
    ('d', 'def', NOW());

I understand that there's a batch API available but that still executes N queries on the DB vs just one with multiple value rows. At that point, would you know if there's a difference on the DB side performance wise?

Thanks!

jschaf commented 2 years ago

would you know if there's a difference on the DB side performance wise?

It's almost certainly faster to insert multiple rows in a single insert, since you avoid all the transaction overhead, but I have no idea how much faster.

Using a variable number of VALUES expressions requires dynamic SQL, which is something I want but haven't figured out how to go about. However, there are a few options that work, as detailed below:

Schema

CREATE TABLE users (
  username   text,
  password   text,
  created_at timestamptz
);

Option 1: Insert many rows using unnest with arrays of columns

This is the approach I've been using. It seems like the most straightforward way to me.

query.sql

-- name: InsertMany :many
WITH
  input AS (
    SELECT username, password, created_at
    FROM unnest(
      pggen.arg('usernames')::text[],
      pggen.arg('passwords')::text[],
      pggen.arg('created_ats')::timestamptz[]
      ) t(username, password, created_at)
  )
INSERT INTO users (username, password, created_at)
SELECT username, password, created_at
FROM input
RETURNING *;

This generates:

type InsertManyParams struct {
    Usernames  []string
    Passwords  []string
    CreatedAts pgtype.TimestamptzArray
}

type InsertManyRow struct {
    Username  string    `json:"username"`
    Password  string    `json:"password"`
    CreatedAt time.Time `json:"created_at"`
}

// InsertMany implements Querier.InsertMany.
func (q *DBQuerier) InsertMany(ctx context.Context, params InsertManyParams) ([]InsertManyRow, error) {}

Option 2: Insert many rows using the table composite type

If you're inserting all the columns of a table, you can take advantage of the fact that each table is also a composite type. So for your user table, there's also a user composite type. I've renamed it to users since user is a reserved keyword and IntelliJ doesn't like it.

NOTE: this approach fails if you have any generated columns.

query.sql

-- name: InsertManyValues :many
INSERT INTO users
SELECT *
FROM unnest(pggen.arg('all_users')::users[])
RETURNING *;

Which generates:

// Users represents the Postgres composite type "users".
type Users struct {
    Username  string    `json:"username"`
    Password  string    `json:"password"`
    CreatedAt time.Time `json:"created_at"`
}

type InsertManyValuesRow struct {
    Username  string    `json:"username"`
    Password  string    `json:"password"`
    CreatedAt time.Time `json:"created_at"`
}

// InsertManyValues implements Querier.InsertManyValues.
func (q *DBQuerier) InsertManyValues(ctx context.Context, allUsers []Users) ([]InsertManyValuesRow, error) {}

Option 3: Insert many rows but only a subset of columns

Let's say created_at has a default value that we're happy with but we want to use a composite type instead of an array for each column. Our schema changes to:

CREATE TABLE users (
  username   text,
  password   text,
  created_at timestamptz DEFAULT now()
);

First, we need a composite type of the row we want to insert:

CREATE TYPE partial_user AS (
  username text,
  password text
);

query.sql

-- name: InsertPartial :many
INSERT INTO users (username, password)
SELECT *
FROM unnest(pggen.arg('partial_users')::partial_user[])
RETURNING *;

Which generates:

// PartialUser represents the Postgres composite type "partial_user".
type PartialUser struct {
    Username string `json:"username"`
    Password string `json:"password"`
}

type InsertPartialRow struct {
    Username  string    `json:"username"`
    Password  string    `json:"password"`
    CreatedAt time.Time `json:"created_at"`
}

// InsertPartial implements Querier.InsertPartial.
func (q *DBQuerier) InsertPartial(ctx context.Context, partialUsers []PartialUser) ([]InsertPartialRow, error) {}
jschaf commented 2 years ago

Out of curiosity, what are you using pggen for? It's fun to see what other folks use it for.

If you run into any problems let me know. I know option 1 will work but I haven't personally tried option 2 or 3 and I'm not sure if test coverage extends to passing in arrays of composite types as input.

hebime commented 2 years ago

Thanks for the detailed response. I think the options you laid out would work for my use case. I didn't even know about the table composite types so that's pretty cool.

It's almost certainly faster to insert multiple rows in a single insert since you avoid all the transaction overhead but I have no idea how much faster.

After I wrote the question, I remembered that the batch automatically starts a transaction so yes, there would definitely be overhead there. Still nice that the batch avoids all the network round trips after the first though. I could theoretically use a batch here since my query is in a transaction anyways, but this is good to know for cases where I don't need a transaction.

Out of curiosity, what are you using pggen for? It's fun to see what other folks use it for.

We've actually switched to using pggen to generate the boilerplate for our DAL in our code base. We used to write everything manually by hand but we found that your tool works very well. We don't like using ORMs so we appreciate that we can still write our queries and have all the tedious (and error prone) type scanning and query execution be handled by the generated code.

As for the dynamically generated SQL, I've written some functions in the past that have solved this specific use case when it comes to inserting multiple values at once. If you like, I can try to open a PR with a POC. Off the top of my head, here is how it could work:

For syntax maybe something like this:

-- name: InsertMany :exec
INSERT INTO users (username, password, created_at)
VALUES [](pggen.arg('username'), pggen.arg('password'), pggen.arg('created_at'));
-- or
VALUES pggen.array(pggen.arg('username'), pggen.arg('password'), pggen.arg('created_at'));
-- or
VALUES ...(pggen.arg('username'), pggen.arg('password'), pggen.arg('created_at'));

This would generate the following:

const numInsertManyParamFields = 3
const insertManySQL = `INSERT INTO users (username, password, created_at) VALUES %s;`

type InsertManyParams struct {
    Username  string
    Password  string
    CreatedAt pgtype.Timestamptz
}

func (p InsertManyParams) flattenValues() []interface{} {
    values := make([]interface{}, 0, numInsertManyParamFields)
    return append(values, p.Username, p.Password, p.CreatedAt)
}

func (q *DBQuerier) InsertMany(ctx context.Context, params []InsertManyParams) (pgconn.CommandTag, error) {
    // generate values SQL
    valueSQL := make([]string, 0, len(params))
    for row := 0; row < len(params); row++ {
        value := strings.Builder{}
        value.WriteString("(")
        for col := row*numInsertManyParamFields + 1; col <= (row+1)*numInsertManyParamFields; col++ {
            value.WriteString(fmt.Sprintf("$%d", col))
            if col != (row+1)*numInsertManyParamFields {
                value.WriteString(",")
            }
        }
        value.WriteString(")")
        valueSQL = append(valueSQL, value.String())
    }
    SQL := fmt.Sprintf(insertManySQL, strings.Join(valueSQL, ","))

    // generate values []interface{}
    values := make([]interface{}, 0, numInsertManyParamFields*len(params))
    for _, param := range params {
        values = append(values, param.flattenValues()...)
    }

    ctx = context.WithValue(ctx, "pggen_query_name", "InsertMany")
    cmdTag, err := q.conn.Exec(ctx, SQL, values...)
    if err != nil {
        return cmdTag, fmt.Errorf("exec query InsertMany: %w", err)
    }
    return cmdTag, err
}

Here is the implementation in a Go playground so you can see how it works; the exec func just prints out the generated SQL, and the pgtypes have all been replaced with base types for simplicity: https://play.golang.org/p/CcezMTKZQmh. Please keep in mind the valueSQL generation is a bit messy, but I'm sure it can be cleaned up. The second loop that builds the interface values could also be combined with the first, but this is more readable for demonstration purposes. This is just an example to show how it could potentially work.

One thing of note with this approach is that Postgres has a maximum number of parameters that can be bound in a prepared statement. I believe the max is 65535, because the protocol uses an Int16 for the number of parameters (see Bind: https://www.postgresql.org/docs/current/protocol-message-formats.html). A check can be added if necessary, since we know how many rows and fields are being passed in.
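Such a check could be sketched as a helper that splits the input into batches under the limit (hypothetical helper, not part of pggen):

```go
package main

import "fmt"

// maxBindParams is the wire-protocol ceiling: the Bind message carries
// the parameter count as a 16-bit integer.
const maxBindParams = 65535

// batchSizes splits totalRows into batches so that each batch binds at
// most maxBindParams parameters, given fieldsPerRow parameters per row.
// Hypothetical helper, not part of pggen.
func batchSizes(totalRows, fieldsPerRow int) []int {
	maxRows := maxBindParams / fieldsPerRow
	var sizes []int
	for totalRows > 0 {
		n := totalRows
		if n > maxRows {
			n = maxRows
		}
		sizes = append(sizes, n)
		totalRows -= n
	}
	return sizes
}

func main() {
	// 100,000 rows of 3 fields each: at most 65535/3 = 21845 rows per batch.
	fmt.Println(batchSizes(100000, 3))
}
```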

jschaf commented 2 years ago

I think the options you laid out would work for my use case. I didn't even know about the table composite types so that's pretty cool.

One of the fringe benefits of writing a Go-Postgres mapping is you learn the Postgres type system in gory detail.

We've actually switched to using pggen to generate the boilerplate for our DAL in our code base. We used to write everything manually by hand but we found that your tool works very well

Glad to hear it! That's pretty much why I built it. I like SQL but didn't like doing the mapping to Go.

As for the dynamically generated SQL, I've written some functions in the past that have solved this specific use case when it comes to inserting multiple values at once. If you like, I can try to open a PR with a POC. Off the top of my head, here is how it could work:

One thing I've tried to avoid is extending Postgres syntax for pggen. In our codebase, we declare pggen.arg as a function that returns any, which means it highlights correctly in IntelliJ (and presumably other editors).

-- This schema file exists solely so IntelliJ doesn't underline every
-- pggen.arg() expression in squiggly red.
CREATE SCHEMA pggen;

-- pggen.arg defines a named parameter that's eventually compiled into a
-- placeholder for a prepared query: $1, $2, etc.
CREATE FUNCTION pggen.arg(param text) RETURNS any AS '';

In general, I'm okay with requiring clients to use more cumbersome hacks like the options above for things that can be done in Postgres instead of pggen. SQL has a bunch of awkward constructs that look like values but can't be used in prepared queries, like the VALUES constructor and the x IN (1, 2, 3) expression, but there are workarounds for both. The rationale is to keep the number of code paths low and keep pggen relatively easy to hack on.
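For example, the usual workaround for a variable-length x IN (1, 2, 3) is to bind a single array parameter and compare with = ANY, which works fine as a prepared query. A sketch in the same style as the queries above:

```sql
-- name: FindUsers :many
-- Workaround for a variable-length IN list: bind one array parameter
-- and compare with = ANY instead of IN.
SELECT username, password, created_at
FROM users
WHERE username = ANY(pggen.arg('usernames')::text[]);
```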

For dynamic queries that you can't do in Postgres, I'm interested in adding support to pggen. I'll probably add something like pggen.predicate first and follow up with pggen.ident for dynamic columns and group-bys (group-bys are a can of worms though since changing the group-by can make the query invalid). If you're interested, I'll let you know when I do the design doc for either one.

jschaf commented 2 years ago

I started a tracking bug for dynamic predicates: https://github.com/jschaf/pggen/issues/41

jschaf commented 2 years ago

I'll close this since there's a reasonable workaround that doesn't require any extra syntax and focus on the dynamic predicates.