jackc / pgx

PostgreSQL driver and toolkit for Go
MIT License

Should I call pgx.Identifier.Sanitize method explicitly when I build the sql string? #764

Open mrdulin opened 4 years ago

mrdulin commented 4 years ago

I wrote some functions for building SQL strings. Should I call the pgx.Identifier.Sanitize method explicitly to sanitize the identifiers, or can I just call strings.Join(cols, ",") and let pgx sanitize the column identifiers internally? I'd appreciate clarification, thanks. E.g.

sql.go:

package sql

import (
    "strconv"
    "strings"

    "github.com/jackc/pgx/v4"
)

func BuildInsertSQL(tableName string, cols []string, vals []interface{}) string {
    sqlStatement := "insert into " + Sanitize(tableName) + "(" + MulSanitize(cols) + ") values ("
    for i := 1; i <= len(vals); i++ {
        if i == len(vals) {
            sqlStatement = sqlStatement + "$" + strconv.Itoa(i) + ") returning *"
            break
        }
        if i%len(cols) != 0 {
            sqlStatement = sqlStatement + "$" + strconv.Itoa(i) + ","
        } else {
            sqlStatement = sqlStatement + "$" + strconv.Itoa(i) + "),("
        }
    }
    return sqlStatement
}

func Sanitize(ident string) string {
    return pgx.Identifier{ident}.Sanitize()
}

func MulSanitize(idents []string) string {
    ret := make([]string, 0, len(idents))
    for _, id := range idents {
        ret = append(ret, Sanitize(id))
    }
    return strings.Join(ret, ",")
}

Unit tests: sql_test.go:

func TestBuildInsertSQL(t *testing.T) {
    type arg struct {
        tableName string
        cols []string
        vals []interface{}
    }
    tests := []struct{
        name string
        arg arg
        want string
    } {
        {
            name: "should return valid sql string for inserting one row",
            arg: arg{
                tableName: "TEST",
                cols: []string{"id", "name", "email"},
                vals: []interface{}{1, "a", "e@gmail.com"},
            },
            want: "insert into \"TEST\"(\"id\",\"name\",\"email\") values ($1,$2,$3) returning *",
        },
        {
            name: "should return valid sql string for bulk inserting",
            arg: arg{
                tableName: "TEST",
                cols: []string{"id", "name", "email"},
                vals: []interface{}{1, "a", "e@gmail.com", 2, "b", "d@gmail.com"},
            },
            want: "insert into \"TEST\"(\"id\",\"name\",\"email\") values ($1,$2,$3),($4,$5,$6) returning *",
        },
    }

    for _, tt := range tests {
        t.Run(tt.name, func(t *testing.T) {
            got := sql.BuildInsertSQL(tt.arg.tableName, tt.arg.cols, tt.arg.vals)
            test.Equals(t, tt.want, got)
        })
    }
}

func TestSanitize(t *testing.T) {
    tests := []struct{
        name string
        arg string
        want string
    } {
        {"should return sanitized identifier", "USER", "\"USER\""},
    }
    for _, tt := range tests {
        t.Run(tt.name, func(t *testing.T) {
            got := sql.Sanitize(tt.arg)
            test.Equals(t, tt.want, got)
        })
    }
}

func TestMulSanitize(t *testing.T) {
    tests := []struct{
        name string
        arg []string
        want string
    } {
        {"should return sanitized identifier", []string{"id", "name"}, "\"id\",\"name\""},
    }
    for _, tt := range tests {
        t.Run(tt.name, func(t *testing.T) {
            got := sql.MulSanitize(tt.arg)
            test.Equals(t, tt.want, got)
        })
    }
}
jackc commented 4 years ago

Your SQL string should have placeholders like $1 and $2. You shouldn't be sanitizing values directly.

I've built a number of helpers for this type of thing. For a few examples, see https://github.com/jackc/pgxutil and https://github.com/jackc/pgsql.

In particular, since your example is a generic insert builder, check out: https://github.com/jackc/pgxutil/blob/c7a206b68e6233635504c92af4667c7ee8d43126/pgxutil.go#L585

You're welcome to import and use those libraries directly, but neither has had an official release, and either may be abandoned or changed without notice -- so you may prefer to clone whatever functionality you want, or use them as a guide to build your own helpers.

Zamiell commented 4 years ago

Hey Jack,

Is there ever a chance of bulk-insert functionality making its way into pgx officially? It feels like bulk insertion is probably a really common use case.

I myself started building a SQL string manually like mrdulin did in the OP after reading #82 (which implied that this kind of thing was impossible), so I am glad to have stumbled upon this thread! I took a look at pgxutil.Insert(), but it is a little unclear. Can you provide a quick example of using pgxutil.Insert() to bulk insert, say, 3 things? Thank you.

jackc commented 4 years ago

pgxutil.Insert() only inserts one row at a time, but pgsql.InsertStatement and pgsql.ValuesStatement can be used to build a multi-row insert statement. As I mentioned above, though, neither of these packages has an officially supported release. At this point they are basically my personal toolkit that I update or change as my own projects dictate, so it is better to think of them as example code to copy from rather than as libraries to rely on.

As far as pgx itself goes, if you need to efficiently insert multiple records you have two options. The fastest is to use CopyFrom(). The simplest is to use SendBatch() to send multiple queries at once.

And of course it is possible -- albeit a little awkward -- to build multi-row insert SQL and arguments manually. See https://github.com/jackc/tpr/blob/66a04f276a8478a6e5ba9747790b834b6b965efe/backend/data/item.go#L205 for an example of doing it the hard way.

Zamiell commented 4 years ago

This is what I came up with, maybe it will be useful to future readers of this thread:

// getBulkInsertSQL is a helper function to prepare a SQL query for a bulk insert.
//
// For example:
//
// SQLString = "INSERT INTO notes (thing_a, thing_b) VALUES %s"
// rowValueSQL  = "?, ?"
// numRows = 3
//
// Would be transformed into:
//
// INSERT INTO notes (thing_a, thing_b)
// VALUES
//     ($1, $2),
//     ($3, $4),
//     ($5, $6)
//
// Also see:
// https://stackoverflow.com/questions/12486436/how-do-i-batch-sql-statements-with-package-database-sql
func getBulkInsertSQL(SQLString string, rowValueSQL string, numRows int) string {
    // Combine the base SQL string and N value strings
    valueStrings := make([]string, 0, numRows)
    for i := 0; i < numRows; i++ {
        valueStrings = append(valueStrings, "("+rowValueSQL+")")
    }
    allValuesString := strings.Join(valueStrings, ",")
    SQLString = fmt.Sprintf(SQLString, allValuesString)

    // Convert all of the "?" to "$1", "$2", "$3", etc.
    // (which is the way that pgx expects query variables to be)
    numArgs := strings.Count(SQLString, "?")
    SQLString = strings.ReplaceAll(SQLString, "?", "$%v")
    numbers := make([]interface{}, 0, numArgs)
    for i := 1; i <= numArgs; i++ {
        numbers = append(numbers, strconv.Itoa(i))
    }
    return fmt.Sprintf(SQLString, numbers...)
}

// getBulkInsertSQLSimple is a helper function to prepare a SQL query for a bulk insert.
// getBulkInsertSQLSimple is used over getBulkInsertSQL when all of the values are plain question
// marks (e.g. a 1-for-1 value insertion).
// The example given for getBulkInsertSQL is such a query.
func getBulkInsertSQLSimple(SQLString string, numArgsPerRow int, numRows int) string {
    questionMarks := make([]string, 0, numArgsPerRow)
    for i := 0; i < numArgsPerRow; i++ {
        questionMarks = append(questionMarks, "?")
    }
    rowValueSQL := strings.Join(questionMarks, ", ")
    return getBulkInsertSQL(SQLString, rowValueSQL, numRows)
}

Here's an example of using the simple function:

func (*ChatLog) BulkInsert(chatLogRows []*ChatLogRow) error {
    SQLString := `
        INSERT INTO chat_log (user_id, message, room)
        VALUES %s
    `
    numArgsPerRow := 3
    valueArgs := make([]interface{}, 0, numArgsPerRow*len(chatLogRows))
    for _, chatLogRow := range chatLogRows {
        valueArgs = append(valueArgs, chatLogRow.UserID, chatLogRow.Message, chatLogRow.Room)
    }
    SQLString = getBulkInsertSQLSimple(SQLString, numArgsPerRow, len(chatLogRows))

    _, err := db.Exec(context.Background(), SQLString, valueArgs...)
    return err
}

And here's an example of using the non-simple function:

func (*GameParticipantNotes) BulkInsert(gameParticipantNotesRows []*GameParticipantNotesRow) error {
    SQLString := `
        INSERT INTO game_participant_notes (
            game_participant_id,
            card_order,
            note
        )
        VALUES %s
    `
    numArgsPerRow := 4 // two args feed the subquery, plus card_order and note
    valueArgs := make([]interface{}, 0, numArgsPerRow*len(gameParticipantNotesRows))
    for _, gameParticipantNotesRow := range gameParticipantNotesRows {
        valueArgs = append(
            valueArgs,
            gameParticipantNotesRow.GameID,
            gameParticipantNotesRow.UserID,
            gameParticipantNotesRow.CardOrder,
            gameParticipantNotesRow.Note,
        )
    }
    valueSQL := `
        (SELECT id FROM game_participants WHERE game_id = ? AND user_id = ?),
        ?,
        ?
    `
    SQLString = getBulkInsertSQL(SQLString, valueSQL, len(gameParticipantNotesRows))

    _, err := db.Exec(context.Background(), SQLString, valueArgs...)
    return err
}
holmbuar commented 3 years ago

@Zamiell amazing, your bulk insert code worked with only minor adjustments. Thanks!

mirecl commented 3 years ago

5x faster generation of the bulk SQL statement:

func getBulkInsertSQL(table string, columns []string, rowCount int) string {
    var b strings.Builder
    var cnt int

    columnCount := len(columns)

    b.Grow(40000) // rough preallocation; a tight bound could be computed from table, columns, and rowCount

    b.WriteString("INSERT INTO " + table + "(" + strings.Join(columns, ", ") + ") VALUES ")

    for i := 0; i < rowCount; i++ {
        b.WriteString("(")
        for j := 0; j < columnCount; j++ {
            cnt++
            b.WriteString("$")
            b.WriteString(strconv.Itoa(cnt))
            if j != columnCount-1 {
                b.WriteString(", ")
            }
        }
        b.WriteString(")")
        if i != rowCount-1 {
            b.WriteString(",")
        }
    }
    return b.String()
}
ashv1n commented 3 years ago

@mirecl is b.Reset() necessary? isn't gc supposed to clean b after function execution?

mirecl commented 3 years ago

@mirecl is b.Reset() necessary? isn't gc supposed to clean b after function execution?

You're absolutely right. Fixed the code

Zamiell commented 3 years ago

Faster 5x

This seems like a pretty misleading thing to say without showing actual benchmarks. The vast majority of the execution time seems likely to be on the actual database insertion, not the string building.

mirecl commented 3 years ago

Faster 5x

This seems like a pretty misleading thing to say without showing actual benchmarks. The vast majority of the execution time seems likely to be on the actual database insertion, not the string building.

I agree with you; I meant 5x faster only for generating the SQL string.

pczern commented 11 months ago

This is what I came up with, maybe it will be useful to future readers of this thread: [Zamiell's getBulkInsertSQL snippet, quoted in full above]

I've read that CopyFrom won't return autogenerated IDs. If I use a Batch of INSERT statements instead of one INSERT statement, how will that affect performance? Could it even hinder or slow down query optimization?

Is the provided snippet from @Zamiell the recommended way to do a bulk insert today?

jackc commented 11 months ago

My recommendation is to use CopyFrom whenever possible for bulk inserts. IDs will be auto-generated, but there is no way to return them to the client application.

If that's not possible then I recommend using a batch with separate insert statements for each row. This is usually easier to work with than a single statement that inserts multiple rows. A single statement can be a bit faster, but IMO it's not worth the hassle.

Here's a sample from the pgx benchmark suite:

BenchmarkWrite100RowsViaInsert-12                                            476       2528141 ns/op       19689 B/op        613 allocs/op
BenchmarkWrite100RowsViaMultiInsert-12                                      1888        617472 ns/op      285941 B/op       1573 allocs/op
BenchmarkWrite100RowsViaBatchInsert-12                                      1728        703070 ns/op      102517 B/op        530 allocs/op
BenchmarkWrite100RowsViaCopy-12                                             3253        372422 ns/op       58755 B/op        355 allocs/op