Closed Martin2112 closed 6 years ago
@RJPercival were you working on a fix or shall I take another look at it? We just saw it again.
I looked into go-sql-driver/mysql and database/sql and, from what I could tell, the rollback and read share a connection buffer but use different locks, which is the root of why this race occurs. I haven't looked at what we could do in Trillian to avoid reads and rollbacks happening concurrently though. The rollback occurs in a goroutine created in database/sql.go, so there's no controlling when that executes.
I can repro this race with a much simpler program, and (as @RJPercival says) it looks like it might be incorrect locking in the Golang database/sql
code. My guesses:
DB.begin()
kicks off a separate goroutine running Tx.awaitDone()
; this blocks for now.Query()
operation, which creates a DB.Rows
object; we start iterating with Next()
.database.Rows.Next()
method only holds a lock on the Rows.closemu
while invoking the underlying driver.Rows()
method (and using the transaction).Tx.awaitDone()
, which calls Tx.rollback()
, which invokes the driver's driver.Txi.rollback()
function under the Tx.dc *driverConn
lock.So the underlying driver transaction is used with 2 different locks -- the driverConn
lock and the Rows.closemu
lock -- hence the race. It's comparatively rare because it needs the context to get cancelled out from underneath the transaction.
Utterly speculative change would be something like the following:
diff --git a/src/database/sql/sql.go b/src/database/sql/sql.go
index c016681fca13..08b0a415a433 100644
--- a/src/database/sql/sql.go
+++ b/src/database/sql/sql.go
@@ -2146,7 +2146,9 @@ func (rs *Rows) nextLocked() (doClose, ok bool) {
if rs.lastcols == nil {
rs.lastcols = make([]driver.Value, len(rs.rowsi.Columns()))
}
- rs.lasterr = rs.rowsi.Next(rs.lastcols)
+ withLock(rs.dc, func() {
+ rs.lasterr = rs.rowsi.Next(rs.lastcols)
+ })
if rs.lasterr != nil {
// Close the connection if there is a driver error.
if rs.lasterr != io.EOF {
(but that might have lock ordering issues).
Recent golang commit 729685c1d1bb looks related but not exactly the same issue.
Smaller test program to repro the issue:
/*
mysql -u root -e 'DROP DATABASE IF EXISTS racetest;'
mysql -u root -e 'CREATE DATABASE racetest;'
mysql -u root -e "GRANT ALL ON racetest.* TO 'racetest'@'localhost' IDENTIFIED BY 'zaphod';"
mysql -u racetest -pzaphod -D racetest -e "CREATE TABLE IF NOT EXISTS MyTable( val1 VARCHAR(200), val2 BIGINT NOT NULL, val3 BIGINT NOT NULL, PRIMARY KEY(val1) );"
mysql -u racetest -pzaphod -D racetest -e "INSERT INTO MyTable VALUES ('one', 1, -1), ('two', 2, -2), ('three', 3, -3);"
go run -race ./dbtest.go --logtostderr
*/
package main
import (
"context"
"database/sql"
"flag"
"fmt"
"time"
_ "github.com/go-sql-driver/mysql"
"github.com/golang/glog"
)
var (
dbFlag = flag.String("db", "racetest:zaphod@tcp(127.0.0.1:3306)/racetest", "Database")
)
func main() {
flag.Parse()
ctx, cancel := context.WithCancel(context.Background())
db, err := sql.Open("mysql", *dbFlag)
if err != nil {
glog.Exitf("failed to Open(%q): %v", *dbFlag, err)
}
defer db.Close()
tx, err := db.BeginTx(ctx, nil)
if err != nil {
glog.Exitf("failed to BeginTx(): %v", err)
}
rows, err := tx.Query("SELECT val1, val2, val3 FROM MyTable")
if err != nil {
glog.Exitf("failed to Query(): %v", err)
}
count := 0
for rows.Next() {
count++
var val1 string
var val2, val3 int64
if err := rows.Scan(&val1, &val2, &val3); err != nil {
glog.Exitf("failed to Scan(): %v", err)
}
fmt.Printf("%s %d %d\n", val1, val2, val3)
if count == 1 {
glog.Errorf("cancel the context out from under the transaction!")
cancel()
}
time.Sleep(100 * time.Millisecond)
}
rows.Close()
tx.Commit()
}
cc @kardianos just in case he's interested
Thanks. I'll look into this.
The fix for this is being considered for inclusion in Go 1.9, otherwise we'll get it by Go 1.10 (https://github.com/golang/go/issues/21117).
Looks like this was fixed upstream.
Travis log here:
https://travis-ci.org/google/trillian/jobs/232450746
Extract of error below. Happens inside the Go driver but I wonder if there's a path through our code that doesn't clean up properly on error as it happens when a transaction subsequently tries to start. We should examine both pieces of code that were involved to see if there's anything suspicious.
Also watch to see if it reproduces.