Closed · rlenglet closed this 1 year ago
I tried changing `if b.state == bolt4_streaming {` into `if b.state == bolt4_streaming || b.state == bolt4_streamingtx {`. I no longer see the previous error, but now I observe that some sessions in the pool can be in yet another state ("transaction pending"):

```
invalid state 2, expected: [0]
```
I.e., after reading the pending explicit transaction's stream, the transaction is still pending. Even if I try to force the state to change to `bolt4_ready` when in that state, the session is unusable. So this doesn't solve the root issue, which is that a session with a pending transaction was returned into the pool.
A cleaner way to implement this change would be to replace https://github.com/neo4j/neo4j-go-driver/blob/0f1f1e867c614b688394d50e8a9dff16c25f0569/neo4j/internal/bolt/bolt4.go#L294-L301 with:

```go
// Makes all outstanding streams invalid
b.discardAllStreams(ctx)
```
The bug seems to be that `ForceReset` no longer waits for the response to the reset message, and that causes connections in the `tx`/`streamingtx` state to be returned into the pool. The implementation of `ForceReset` now waits for any message, but no longer verifies that that message is a response to the reset:
https://github.com/neo4j/neo4j-go-driver/blob/0f1f1e867c614b688394d50e8a9dff16c25f0569/neo4j/internal/bolt/bolt4.go#L817-L823

Unlike in 5.0.0-preview, where the implementation does wait explicitly for a reset response: https://github.com/neo4j/neo4j-go-driver/blob/ca573bc36a8c721cfd8e0fc236eb6b9aaeed5555/neo4j/internal/bolt/bolt4.go#L889-L907
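For illustration, here is a minimal, self-contained sketch of the difference (the `response` type and `waitForResetResponse` helper are hypothetical, not driver code): after a RESET is sent, stale responses for earlier messages may still be queued, so the reader must drain them until the reset's own response arrives, instead of accepting the first message seen.

```go
package main

import "fmt"

// response is a toy model of a message read from the connection.
type response struct {
	forReset bool // true if this answers the RESET message
}

// waitForResetResponse discards responses to earlier messages until the
// reset response arrives, returning how many stale responses were skipped.
// Accepting the first response instead (as described above for 5.8.x)
// would consume a stale one and leave the connection state wrong.
func waitForResetResponse(queued []response) (skipped int, ok bool) {
	for _, r := range queued {
		if r.forReset {
			return skipped, true
		}
		skipped++
	}
	return skipped, false
}

func main() {
	// Two stale responses (e.g. for a pending tx stream) precede the reset's.
	queue := []response{{false}, {false}, {true}}
	skipped, ok := waitForResetResponse(queue)
	fmt.Println(skipped, ok)
}
```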
I added calls to `Reset(ctx)` on any connection returned by the `Pool`, in `Borrow()` here:
https://github.com/neo4j/neo4j-go-driver/blob/0f1f1e867c614b688394d50e8a9dff16c25f0569/neo4j/internal/pool/pool.go#L279
and in `tryBorrow` here:
https://github.com/neo4j/neo4j-go-driver/blob/0f1f1e867c614b688394d50e8a9dff16c25f0569/neo4j/internal/pool/pool.go#L329

This now does have the intended effect of always resetting all connections used in new `SessionWithContext`s.
But now I see new errors in pending sessions, e.g. when calling `Rollback` on an existing `ExplicitTransaction` on another open session:

```
invalid transaction handle
```

I.e., creating a new session from the same driver/pool as another existing session resets the Bolt4 connection of that existing session. The same Bolt4 connection seems to be shared by two different `SessionWithContext`s.
Hello, thanks for the report.
Unfortunately, I cannot reproduce your initial scenario:
```shell
docker run --rm \
  --env NEO4J_AUTH='neo4j/letmein!' \
  --env NEO4J_ACCEPT_LICENSE_AGREEMENT=yes \
  --publish=7687:7687 \
  --health-cmd "cypher-shell -u neo4j -p 'letmein!' 'RETURN 1'" \
  --health-interval 5s \
  --health-timeout 5s \
  --health-retries 5 \
  neo4j:4.4.15-enterprise
```
```go
package main

import (
	"context"
	"fmt"

	"github.com/neo4j/neo4j-go-driver/v5/neo4j"
)

func main() {
	ctx := context.Background()
	auth := neo4j.BasicAuth("neo4j", "letmein!", "")
	driver, err := neo4j.NewDriverWithContext("neo4j://localhost", auth)
	if err != nil {
		panic(err)
	}
	defer closeNeo4jResource(ctx, driver)
	session := driver.NewSession(ctx, neo4j.SessionConfig{})
	defer closeNeo4jResource(ctx, session)
	for i := 0; i < 5; i++ {
		tx, err := session.BeginTransaction(ctx)
		if err != nil {
			panic(err)
		}
		result, err := tx.Run(ctx, "CREATE (:Node)", nil)
		if err != nil {
			panic(err)
		}
		summary, err := result.Consume(ctx)
		if err != nil {
			panic(err)
		}
		fmt.Printf("Nodes created: %d\n", summary.Counters().NodesCreated())
		if err := tx.Commit(ctx); err != nil {
			panic(err)
		}
	}
}

func closeNeo4jResource(ctx context.Context, closeable interface{ Close(context.Context) error }) {
	if err := closeable.Close(ctx); err != nil {
		panic(err)
	}
}
```
Would you be able to adapt the above sample to create a reproducer?
I could not reproduce the issue even when using goroutines:
```go
package main

import (
	"context"
	"fmt"
	"sync"

	"github.com/neo4j/neo4j-go-driver/v5/neo4j"
)

func main() {
	ctx := context.Background()
	auth := neo4j.BasicAuth("neo4j", "letmein!", "")
	driver, err := neo4j.NewDriverWithContext("neo4j://localhost", auth)
	if err != nil {
		panic(err)
	}
	defer closeNeo4jResource(ctx, driver)
	session := driver.NewSession(ctx, neo4j.SessionConfig{})
	defer closeNeo4jResource(ctx, session)
	goRoutines := 5
	var group sync.WaitGroup
	group.Add(goRoutines)
	for i := 0; i < goRoutines; i++ {
		go func(i int) {
			defer group.Done()
			tx, err := session.BeginTransaction(ctx)
			if err != nil {
				panic(err)
			}
			result, err := tx.Run(ctx, "CREATE (:Node)", nil)
			if err != nil {
				panic(err)
			}
			summary, err := result.Consume(ctx)
			if err != nil {
				panic(err)
			}
			fmt.Printf("Nodes created in run %d: %d\n", i, summary.Counters().NodesCreated())
			if err := tx.Commit(ctx); err != nil {
				panic(err)
			}
		}(i)
	}
	group.Wait()
}

func closeNeo4jResource(ctx context.Context, closeable interface{ Close(context.Context) error }) {
	if err := closeable.Close(ctx); err != nil {
		panic(err)
	}
}
```
I will likely need more information to proceed.
I can reproduce the bug 100% reliably with one test suite in our non-public codebase. But it doesn’t trigger anymore whenever I comment out any test case, so I’ve not been able to reduce the reproducer yet.
The good news is that I can try out solutions and debug.
This doesn’t seem to be a multi-threading issue. Our test suite uses a single goroutine. That one goroutine begins and commits/rolls back multiple explicit transactions concurrently, all against the same single Neo4j server.
Our particular test case that causes a failure does something like this in a single goroutine, using a single driver:
Both sessions seem to share the same connection somehow. The driver’s state seems to already be erroneous before this code/test case is executed. The code/test cases executed before this one perform similar sequences of actions, and that seems to leave the driver/pool in an erroneous state.
At this point, I suspect that a Bolt4 connection is added twice into the idle queue.
Thanks for the additional details. There are similar stress tests in the Go driver for this, but I'll see if I can reproduce the problem with the described setup.

So, I updated the first sample to follow the scenario you described more closely:
```go
package main

import (
	"context"
	"fmt"

	"github.com/neo4j/neo4j-go-driver/v5/neo4j"
)

func main() {
	ctx := context.Background()
	auth := neo4j.BasicAuth("neo4j", "letmein!", "")
	driver, err := neo4j.NewDriverWithContext("neo4j://localhost", auth)
	panicOnErr(err)
	defer closeNeo4jResource(ctx, driver)
	session1 := driver.NewSession(ctx, neo4j.SessionConfig{})
	tx1, err := session1.BeginTransaction(ctx)
	panicOnErr(err)
	result1, err := tx1.Run(ctx, "CREATE (:Node)", nil)
	panicOnErr(err)
	session2 := driver.NewSession(ctx, neo4j.SessionConfig{})
	tx2, err := session2.BeginTransaction(ctx)
	panicOnErr(err)
	result2, err := tx2.Run(ctx, "CREATE (:Node)", nil)
	panicOnErr(err)
	printNodesCreated(ctx, result2)
	panicOnErr(tx2.Commit(ctx))
	closeNeo4jResource(ctx, session2)
	printNodesCreated(ctx, result1)
	panicOnErr(tx1.Rollback(ctx))
	closeNeo4jResource(ctx, session1)
}

func printNodesCreated(ctx context.Context, result neo4j.ResultWithContext) {
	summary, err := result.Consume(ctx)
	panicOnErr(err)
	fmt.Printf("Nodes created: %d\n", summary.Counters().NodesCreated())
}

func closeNeo4jResource(ctx context.Context, closeable interface{ Close(context.Context) error }) {
	panicOnErr(closeable.Close(ctx))
}

func panicOnErr(err error) {
	if err != nil {
		panic(err)
	}
}
```
Unfortunately, I still cannot reproduce the issue. Could you please try to adapt this specific sample, see if you can reproduce the issue, and share the resulting program? Thanks a lot!
I did more debugging. At the point of failure, I observe that the driver's `busy` and `idle` queues contain a total of 3 connections:

- `busy` contains `bolt-0`
- `idle` contains `bolt-0` and `bolt-1`

So `bolt-0` is duplicated in `busy` and `idle`, which should never happen. At other points, I see the `idle` queue for the server containing `bolt-1` once and `bolt-0` twice.
Here's how to reproduce: call `Run` on a transaction/session that is already closed. That causes the connection to be added again into the `idle` queue.
```go
session := driver.NewSession(ctx, neo4j.SessionConfig{
	DatabaseName: "neo4j",
	AccessMode:   neo4j.AccessModeRead,
})
tx, err := session.BeginTransaction(ctx)
if err != nil {
	panic("failed to begin transaction: " + err.Error())
}
_, err = tx.Run(ctx, "MATCH (n) RETURN count(n) as nodeCount", nil)
if err != nil {
	panic("failed to execute query: " + err.Error())
}
err = tx.Commit(ctx)
if err != nil {
	panic("failed to commit transaction: " + err.Error())
}
err = session.Close(ctx)
if err != nil {
	panic("failed to close session: " + err.Error())
}

// Execute the query again on the closed tx/session.
// This returns an error as expected.
//
// ***** This causes the connection to be added incorrectly twice into the idle queue. *****
//
_, _ = tx.Run(ctx, "MATCH (n) RETURN count(n) as nodeCount", nil)

// Start two new sessions and begin a transaction in each.
// Those will share the same connection.
session1 := driver.NewSession(ctx, neo4j.SessionConfig{
	DatabaseName: "neo4j",
	AccessMode:   neo4j.AccessModeRead,
})
session2 := driver.NewSession(ctx, neo4j.SessionConfig{
	DatabaseName: "neo4j",
	AccessMode:   neo4j.AccessModeRead,
})
tx1, err := session1.BeginTransaction(ctx)
if err != nil {
	panic("failed to begin transaction on session1: " + err.Error())
}
// ***** This transaction incorrectly shares the same connection as tx1. *****
tx2, err := session2.BeginTransaction(ctx)
if err != nil {
	panic("failed to begin transaction on session2: " + err.Error())
}
_, err = tx1.Run(ctx, "MATCH (n) RETURN count(n) as nodeCount", nil)
if err != nil {
	panic("failed to execute query in tx1: " + err.Error())
}
err = tx1.Rollback(ctx)
if err != nil {
	panic("failed to roll back tx1: " + err.Error())
}
err = session1.Close(ctx)
if err != nil {
	panic("failed to close session1: " + err.Error())
}

// This unexpectedly fails with error "invalid transaction handle"
// because the shared connection has been reset.
_, err = tx2.Run(ctx, "MATCH (n) RETURN count(n) as nodeCount", nil)
if err != nil {
	panic("failed to execute query in tx2: " + err.Error())
}
```
The problem seems to be that the `onClosed` function set on each `explicitTransaction` by `SessionWithContext` unconditionally calls `Pool.Return` for the connection, even when it has already been returned, and `Return` doesn't seem to be idempotent:
https://github.com/neo4j/neo4j-go-driver/blob/0f1f1e867c614b688394d50e8a9dff16c25f0569/neo4j/session_with_context.go#L339-L345

The `onClosed` function is called on every error, incl. when `explicitTransaction.Run` fails:
https://github.com/neo4j/neo4j-go-driver/blob/0f1f1e867c614b688394d50e8a9dff16c25f0569/neo4j/transaction_with_context.go#L70-L76

`Pool.Return` unconditionally calls `server.returnBusy` for the connection, and that is not idempotent:
https://github.com/neo4j/neo4j-go-driver/blob/0f1f1e867c614b688394d50e8a9dff16c25f0569/neo4j/internal/pool/pool.go#L471
Rewriting `server.returnBusy` to check for duplicates and make it idempotent does fix this bug:

```go
// Returns a busy connection, makes it idle
func (s *server) returnBusy(c db.Connection) {
	s.unregisterBusy(c)
	// Don't re-add the connection into idle if it is already idle.
	for e := s.idle.Front(); e != nil; e = e.Next() {
		if c == e.Value.(db.Connection) {
			return
		}
	}
	s.idle.PushFront(c)
}
```
The same probably needs to be done for Bolt3 and Bolt5.
Thanks @rlenglet for the detailed analysis, we'll work on a fix today.
@rlenglet 5.9.0 is now out with this fix: https://github.com/neo4j/neo4j-go-driver/wiki/5.x-changelog#v590
> 5.9.0 is now out with this fix
@fbiville awesome, thanks!
Calling `SessionWithContext.BeginTransaction` on a newly created session returns an error. This is the assertion that fails: https://github.com/neo4j/neo4j-go-driver/blob/0f1f1e867c614b688394d50e8a9dff16c25f0569/neo4j/internal/bolt/bolt4.go#L303-L305

This session seems to be reused from the pool after previously being used to execute explicit transactions, and not to have been reset properly when returned into the pool. The implementation of `bolt4.TxBegin` seems to deal specifically with auto-commit queries, but not with explicit transactions: https://github.com/neo4j/neo4j-go-driver/blob/0f1f1e867c614b688394d50e8a9dff16c25f0569/neo4j/internal/bolt/bolt4.go#L294-L301 I think this logic should be extended to deal with explicit transactions as well? Maybe replace the linked code with:

```go
// Makes all outstanding streams invalid
b.discardAllStreams(ctx)
```

I was previously using `v5.0.0-preview`, where this bug didn't occur. This is a regression.

Neo4j Version: 4.4.15-enterprise
Neo4j Mode: Single instance
Driver version: Go driver 5.8.x
Operating System: Ubuntu 16.10 on AWS