neo4j / neo4j-go-driver

Neo4j Bolt Driver for Go
Apache License 2.0

SessionWithContext.BeginTransaction on a new session fails with error "invalid state 3, expected: [0]" #487

Closed rlenglet closed 1 year ago

rlenglet commented 1 year ago

Calling SessionWithContext.BeginTransaction on a newly created session returns this error:

invalid state 3, expected: [0]

This is the assertion that fails: https://github.com/neo4j/neo4j-go-driver/blob/0f1f1e867c614b688394d50e8a9dff16c25f0569/neo4j/internal/bolt/bolt4.go#L303-L305

This session seems to be reusing a connection from the pool that was previously used to execute explicit transactions and was not reset properly when returned to the pool. The implementation of bolt4.TxBegin seems to deal specifically with auto-commit queries, but not with explicit transactions: https://github.com/neo4j/neo4j-go-driver/blob/0f1f1e867c614b688394d50e8a9dff16c25f0569/neo4j/internal/bolt/bolt4.go#L294-L301 I think this logic should be extended to deal with explicit transactions as well. Maybe replace:

    if b.state == bolt4_streaming {

with:

    if b.state == bolt4_streaming || b.state == bolt4_streamingtx {

Would that be the right fix?

I was previously using v5.0.0-preview, where this bug didn't occur. This is a regression.

Neo4j Version: 4.4.15-enterprise
Neo4j Mode: Single instance
Driver version: Go driver 5.8.x
Operating System: Ubuntu 16.10 on AWS

rlenglet commented 1 year ago

I tried changing if b.state == bolt4_streaming { into if b.state == bolt4_streaming || b.state == bolt4_streamingtx {. I no longer see the previous error, but now I observe that some sessions can be in yet another state in the pool ("transaction pending"):

invalid state 2, expected: [0]

I.e. after reading the pending explicit transaction's stream, the transaction is still pending.

Even if I try to force the state to change to bolt4_ready when in that state, the session is unusable. So this doesn't solve the root issue, which is that a connection with a pending transaction was returned to the pool.
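
For illustration, here is a tiny, self-contained model of the state machine being discussed. The state names echo bolt4's internal constants, but the numbering, the types, and the bufferStream stub are assumptions chosen only to match the errors quoted above; this is a sketch, not the driver's code.

    package main

    import "fmt"

    type state int

    // Hypothetical numbering, picked so the output matches the
    // "invalid state 2/3" messages seen in this issue.
    const (
        bolt4_ready       state = iota // 0
        bolt4_streaming                // 1: auto-commit query streaming
        bolt4_tx                       // 2: explicit transaction pending
        bolt4_streamingtx              // 3: streaming inside an explicit tx
    )

    type conn struct{ state state }

    // bufferStream stands in for the driver's helper that consumes any
    // outstanding result stream. Draining a stream inside an explicit
    // transaction still leaves that transaction open.
    func (c *conn) bufferStream() {
        switch c.state {
        case bolt4_streaming:
            c.state = bolt4_ready
        case bolt4_streamingtx:
            c.state = bolt4_tx
        }
    }

    // txBegin applies the extended guard proposed above.
    func (c *conn) txBegin() error {
        if c.state == bolt4_streaming || c.state == bolt4_streamingtx {
            c.bufferStream()
        }
        if c.state != bolt4_ready {
            return fmt.Errorf("invalid state %d, expected: [0]", c.state)
        }
        c.state = bolt4_tx
        return nil
    }

    func main() {
        // A connection pooled mid-transaction still fails, because the
        // drained stream leaves it in the pending-transaction state:
        c := &conn{state: bolt4_streamingtx}
        fmt.Println(c.txBegin()) // invalid state 2, expected: [0]
    }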

rlenglet commented 1 year ago

A cleaner way to implement this change would be to replace: https://github.com/neo4j/neo4j-go-driver/blob/0f1f1e867c614b688394d50e8a9dff16c25f0569/neo4j/internal/bolt/bolt4.go#L294-L301 with:

    // Makes all outstanding streams invalid
    b.discardAllStreams(ctx)

rlenglet commented 1 year ago

The bug seems to be that ForceReset no longer waits for the response to the reset message, which causes connections in the tx / streamingtx state to be returned to the pool.

The implementation of ForceReset now waits for any message, but no longer verifies that that message is the response to the reset: https://github.com/neo4j/neo4j-go-driver/blob/0f1f1e867c614b688394d50e8a9dff16c25f0569/neo4j/internal/bolt/bolt4.go#L817-L823

This is unlike 5.0.0-preview, where the implementation did wait explicitly for a reset response: https://github.com/neo4j/neo4j-go-driver/blob/ca573bc36a8c721cfd8e0fc236eb6b9aaeed5555/neo4j/internal/bolt/bolt4.go#L889-L907
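
To make the difference concrete, here is a minimal, self-contained model of the two behaviours; the message type and the in-memory inbox are assumptions for illustration, not the driver's actual Bolt message handling.

    package main

    import "fmt"

    type message struct{ kind string }

    // forceReset58 models the 5.8.x behaviour: wait for one message,
    // whatever it is.
    func forceReset58(inbox []message) bool {
        return len(inbox) > 0 && inbox[0].kind == "reset-ack"
    }

    // forceResetPreview models the 5.0.0-preview behaviour: drain
    // responses until the RESET acknowledgement itself is seen.
    func forceResetPreview(inbox []message) bool {
        for _, m := range inbox {
            if m.kind == "reset-ack" {
                return true
            }
        }
        return false
    }

    func main() {
        // A connection returned mid-transaction still has the BEGIN
        // response queued ahead of the RESET acknowledgement.
        inbox := []message{{kind: "begin-ack"}, {kind: "reset-ack"}}

        fmt.Println("5.8.x saw the reset ack:", forceReset58(inbox))        // false
        fmt.Println("preview saw the reset ack:", forceResetPreview(inbox)) // true
    }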

rlenglet commented 1 year ago

I added calls to Reset(ctx) on any connection returned by the Pool in Borrow() here: https://github.com/neo4j/neo4j-go-driver/blob/0f1f1e867c614b688394d50e8a9dff16c25f0569/neo4j/internal/pool/pool.go#L279 and in tryBorrow here: https://github.com/neo4j/neo4j-go-driver/blob/0f1f1e867c614b688394d50e8a9dff16c25f0569/neo4j/internal/pool/pool.go#L329

This now does have the intended effect of always resetting all connections used in new SessionWithContexts.

But now I see new errors in pending sessions, e.g. when calling Rollback on an existing ExplicitTransaction in another open session:

invalid transaction handle

I.e. creating a new session from the same driver/pool as another existing session resets the Bolt4 connection of that existing session.

The same Bolt4 connection seems to be shared by two different SessionWithContexts.

fbiville commented 1 year ago

Hello, thanks for the report.

Unfortunately, I cannot reproduce your initial scenario:

docker run --rm \
    --env NEO4J_AUTH='neo4j/letmein!' \
    --env NEO4J_ACCEPT_LICENSE_AGREEMENT=yes \
    --publish=7687:7687 \
    --health-cmd "cypher-shell -u neo4j -p 'letmein!' 'RETURN 1'" \
    --health-interval 5s \
    --health-timeout 5s \
    --health-retries 5 \
    neo4j:4.4.15-enterprise

package main

import (
    "context"
    "fmt"
    "github.com/neo4j/neo4j-go-driver/v5/neo4j"
)

func main() {
    ctx := context.Background()
    auth := neo4j.BasicAuth("neo4j", "letmein!", "")
    driver, err := neo4j.NewDriverWithContext("neo4j://localhost", auth)
    if err != nil {
        panic(err)
    }
    defer closeNeo4jResource(ctx, driver)
    session := driver.NewSession(ctx, neo4j.SessionConfig{})
    defer closeNeo4jResource(ctx, session)
    for i := 0; i < 5; i++ {
        tx, err := session.BeginTransaction(ctx)
        if err != nil {
            panic(err)
        }
        result, err := tx.Run(ctx, "CREATE (:Node)", nil)
        if err != nil {
            panic(err)
        }
        summary, err := result.Consume(ctx)
        if err != nil {
            panic(err)
        }
        fmt.Printf("Nodes created: %d\n", summary.Counters().NodesCreated())
        if err := tx.Commit(ctx); err != nil {
            panic(err)
        }
    }
}

func closeNeo4jResource(ctx context.Context, closeable interface{ Close(context.Context) error }) {
    if err := closeable.Close(ctx); err != nil {
        panic(err)
    }
}

Would you be able to adapt the above sample to create a reproducer?

fbiville commented 1 year ago

I could not reproduce the issue even when using goroutines:

package main

import (
    "context"
    "fmt"
    "github.com/neo4j/neo4j-go-driver/v5/neo4j"
    "sync"
)

func main() {
    ctx := context.Background()
    auth := neo4j.BasicAuth("neo4j", "letmein!", "")
    driver, err := neo4j.NewDriverWithContext("neo4j://localhost", auth)
    if err != nil {
        panic(err)
    }
    defer closeNeo4jResource(ctx, driver)
    session := driver.NewSession(ctx, neo4j.SessionConfig{})
    defer closeNeo4jResource(ctx, session)
    goRoutines := 5
    var group sync.WaitGroup
    group.Add(goRoutines)
    for i := 0; i < goRoutines; i++ {
        go func(i int) {
            defer group.Done()
            tx, err := session.BeginTransaction(ctx)
            if err != nil {
                panic(err)
            }
            result, err := tx.Run(ctx, "CREATE (:Node)", nil)
            if err != nil {
                panic(err)
            }
            summary, err := result.Consume(ctx)
            if err != nil {
                panic(err)
            }
            fmt.Printf("Nodes created in run %d: %d\n", i, summary.Counters().NodesCreated())
            if err := tx.Commit(ctx); err != nil {
                panic(err)
            }
        }(i)
    }
    group.Wait()
}

func closeNeo4jResource(ctx context.Context, closeable interface{ Close(context.Context) error }) {
    if err := closeable.Close(ctx); err != nil {
        panic(err)
    }
}

I will likely need more information to proceed.

rlenglet commented 1 year ago

I can reproduce the bug 100% reliably with one test suite in our non-public codebase. But it no longer triggers whenever I comment out any one test case, so I've not been able to reduce the reproducer yet.

The good news is that I can try out solutions and debug.

This doesn't seem to be a multi-threading issue. Our test suite uses a single goroutine. That one goroutine begins and commits/rolls back multiple explicit transactions concurrently, all against the same single Neo4j server.

rlenglet commented 1 year ago

Our particular test case that causes a failure does something like this in a single goroutine, using a single driver: it runs two sessions with overlapping explicit transactions.

Both sessions seem to share the same connection somehow. The driver's state already seems to be erroneous before this code/test case is executed.

The other code/test cases executed before this one run similar sequences of actions, and that seems to leave the driver/pool in an erroneous state.

At this point, I suspect that a Bolt4 connection is added twice into the idle queue.

fbiville commented 1 year ago

Thanks for the additional details. There are similar stress tests in the Go driver for this, but I'll see if I can reproduce the problem with the described setup.

fbiville commented 1 year ago

So, I updated the first sample to follow the scenario you described more closely:

package main

import (
    "context"
    "fmt"
    "github.com/neo4j/neo4j-go-driver/v5/neo4j"
)

func main() {
    ctx := context.Background()
    auth := neo4j.BasicAuth("neo4j", "letmein!", "")
    driver, err := neo4j.NewDriverWithContext("neo4j://localhost", auth)
    panicOnErr(err)
    defer closeNeo4jResource(ctx, driver)
    session1 := driver.NewSession(ctx, neo4j.SessionConfig{})
    tx1, err := session1.BeginTransaction(ctx)
    panicOnErr(err)
    result1, err := tx1.Run(ctx, "CREATE (:Node)", nil)
    panicOnErr(err)
    session2 := driver.NewSession(ctx, neo4j.SessionConfig{})
    tx2, err := session2.BeginTransaction(ctx)
    panicOnErr(err)
    result2, err := tx2.Run(ctx, "CREATE (:Node)", nil)
    panicOnErr(err)
    printNodesCreated(ctx, result2)
    panicOnErr(tx2.Commit(ctx))
    closeNeo4jResource(ctx, session2)
    printNodesCreated(ctx, result1)
    panicOnErr(tx1.Rollback(ctx))
    closeNeo4jResource(ctx, session1)
}

func printNodesCreated(ctx context.Context, result neo4j.ResultWithContext) {
    summary1, err := result.Consume(ctx)
    panicOnErr(err)
    fmt.Printf("Nodes created: %d\n", summary1.Counters().NodesCreated())
}

func closeNeo4jResource(ctx context.Context, closeable interface{ Close(context.Context) error }) {
    panicOnErr(closeable.Close(ctx))
}

func panicOnErr(err error) {
    if err != nil {
        panic(err)
    }
}

Unfortunately, I still cannot reproduce the issue. Could you please try to adapt this specific sample, see if you can reproduce the issue, and share the resulting program? Thanks a lot!

rlenglet commented 1 year ago

I did more debugging. At the point of failure, I observe that the driver's busy and idle queues contain a total of 3 connection entries, with bolt-0 appearing in both queues.

So bolt-0 is duplicated in busy and idle, which should never happen.

rlenglet commented 1 year ago

At other points, I see the server's idle queue containing bolt-1 once and bolt-0 twice.

rlenglet commented 1 year ago

Here's how to reproduce: Call Run on a transaction/session that is already closed. That causes the connection to be added again into the idle queue.

    session := driver.NewSession(ctx, neo4j.SessionConfig{
        DatabaseName: "neo4j",
        AccessMode:   neo4j.AccessModeRead,
    })

    tx, err := session.BeginTransaction(ctx)
    if err != nil {
        panic("failed to begin transaction: " + err.Error())
    }

    _, err = tx.Run(ctx, "MATCH (n) RETURN count(n) as nodeCount", nil)
    if err != nil {
        panic("failed to execute query: " + err.Error())
    }

    err = tx.Commit(ctx)
    if err != nil {
        panic("failed to commit transaction: " + err.Error())
    }

    err = session.Close(ctx)
    if err != nil {
        panic("failed to close session: " + err.Error())
    }

    // Execute the query again on the closed tx/session.
    // This returns an error as expected.
    //
    // ***** This causes the connection to be added incorrectly twice into the idle queue. *****
    //
    _, _ = tx.Run(ctx, "MATCH (n) RETURN count(n) as nodeCount", nil)

    // Start two new sessions and begin a transaction in each.
    // Those will share the same connection.
    session1 := driver.NewSession(ctx, neo4j.SessionConfig{
        DatabaseName: "neo4j",
        AccessMode:   neo4j.AccessModeRead,
    })

    session2 := driver.NewSession(ctx, neo4j.SessionConfig{
        DatabaseName: "neo4j",
        AccessMode:   neo4j.AccessModeRead,
    })

    tx1, err := session1.BeginTransaction(ctx)
    if err != nil {
        panic("failed to begin transaction on session1: " + err.Error())
    }

    // ***** This transaction incorrectly shares the same connection as tx1. *****
    tx2, err := session2.BeginTransaction(ctx)
    if err != nil {
        panic("failed to begin transaction on session2: " + err.Error())
    }

    _, err = tx1.Run(ctx, "MATCH (n) RETURN count(n) as nodeCount", nil)
    if err != nil {
        panic("failed to execute query in tx1: " + err.Error())
    }

    err = tx1.Rollback(ctx)
    if err != nil {
        panic("failed to commit tx1: " + err.Error())
    }

    err = session1.Close(ctx)
    if err != nil {
        panic("failed to close session1: " + err.Error())
    }

    // This unexpectedly fails with error "invalid transaction handle"
    // because the shared connection has been reset.
    _, err = tx2.Run(ctx, "MATCH (n) RETURN count(n) as nodeCount", nil)
    if err != nil {
        panic("failed to execute query in tx2: " + err.Error())
    }

rlenglet commented 1 year ago

The problem seems to be that the onClosed function set on each explicitTransaction by SessionWithContext unconditionally calls Pool.Return for the connection even when it has already been returned, and Return doesn't seem to be idempotent: https://github.com/neo4j/neo4j-go-driver/blob/0f1f1e867c614b688394d50e8a9dff16c25f0569/neo4j/session_with_context.go#L339-L345

The onClosed function is called on every error, including when explicitTransaction.Run fails: https://github.com/neo4j/neo4j-go-driver/blob/0f1f1e867c614b688394d50e8a9dff16c25f0569/neo4j/transaction_with_context.go#L70-L76
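
One way to make the session-level cleanup idempotent would be to clear the connection reference after the first return. This is only a sketch under assumed, simplified types, not the fix that eventually shipped (which hardened the pool instead, see below).

    package main

    import "fmt"

    type connection struct{ id string }

    type pool struct{ returns int }

    func (p *pool) Return(c *connection) { p.returns++ }

    // explicitTransaction is a simplified stand-in for the driver's type;
    // the nil-guard in onClosed is the hypothetical idempotence fix.
    type explicitTransaction struct {
        conn *connection
        pool *pool
    }

    func (tx *explicitTransaction) onClosed() {
        if tx.conn == nil {
            return // already returned to the pool; second call is a no-op
        }
        tx.pool.Return(tx.conn)
        tx.conn = nil
    }

    func main() {
        p := &pool{}
        tx := &explicitTransaction{conn: &connection{id: "bolt-0"}, pool: p}
        tx.onClosed() // normal close
        tx.onClosed() // e.g. triggered again by tx.Run failing after Close
        fmt.Println("pool returns:", p.returns) // 1, not 2
    }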

rlenglet commented 1 year ago

Pool.Return unconditionally calls server.returnBusy for the connection, and that is not idempotent. https://github.com/neo4j/neo4j-go-driver/blob/0f1f1e867c614b688394d50e8a9dff16c25f0569/neo4j/internal/pool/pool.go#L471

rlenglet commented 1 year ago

Rewriting server.returnBusy to check for duplicates and make it idempotent does fix this bug:

// Returns a busy connection, makes it idle
func (s *server) returnBusy(c db.Connection) {
    s.unregisterBusy(c)

    // Don't re-add the connection into idle if it is already idle.
    for e := s.idle.Front(); e != nil; e = e.Next() {
        if c == e.Value.(db.Connection) {
            return
        }
    }

    s.idle.PushFront(c)
}
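
The idempotence property can be sanity-checked with a self-contained toy version of this fix; the server type here is a simplified stand-in for the pool's per-server bookkeeping, not the driver's actual type.

    package main

    import (
        "container/list"
        "fmt"
    )

    type connection struct{ name string }

    type server struct{ idle *list.List }

    // returnBusy mirrors the duplicate check above: a connection already
    // present in the idle list is not pushed a second time.
    func (s *server) returnBusy(c *connection) {
        for e := s.idle.Front(); e != nil; e = e.Next() {
            if c == e.Value.(*connection) {
                return
            }
        }
        s.idle.PushFront(c)
    }

    func main() {
        s := &server{idle: list.New()}
        c := &connection{name: "bolt-0"}
        s.returnBusy(c)
        s.returnBusy(c) // double return, e.g. via tx.Run after session.Close
        fmt.Println("idle size:", s.idle.Len()) // 1 with the check, 2 without
    }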

rlenglet commented 1 year ago

The same probably needs to be done for Bolt3 and Bolt5.

fbiville commented 1 year ago

Thanks @rlenglet for the detailed analysis, we'll work on a fix today.

fbiville commented 1 year ago

@rlenglet 5.9.0 is now out with this fix: https://github.com/neo4j/neo4j-go-driver/wiki/5.x-changelog#v590

rlenglet commented 1 year ago

5.9.0 is now out with this fix

@fbiville awesome, thanks!