neo4j / neo4j-go-driver

Neo4j Bolt Driver for Go
Apache License 2.0
485 stars 68 forks source link

resetResponseHandler does not set an onIgnored handler #492

Closed timrulebosch closed 1 year ago

timrulebosch commented 1 year ago

The resetResponseHandler does not set an onIgnored callback. If that handler is called for a bolt message of type "ignored" then you get a SIGSEV as follows:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x5ec720]

goroutine 1 [running]:
github.com/neo4j/neo4j-go-driver/v5/neo4j/internal/bolt.(*messageQueue).receive(0xc0002020f8, {0x71e750, 0xc0000b6000})
        /home/trule/go/pkg/mod/github.com/neo4j/neo4j-go-driver/v5@v5.8.1/neo4j/internal/bolt/message_queue.go:166 +0x200
github.com/neo4j/neo4j-go-driver/v5/neo4j/internal/bolt.(*bolt4).ForceReset(0xc000202000, {0x71e750, 0xc0000b6000})
        /home/trule/go/pkg/mod/github.com/neo4j/neo4j-go-driver/v5@v5.8.1/neo4j/internal/bolt/bolt4.go:821 +0x1bc
github.com/neo4j/neo4j-go-driver/v5/neo4j/internal/bolt.(*bolt4).Reset(0xc000000140?, {0x71e750?, 0xc0000b6000?})
        /home/trule/go/pkg/mod/github.com/neo4j/neo4j-go-driver/v5@v5.8.1/neo4j/internal/bolt/bolt4.go:803 +0x7d
github.com/neo4j/neo4j-go-driver/v5/neo4j/internal/pool.(*Pool).Return(0xc000000140, {0x71e750, 0xc0000b6000}, {0x720e70?, 0xc000202000?})
        /home/trule/go/pkg/mod/github.com/neo4j/neo4j-go-driver/v5@v5.8.1/neo4j/internal/pool/pool.go:450 +0x2b2
github.com/neo4j/neo4j-go-driver/v5/neo4j.(*sessionWithContext).executeTransactionFunction.func1()
        /home/trule/go/pkg/mod/github.com/neo4j/neo4j-go-driver/v5@v5.8.1/neo4j/session_with_context.go:436 +0x36

The callback is thus (see onIgnored: nil):

(dlv) print callback
github.com/neo4j/neo4j-go-driver/v5/neo4j/internal/bolt.responseHandler {
    onSuccess: github.com/neo4j/neo4j-go-driver/v5/neo4j/internal/bolt.(*bolt4).resetResponseHandler.func1, 
    onRecord: nil, 
    onFailure: github.com/neo4j/neo4j-go-driver/v5/neo4j/internal/bolt.(*bolt4).resetResponseHandler.func2, 
    onUnknown: github.com/neo4j/neo4j-go-driver/v5/neo4j/internal/bolt.(*bolt4).resetResponseHandler.func3, 
    onIgnored: nil
}

Adding onIgnored: onIgnoredNoOp, to the resetResponseHandler() would likely fix the fault.

https://github.com/neo4j/neo4j-go-driver/blob/0eccfa3db544f6639151bf013cb08a3dde97fc70/neo4j/internal/bolt/bolt4.go#LL1078C1-L1091C2

The scenario behind this is fairly simple (and probably stupid) ... however, I'm new to Go ... based of the example, I basically run this function twice with the same parameters, the second time fails.

func addNodeFunc(ctx context.Context, labels []string, name string) neo4j.ManagedTransactionWork {
    return func(tx neo4j.ManagedTransaction) (any, error) {
        var b strings.Builder
        b.WriteString("CREATE (n:" + strings.Join(labels, ":") +" {name: $name}) ")
        b.WriteString("RETURN id(n) ")
        query := b.String()
        result, err := tx.Run(ctx, query, map[string]any{"name": name})
        if err != nil {
            return nil, err
        }
        return result.Consume(ctx)
    }
}
fbiville commented 1 year ago

@timrulebosch thanks for the report! I have not been to reproduce the issue just yet. What server version are you running?

Here are the steps I followed:

docker run --rm \
    --env NEO4J_AUTH='neo4j/letmein!' \
    --env NEO4J_ACCEPT_LICENSE_AGREEMENT=yes \
    --publish=7687:7687 \
    --health-cmd "cypher-shell -u neo4j -p 'letmein!' 'RETURN 1'" \
    --health-interval 5s \
    --health-timeout 5s \
    --health-retries 5 \
    neo4j:4.4-enterprise

And I ran the following program (with an invalid label):

package main

import (
    "context"
    "fmt"
    "github.com/neo4j/neo4j-go-driver/v5/neo4j"
    "strings"
)

func main() {
    ctx := context.Background()
    driver, err := neo4j.NewDriverWithContext("neo4j://localhost", neo4j.BasicAuth("neo4j", "letmein!", ""))
    if err != nil {
        panic(err)
    }
    defer dealWithIt(ctx, driver)
    session := driver.NewSession(ctx, neo4j.SessionConfig{})
    defer dealWithIt(ctx, session)
    for i := 0; i < 5; i++ {
        fmt.Printf("Call #%d\n", i+1)
        summary, err := session.ExecuteWrite(ctx, addNodeFunc(ctx, []string{"F o o", "Bar"}, "Jane Doe"))
        if err != nil {
            panic(err)
        }
        fmt.Printf("Node(s) created: %d\n", summary.(neo4j.ResultSummary).Counters().NodesCreated())
    }
}

func addNodeFunc(ctx context.Context, labels []string, name string) neo4j.ManagedTransactionWork {
    return func(tx neo4j.ManagedTransaction) (any, error) {
        var b strings.Builder
        b.WriteString("CREATE (n:" + strings.Join(labels, ":") + " {name: $name}) ")
        b.WriteString("RETURN id(n) ")
        query := b.String()
        result, err := tx.Run(ctx, query, map[string]any{"name": name})
        if err != nil {
            return nil, err
        }
        return result.Consume(ctx)
    }
}

func dealWithIt(ctx context.Context, closer interface{ Close(context.Context) error }) {
    if err := closer.Close(ctx); err != nil {
        panic(err)
    }
}

which gives me:

Call #1
panic: Neo4jError: Neo.ClientError.Statement.SyntaxError (Invalid input 'o': expected ")", "WHERE", "{" or a parameter (line 1, column 13 (offset: 12))
"CREATE (n:F o o:Bar {name: $name}) RETURN id(n)"
             ^)

goroutine 1 [running]:
main.main()

Would you be able to update the program and make it crash as in the initial report? Thanks a lot!

timrulebosch commented 1 year ago

Hi,

this was against the latest MemGraph container (memgraph/memgraph-platform). The query was valid, including labels. First run of the query passed, second run resulted in the exception. So, actually, the code was fine. If you don't have the problem, perhaps MemGraph is a factor? It really was as simple as:

CREATE (n:foo {name: "bar"}) RETURN id(n)

... but I included the other code incase that is somehow triggering the resetResponseHandler

However its clear that the resetResponseHandler() is not correctly specified. It does not specify the onIgnored function .... and then later when/if that "not specified" (nil) function is called, the exception is the result.

That seems like something which should be fixed regardless?

I will take a look at your code in the next days. Perhaps its really an interaction with MemGraph.

fbiville commented 1 year ago

According to the docs, the RESET message can never receive IGNORED messages: https://neo4j.com/docs/bolt/current/bolt/message/#messages-reset. The root issue is on Memgraph's side, I would say.

That being said, I will improve the driver for the next release to not immediately panic, but instead gracefully handle the error.

timrulebosch commented 1 year ago

Thanks!

If I understand it correctly, a RESET sent to the server might respond with either SUCCESS or FAILURE. The resetResponseHandler() is thus configured with those handlers (+ unknown). So its likely that MemGraph is responding (to the RESET) with an Ignore message. And that triggers an unfortunate sequence of events.

I will take a second look, and try to confirm that by tracing the BOLT protocol. Its probably as you suggest, a problem with the MemGraph implementation.

timrulebosch commented 1 year ago

I managed to get a BOLT trace, the C: RESET -> S: IGNORED sequence I guess causes this problem.

From my reading of the BOLT specification, an IGNORED message would be permissible ... but perhaps not expected in this case. https://neo4j.com/docs/bolt/current/bolt/message/#messages-ignored

Anyway, I should just fix my query! Thanks for the insight and assistance.

2023-05-23 10:47:12.821   BOLT  C: <MAGIC> 0X6060B017
2023-05-23 10:47:12.821   BOLT  C: <HANDSHAKE> 0X00020205 0X00020404 0X00000104 0X00000003
2023-05-23 10:47:12.822   BOLT  S: <HANDSHAKE> 0X00000304
2023-05-23 10:47:12.822   BOLT  C: HELLO {"credentials":"<redacted>","patch_bolt":["utc"],"principal":"","scheme":"basic","user_agent":"Go Driver/5.8.1"}
2023-05-23 10:47:12.822   BOLT  S: SUCCESS {"connection_id":"bolt-1"}
2023-05-23 10:47:12.822   BOLT  [bolt-1@localhost:7687] C: BEGIN {}
2023-05-23 10:47:12.823   BOLT  [bolt-1@localhost:7687] S: SUCCESS {}
2023-05-23 10:47:12.823   BOLT  [bolt-1@localhost:7687] C: RUN "CREATE (n:Model:FMU {name: $name}) RETURN id(n) AS id" {"name":"ESP10CP_sf"} null
2023-05-23 10:47:12.823   BOLT  [bolt-1@localhost:7687] C: PULL {"n":1000}
2023-05-23 10:47:12.824   BOLT  [bolt-1@localhost:7687] S: SUCCESS {"fields":["id"]}
2023-05-23 10:47:12.824   BOLT  [bolt-1@localhost:7687] S: RECORD [1446]
2023-05-23 10:47:12.824   BOLT  [bolt-1@localhost:7687] S: SUCCESS {}
2023-05-23 10:47:12.824   BOLT  [bolt-1@localhost:7687] C: COMMIT
2023-05-23 10:47:12.824   BOLT  [bolt-1@localhost:7687] S: SUCCESS {}
2023-05-23 10:47:12.824   BOLT  [bolt-1@localhost:7687] C: BEGIN {}
2023-05-23 10:47:12.824   BOLT  [bolt-1@localhost:7687] S: FAILURE {"code":"Memgraph.TransientError.MemgraphError.MemgraphError","message":"Unable to commit due to unique constraint violation on :Model(name)"}
2023-05-23 10:47:12.824   BOLT  [bolt-1@localhost:7687] C: RESET
2023-05-23 10:47:12.825   BOLT  [bolt-1@localhost:7687] S: IGNORED
2023-05-23 10:47:12.825   BOLT  [bolt-1@localhost:7687] C: GOODBYE
fbiville commented 1 year ago

Thanks for sharing that trace. It's interesting that the constraint violation shows up right after the 2nd BEGIN (and is classified as a transient error).

As I said earlier, IGNORED is not a valid response to RESET (see Valid Summary Messages): https://neo4j.com/docs/bolt/current/bolt/message/#messages-reset. Memgraph does not follow the spec here.