johnnadratowski / golang-neo4j-bolt-driver

Golang Bolt driver for Neo4j
MIT License

Bad connection driver error #57

Open collisonchris opened 6 years ago

collisonchris commented 6 years ago

I've been using the driver in an application for about 5 months and have had very few issues. Recently, however, I've started getting bad connection errors. The error always seems to pop up from the decoder's read() call. The error message is: Couldn't read expected bytes for message length. Read: 0 Expected: 2. Internal Error(*errors.errorString):driver: bad connection

Stack Trace:

goroutine 159 [running]:
runtime/debug.Stack(0x5557536b466f, 0x46, 0xc4204c54d0)
    /usr/lib/go/src/runtime/debug/stack.go:24 +0xa9
<my_package_structure>/vendor/github.com/johnnadratowski/golang-neo4j-bolt-driver/errors.Wrap(0x555753ad5120, 0xc4202204a0, 0x5557536b466f, 0x46, 0xc4204c54d0, 0x1, 0x1, 0xc42139f420)
    /go/src/<my_package_structure>/vendor/github.com/johnnadratowski/golang-neo4j-bolt-driver/errors/errors.go:38 +0x18a
<my_package_structure>/vendor/github.com/johnnadratowski/golang-neo4j-bolt-driver/encoding.Decoder.read(0x555753ad5440, 0xc42139f420, 0xc4201461c0, 0x5557531abada, 0x70, 0x555753a73b20)
    /go/src/<my_package_structure>/vendor/github.com/johnnadratowski/golang-neo4j-bolt-driver/encoding/decoder.go:45 +0x420
<my_package_structure>/vendor/github.com/johnnadratowski/golang-neo4j-bolt-driver/encoding.Decoder.Decode(0x555753ad5440, 0xc42139f420, 0xc4201461c0, 0xc42139f420, 0xc420146150, 0xffff, 0x555753a40d60)
    /go/src/<my_package_structure>/vendor/github.com/johnnadratowski/golang-neo4j-bolt-driver/encoding/decoder.go:93 +0x45
<my_package_structure>/vendor/github.com/johnnadratowski/golang-neo4j-bolt-driver.(*boltConn).consume(0xc42139f420, 0x5557536b731a, 0x4b8, 0xc4262329f0, 0x0)
    /go/src/<my_package_structure>/vendor/github.com/johnnadratowski/golang-neo4j-bolt-driver/conn.go:550 +0xb4
<my_package_structure>/vendor/github.com/johnnadratowski/golang-neo4j-bolt-driver.(*boltConn).sendRunPullAllConsumeAll(0xc42139f420, 0x5557536b731a, 0x4b8, 0xc4262329f0, 0x5557531abbbc, 0x480, 0x8, 0x10, 0xc42055e6b0, 0x0, ...)
    /go/src/<my_package_structure>/vendor/github.com/johnnadratowski/golang-neo4j-bolt-driver/conn.go:697 +0x70
<my_package_structure>/vendor/github.com/johnnadratowski/golang-neo4j-bolt-driver.(*boltStmt).ExecNeo(0xc42091ee40, 0xc4262329f0, 0x55575369a780, 0xf, 0xc4205fa898, 0x0)

I am running in pooled connection mode with 5 connections currently. The code below shows how I build the pool and retrieve a connection object from it.

var boltConnectionString = "<sanitized_connection_string>"

type NeoClient struct {
    Pool bolt.DriverPool
}

func NewNeo4jClient(hostname string, port string, user string, password string) (NeoClient, error) {
    pool, err := bolt.NewDriverPool(boltConnectionString, 5)
    boltLogs.SetLevel("error")
    return NeoClient{pool}, err
}

func getConnectionFromPool(pool bolt.DriverPool) bolt.Conn {
    conn, err := pool.OpenPool()
    if err != nil {
        platform.Log().WithError(err).Error("Unable to get connection from Pool!")
    }
    return conn
}

The errors happen in very short bursts, consistently in groups of 5, and always very close together (usually within the same millisecond). I'm guessing this is all 5 of the connections in my pool being closed/reset? I don't see any errors on the Neo4j server side about closing connections, connection reset by peer, etc.

Any ideas as to what's going on here? Are there any configuration changes I could make that might help with this?

markuszm commented 6 years ago

I am also getting these types of errors. What kinds of queries/statements do you execute to trigger them?

In my case it happens with very long-running statements (a call to apoc.refactor.mergeNodes), and I believe the connection times out, as the timeout is set to 1 minute by default. Still, after such an error, the next queries fail with these bad connection errors too. I am not sure whether the server hanging under heavy calculation is also a factor, since I am running Neo4j locally in a Docker container on my dual-core laptop.

collisonchris commented 6 years ago

Our scenario is quite different. We're ingesting messages off a message broker and using a persistent connection pool to send a merge to our Neo4j instance. Our Neo4j instance is a VM, and the app that uses the driver runs in a Docker container in a Kubernetes pod. Most of the time it processes messages smoothly, but after long idle periods these errors tend to pop up. I've been digging into the Bolt protocol details here: https://boltprotocol.org/v1/#examples

Basically, it looks like at some point while issuing the query the connection has gone stale or failed in some way. Line 697 in conn.go is where the errors begin:

func (c *boltConn) sendRunPullAllConsumeAll(query string, args map[string]interface{}) (interface{}, interface{}, []interface{}, error) {
    err := c.sendRunPullAll(query, args)
    if err != nil {
        return nil, nil, nil, err
    }

    runSuccess, err := c.consume()//LINE 697
    if err != nil {
        return runSuccess, nil, nil, err
    }

    records, pullSuccess, err := c.consumeAll()
    return runSuccess, pullSuccess, records, err
}

If it made it that far, it has already sent the RUN and PULL_ALL messages and gotten responses successfully in the sendRunPullAll call; it appears to fail only when it goes to consume the RECORD responses from the server.

I'm gonna turn on INFO level logging in our environment and see if it gives any more details.

collisonchris commented 6 years ago

After digging into this more, it appears that connections do become stale after a sufficient idle period. I came up with a couple of different ideas for dealing with the issue. The first is to check whether the underlying connection can be read from before returning it. The existing code only did a nil check on the connection before returning it, which wouldn't detect io.EOF errors or timed-out connections.

func connectionNilOrClosed(conn *boltConn) bool {
    if conn.conn == nil { // nil check before attempting read
        return true
    }
    conn.conn.SetReadDeadline(time.Now())
    zero := make([]byte, 0)
    // read zero bytes to validate the connection is still alive
    _, err := conn.conn.Read(zero)
    if err != nil {
        // the error here could be io.EOF or a timeout; either way,
        // log it and report the connection as bad
        log.Error("Bad Connection state detected", err)
        return true
    }
    return false
}

This should take care of the driver: bad connection errors we've been seeing. PR for this change is here: #59

If this isn't successful, I will try downcasting the connection to a net.TCPConn and turning on keep-alive behavior. I tinkered with this locally but it didn't work as I expected; it may be worth following up on if the read check doesn't pan out.

hhoughgg commented 5 years ago

I'm still having the same issue after this change.

Error: An error occurred getting result of exec command: <nil> Couldn't read expected bytes for message length. Read: 0 Expected: 2. Internal Error(*errors.errorString):driver: bad connection

I assume that reading zero bytes isn't enough to actually detect that the connection is bad, since it seems to fail on a read in consume right after that.

collisonchris commented 5 years ago

Yeah, this doesn't seem to fix the issue in our environment either. I believe it's a synchronization/cross-goroutine issue; I've been able to reproduce it with this simple POC: https://github.com/collisonchris/neo4j-test

The docs do call out that connection objects are not thread-safe and require synchronized access on the user's side to safely share a connection across goroutines.

hhoughgg commented 5 years ago

I would think that with OpenPool() you should be OK, since each call to OpenPool() should return a new connection, even across goroutines. In your example you are using one bolt connection per goroutine and closing it afterwards. Perhaps OpenPool() can return the same connection multiple times if you call it too quickly?

collisonchris commented 5 years ago

Yeah, this looks like unintended concurrent access to the underlying Conns: multiple goroutines are hitting the same connection and executing queries in an interleaved fashion. The log output of the simple POC I posted seems to reproduce it:

Healthy cycle:

[BOLT][INFO]2018/10/30 02:54:40 Sending INIT Message. ClientID: GolangNeo4jBolt/1.0 User: neo4j Password: <scrubbed>
[BOLT][INFO]2018/10/30 02:54:40 Consuming response from bolt stream
[BOLT][INFO]2018/10/30 02:54:40 Successfully initiated Bolt connection: {Metadata:map[server:Neo4j/3.3.5]}
[BOLT][INFO]2018/10/30 02:54:40 Sending RUN message: query return 1 (args: map[string]interface {}(nil))
[BOLT][INFO]2018/10/30 02:54:40 Sending PULL_ALL message
[BOLT][INFO]2018/10/30 02:54:40 Consuming response from bolt stream
[BOLT][INFO]2018/10/30 02:54:40 Consuming all responses until success/failure
[BOLT][INFO]2018/10/30 02:54:40 Consuming response from bolt stream
[BOLT][INFO]2018/10/30 02:54:40 Consuming response from bolt stream
[BOLT][INFO]2018/10/30 02:54:40 Got success message: messages.SuccessMessage{Metadata:map[string]interface {}{"result_consumed_after":0, "type":"r"}}
[BOLT][INFO]2018/10/30 02:54:40 Got run success message: messages.SuccessMessage{Metadata:map[string]interface {}{"result_available_after":1, "fields":[]interface {}{"1"}}}
[BOLT][INFO]2018/10/30 02:54:40 Got discard all success message: messages.SuccessMessage{Metadata:map[string]interface {}{"result_consumed_after":0, "type":"r"}}
INFO[18000] healthy ping

Interleaved cycle:

[BOLT][INFO]2018/10/30 06:54:40 Sending INIT Message. ClientID: GolangNeo4jBolt/1.0 User: neo4j Password: <scrubbed>
[BOLT][INFO]2018/10/30 06:54:40 Consuming response from bolt stream
[BOLT][INFO]2018/10/30 06:54:40 Successfully initiated Bolt connection: {Metadata:map[server:Neo4j/3.3.5]}
[BOLT][INFO]2018/10/30 06:54:40 Sending RUN message: query return 1 (args: map[string]interface {}(nil))
[BOLT][INFO]2018/10/30 06:54:40 Sending RUN message: query return 1 (args: map[string]interface {}(nil))
[BOLT][INFO]2018/10/30 06:54:40 Sending RUN message: query return 1 (args: map[string]interface {}(nil))
[BOLT][INFO]2018/10/30 06:54:40 Sending RUN message: query return 1 (args: map[string]interface {}(nil))
[BOLT][INFO]2018/10/30 06:54:40 Sending PULL_ALL message
[BOLT][INFO]2018/10/30 06:54:40 Sending PULL_ALL message
[BOLT][INFO]2018/10/30 06:54:40 Sending PULL_ALL message
[BOLT][INFO]2018/10/30 06:54:40 Consuming response from bolt stream
[BOLT][INFO]2018/10/30 06:54:40 Sending PULL_ALL message
[BOLT][INFO]2018/10/30 06:54:40 Consuming response from bolt stream
[BOLT][INFO]2018/10/30 06:54:40 Consuming response from bolt stream
[BOLT][INFO]2018/10/30 06:54:40 Consuming response from bolt stream
[BOLT][INFO]2018/10/30 06:54:40 Consuming all responses until success/failure
[BOLT][INFO]2018/10/30 06:54:40 Consuming response from bolt stream
[BOLT][INFO]2018/10/30 06:54:40 Consuming response from bolt stream
[BOLT][INFO]2018/10/30 06:54:40 Got success message: messages.SuccessMessage{Metadata:map[string]interface {}{"type":"r", "result_consumed_after":0}}
[BOLT][INFO]2018/10/30 06:54:40 Got run success message: messages.SuccessMessage{Metadata:map[string]interface {}{"fields":[]interface {}{"1"}, "result_available_after":0}}
[BOLT][INFO]2018/10/30 06:54:40 Got discard all success message: messages.SuccessMessage{Metadata:map[string]interface {}{"result_consumed_after":0, "type":"r"}}
INFO[32400] healthy ping
[BOLT][INFO]2018/10/30 06:54:40 Sending RUN message: query return 1 (args: map[string]interface {}(nil))
[BOLT][INFO]2018/10/30 06:54:40 Consuming all responses until success/failure
[BOLT][INFO]2018/10/30 06:54:40 Consuming all responses until success/failure
[BOLT][INFO]2018/10/30 06:54:40 Consuming response from bolt stream
[BOLT][INFO]2018/10/30 06:54:40 Consuming response from bolt stream
[BOLT][INFO]2018/10/30 06:54:40 Consuming all responses until success/failure
[BOLT][INFO]2018/10/30 06:54:40 Consuming response from bolt stream
[BOLT][INFO]2018/10/30 06:54:40 Consuming response from bolt stream
[BOLT][INFO]2018/10/30 06:54:40 Consuming response from bolt stream
[BOLT][INFO]2018/10/30 06:54:40 Got success message: messages.SuccessMessage{Metadata:map[string]interface {}{"result_consumed_after":0, "type":"r"}}
[BOLT][INFO]2018/10/30 06:54:40 Consuming response from bolt stream
[BOLT][INFO]2018/10/30 06:54:40 Got run success message: messages.SuccessMessage{Metadata:map[string]interface {}{"result_available_after":0, "fields":[]interface {}{"1"}}}
[BOLT][INFO]2018/10/30 06:54:40 Got success message: messages.SuccessMessage{Metadata:map[string]interface {}{"result_consumed_after":0, "type":"r"}}
[BOLT][INFO]2018/10/30 06:54:40 Got discard all success message: messages.SuccessMessage{Metadata:map[string]interface {}{"result_consumed_after":0, "type":"r"}}
INFO[32400] healthy ping
[BOLT][INFO]2018/10/30 06:54:40 Got run success message: messages.SuccessMessage{Metadata:map[string]interface {}{"result_available_after":0, "fields":[]interface {}{"1"}}}
[BOLT][INFO]2018/10/30 06:54:40 Sending RUN message: query return 1 (args: map[string]interface {}(nil))
[BOLT][INFO]2018/10/30 06:54:40 Got success message: messages.SuccessMessage{Metadata:map[string]interface {}{"result_consumed_after":0, "type":"r"}}
[BOLT][INFO]2018/10/30 06:54:40 Got run success message: messages.SuccessMessage{Metadata:map[string]interface {}{"result_available_after":0, "fields":[]interface {}{"1"}}}
[BOLT][INFO]2018/10/30 06:54:40 Got discard all success message: messages.SuccessMessage{Metadata:map[string]interface {}{"result_consumed_after":0, "type":"r"}}
[BOLT][INFO]2018/10/30 06:54:40 Got discard all success message: messages.SuccessMessage{Metadata:map[string]interface {}{"result_consumed_after":0, "type":"r"}}
INFO[32400] healthy ping
[BOLT][INFO]2018/10/30 06:54:40 Sending PULL_ALL message
INFO[32400] healthy ping
[BOLT][INFO]2018/10/30 06:54:40 Sending PULL_ALL message
[BOLT][INFO]2018/10/30 06:54:40 Consuming response from bolt stream
[BOLT][INFO]2018/10/30 06:54:40 Consuming response from bolt stream
[BOLT][INFO]2018/10/30 06:54:40 Consuming all responses until success/failure
[BOLT][INFO]2018/10/30 06:54:40 Consuming response from bolt stream
[BOLT][INFO]2018/10/30 06:54:40 Consuming all responses until success/failure
[BOLT][INFO]2018/10/30 06:54:40 Consuming response from bolt stream
[BOLT][INFO]2018/10/30 06:54:40 Consuming response from bolt stream
[BOLT][INFO]2018/10/30 06:54:40 Consuming response from bolt stream
[BOLT][INFO]2018/10/30 06:54:40 Got success message: messages.SuccessMessage{Metadata:map[string]interface {}{"type":"r", "result_consumed_after":0}}
[BOLT][INFO]2018/10/30 06:54:40 Got run success message: messages.SuccessMessage{Metadata:map[string]interface {}{"result_available_after":1, "fields":[]interface {}{"1"}}}
[BOLT][INFO]2018/10/30 06:54:40 Got success message: messages.SuccessMessage{Metadata:map[string]interface {}{"result_consumed_after":1, "type":"r"}}
[BOLT][INFO]2018/10/30 06:54:40 Got discard all success message: messages.SuccessMessage{Metadata:map[string]interface {}{"type":"r", "result_consumed_after":0}}
[BOLT][INFO]2018/10/30 06:54:40 Got run success message: messages.SuccessMessage{Metadata:map[string]interface {}{"result_available_after":0, "fields":[]interface {}{"1"}}}
[BOLT][INFO]2018/10/30 06:54:40 Got discard all success message: messages.SuccessMessage{Metadata:map[string]interface {}{"result_consumed_after":1, "type":"r"}}
INFO[32400] healthy ping
INFO[32400] healthy ping
ERRO[32421] error with health ping                        
error="Couldn't read expected bytes for message length. Read: 0 Expected: 2.\nInternal Error(*errors.errorString):driver: bad connection\n\n Stack Trace:\n\ngoroutine 26 [running]:\nruntime/debug.Stack(0x12fc078, 0x46, 0xc4201e3c20)\n\t/usr/local/opt/go/libexec/src/runtime/debug/stack.go:24 +0xa7\ngithub.com/johnnadratowski/golang-neo4j-bolt-driver/errors.Wrap(0x1322b80, 0xc420096b00, 0x12fc078, 0x46, 0xc4201e3c20, 0x1, 0x1, 0xc420096b00)\n\t/Users/z0031yj/go/src/github.com/johnnadratowski/golang-neo4j-bolt-driver/errors/errors.go:38 +0x188\ngithub.com/johnnadratowski/golang-neo4j-bolt-driver/encoding.Decoder.read(0x1322ba0, 0xc4202a6b60, 0xc420378690, 0x1012008, 0x70, 0x12bca60)\n\t/Users/z0031yj/go/src/github.com/johnnadratowski/golang-neo4j-bolt-driver/encoding/decoder.go:45 +0x422\ngithub.com/johnnadratowski/golang-neo4j-bolt-driver/encoding.Decoder.Decode(0x1322ba0, 0xc4202a6b60, 0xc420378690, 0xc4202a6b60, 0xc420378620, 0xffff, 0x12a09e0)\n\t/Users/z0031yj/go/src/github.com/johnnadratowski/golang-neo4j-bolt-driver/encoding/decoder.go:93 +0x43\ngithub.com/johnnadratowski/golang-neo4j-bolt-driver.(*boltConn).consume(0xc4202a6b60, 0x12ebe79, 0x8, 0x0, 0x0)\n\t/Users/z0031yj/go/src/github.com/johnnadratowski/golang-neo4j-bolt-driver/conn.go:552 +0xb2\ngithub.com/johnnadratowski/golang-neo4j-bolt-driver.(*boltConn).sendRunPullAllConsumeAll(0xc4202a6b60, 0x12ebe79, 0x8, 0x0, 0xc42019a028, 0xc4201e3e30, 0x1029dc4, 0x13015d0, 0xc4201e3e40, 0xc420077080, ...)\n\t/Users/z0031yj/go/src/github.com/johnnadratowski/golang-neo4j-bolt-driver/conn.go:699 +0x6e\ngithub.com/johnnadratowski/golang-neo4j-bolt-driver.(*boltStmt).ExecNeo(0xc4201e3eb8, 0x0, 0xc4201e3eb8, 0xc4201e3ef8, 0x124c235, 0xc4200b0320)\n\t/Users/z0031yj/go/src/github.com/johnnadratowski/golang-neo4j-bolt-driver/stmt.go:99 +0x7c\ngithub.com/johnnadratowski/golang-neo4j-bolt-driver.(*boltConn).ExecNeo(0xc4202a6b60, 0x12ebe79, 0x8, 0x0, 0x0, 0x0, 0x0, 
0x0)\n\t/Users/z0031yj/go/src/github.com/johnnadratowski/golang-neo4j-bolt-driver/conn.go:842 +0xcc\nmain.ping(0x13243c0, 0xc4200b0320)\n\t<scrubbed>/neo4j-test/main.go:57 +0x7b\nmain.aliveTask(0x13243c0, 0xc4200b0320)\n\t/<scrubbed>/neo4j-test/main.go:50 +0x7a\ncreated by main.main\n\t<scrubbed>/neo4j-test/main.go:41 +0xde\n"

You'll also see successful interleaved queries at times; I'm not sure what underlying conditions trigger the failures.

hhoughgg commented 5 years ago

Reading over the documentation, it looks like you are meant to close the connections you get from the pool. I saw there is another CloseableDriverPool, which is not thread-safe. Once I stopped closing the pooled connections, the issue went away for me, though I am also running against a different server.