Open collisonchris opened 6 years ago
I am also getting these types of errors. What types of queries/statements do you execute to trigger these driver errors?
In my case it happens with very long-running statements (a call to apoc.refactor.mergeNodes), and I believe the connection times out, as the timeout is set to 1 minute by default. Still, after such an error I get these bad driver errors for the next queries too. I am not sure if it is also a problem with the server hanging on heavy calculations, as I am running Neo4j locally in a Docker container on my dual-core laptop.
Our scenario is quite different. We're ingesting messages off a message broker and using a persistent connection pool to send a merge to our Neo4j instance. Our Neo4j instance is a VM, and our app that uses the driver runs in a Docker container in a Kubernetes pod. Most of the time it's fine and processes messages smoothly, but after long idle periods these errors tend to pop up. I've been digging into the Bolt protocol details here: https://boltprotocol.org/v1/#examples
Basically it looks like at some point while issuing a query the connection has gone stale or fails in some way. Line 697 in conn.go is where the errors begin:
func (c *boltConn) sendRunPullAllConsumeAll(query string, args map[string]interface{}) (interface{}, interface{}, []interface{}, error) {
	err := c.sendRunPullAll(query, args)
	if err != nil {
		return nil, nil, nil, err
	}
	runSuccess, err := c.consume() // LINE 697
	if err != nil {
		return runSuccess, nil, nil, err
	}
	records, pullSuccess, err := c.consumeAll()
	return runSuccess, pullSuccess, records, err
}
If it made it that far, it has already sent the RUN and PULL_ALL messages and gotten responses successfully in the sendRunPullAll call; it appears to only fail when it goes to consume the RECORD messages from the server.
I'm going to turn on INFO-level logging in our environment and see if it gives any more details.
After digging into this more, it appears that connections do become stale after a sufficient idle period. I came up with a couple of different ideas for how to deal with the issue. The first is to check whether the underlying connection can still be read from before returning it. The existing code was doing a nil check on the connection before returning it, which wouldn't detect io.EOF errors or timed-out connections.
func connectionNilOrClosed(conn *boltConn) bool {
	if conn.conn == nil { // nil check before attempting read
		return true
	}
	conn.conn.SetReadDeadline(time.Now())
	zero := make([]byte, 0)
	// read zero bytes to validate connection is still alive
	_, err := conn.conn.Read(zero)
	if err != nil {
		// the error caught here could be an io.EOF or a timeout; either way we want to log the error & return true
		log.Error("Bad Connection state detected", err)
		return true
	}
	return false
}
This should take care of the driver: bad connection errors we've been seeing. PR for this change is here: #59
If this isn't successful, I will try to downcast the connection to a net.TCPConn and turn on keep-alive behavior. I tinkered with this locally but it didn't seem to work as I expected; it may be worth following up on if this doesn't work.
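For reference, a sketch of what that might look like, assuming conn.conn is the raw net.Conn field used in the snippet above (a TLS-wrapped connection would not satisfy the type assertion) and that the net and time packages are imported:

// Illustrative only, not driver code: type-assert the underlying net.Conn to a
// *net.TCPConn and enable OS-level keep-alive probes so an idle connection either
// stays open or fails fast instead of silently going stale.
if tcpConn, ok := conn.conn.(*net.TCPConn); ok {
	tcpConn.SetKeepAlive(true)
	tcpConn.SetKeepAlivePeriod(30 * time.Second) // interval is a guess; tune for your network/firewall timeouts
}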
I'm still having the same issue after this change.
Error: An error occurred getting result of exec command: <nil> Couldn't read expected bytes for message length. Read: 0 Expected: 2. Internal Error(*errors.errorString):driver: bad connection
I assume that reading zero bytes isn't enough to actually detect that the connection is bad, since it seems to fail on a read in consume right after that.
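That matches how Go handles zero-length reads: a Read with an empty buffer on a net.Conn returns 0, nil immediately without touching the socket, so it can never observe a peer close. A probe has to actually try to read a byte. A sketch of that alternative (illustrative helper, not driver code; assumes the net and time packages):

// Probe with a one-byte read under a short deadline. io.EOF (or a reset) means the
// peer closed the socket; a timeout means the connection is idle but still open; a
// byte actually arriving means unsolicited data, so the connection is unsafe to
// reuse either way.
func connectionDead(c net.Conn) bool {
	if c == nil {
		return true
	}
	c.SetReadDeadline(time.Now().Add(time.Millisecond))
	defer c.SetReadDeadline(time.Time{}) // clear the deadline so later reads aren't affected
	one := make([]byte, 1)
	if _, err := c.Read(one); err != nil {
		if netErr, ok := err.(net.Error); ok && netErr.Timeout() {
			return false // no data within the deadline: socket still open
		}
		return true // io.EOF, connection reset, etc.: the connection is gone
	}
	return true // data arrived while the connection should be idle; don't reuse it
}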
Yeah, this doesn't seem to fix the issue in our environment either. I believe it's a sync/cross-threading issue; I've been able to reproduce it with this simple POC: https://github.com/collisonchris/neo4j-test
The docs do call out that connection objects are not thread-safe and would require synchronized access on the user side to safely access a connection across goroutines.
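If a single connection does have to be shared, a sketch of that user-side synchronization could look like the following (safeConn is an illustrative wrapper, not part of the driver; bolt is assumed to alias github.com/johnnadratowski/golang-neo4j-bolt-driver and its Conn/Result interfaces):

// Serialize access so only one goroutine drives the RUN/PULL_ALL/consume exchange
// on the shared connection at a time.
type safeConn struct {
	mu   sync.Mutex
	conn bolt.Conn
}

func (s *safeConn) ExecNeo(query string, params map[string]interface{}) (bolt.Result, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.conn.ExecNeo(query, params)
}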
I would think that with OpenPool() you should be OK, since you should receive a new connection on each call to OpenPool(), even across goroutines. In your example you are using one Bolt connection per goroutine and closing it afterwards. Perhaps OpenPool() can return the same connection multiple times if you call it too quickly?
Yeah, this looks like a type of unintended concurrent access of the underlying Conns; multiple goroutines are hitting the same connection and executing queries in an interleaved fashion. The log output of the simple POC I posted seems to reproduce it:
Healthy cycle:
[BOLT][INFO]2018/10/30 02:54:40 Sending INIT Message. ClientID: GolangNeo4jBolt/1.0 User: neo4j Password: <scrubbed>
[BOLT][INFO]2018/10/30 02:54:40 Consuming response from bolt stream
[BOLT][INFO]2018/10/30 02:54:40 Successfully initiated Bolt connection: {Metadata:map[server:Neo4j/3.3.5]}
[BOLT][INFO]2018/10/30 02:54:40 Sending RUN message: query return 1 (args: map[string]interface {}(nil))
[BOLT][INFO]2018/10/30 02:54:40 Sending PULL_ALL message
[BOLT][INFO]2018/10/30 02:54:40 Consuming response from bolt stream
[BOLT][INFO]2018/10/30 02:54:40 Consuming all responses until success/failure
[BOLT][INFO]2018/10/30 02:54:40 Consuming response from bolt stream
[BOLT][INFO]2018/10/30 02:54:40 Consuming response from bolt stream
[BOLT][INFO]2018/10/30 02:54:40 Got success message: messages.SuccessMessage{Metadata:map[string]interface {}{"result_consumed_after":0, "type":"r"}}
[BOLT][INFO]2018/10/30 02:54:40 Got run success message: messages.SuccessMessage{Metadata:map[string]interface {}{"result_available_after":1, "fields":[]interface {}{"1"}}}
[BOLT][INFO]2018/10/30 02:54:40 Got discard all success message: messages.SuccessMessage{Metadata:map[string]interface {}{"result_consumed_after":0, "type":"r"}}
INFO[18000] healthy ping
Interleaved cycle:
[BOLT][INFO]2018/10/30 06:54:40 Sending INIT Message. ClientID: GolangNeo4jBolt/1.0 User: neo4j Password: <scrubbed>
[BOLT][INFO]2018/10/30 06:54:40 Consuming response from bolt stream
[BOLT][INFO]2018/10/30 06:54:40 Successfully initiated Bolt connection: {Metadata:map[server:Neo4j/3.3.5]}
[BOLT][INFO]2018/10/30 06:54:40 Sending RUN message: query return 1 (args: map[string]interface {}(nil))
[BOLT][INFO]2018/10/30 06:54:40 Sending RUN message: query return 1 (args: map[string]interface {}(nil))
[BOLT][INFO]2018/10/30 06:54:40 Sending RUN message: query return 1 (args: map[string]interface {}(nil))
[BOLT][INFO]2018/10/30 06:54:40 Sending RUN message: query return 1 (args: map[string]interface {}(nil))
[BOLT][INFO]2018/10/30 06:54:40 Sending PULL_ALL message
[BOLT][INFO]2018/10/30 06:54:40 Sending PULL_ALL message
[BOLT][INFO]2018/10/30 06:54:40 Sending PULL_ALL message
[BOLT][INFO]2018/10/30 06:54:40 Consuming response from bolt stream
[BOLT][INFO]2018/10/30 06:54:40 Sending PULL_ALL message
[BOLT][INFO]2018/10/30 06:54:40 Consuming response from bolt stream
[BOLT][INFO]2018/10/30 06:54:40 Consuming response from bolt stream
[BOLT][INFO]2018/10/30 06:54:40 Consuming response from bolt stream
[BOLT][INFO]2018/10/30 06:54:40 Consuming all responses until success/failure
[BOLT][INFO]2018/10/30 06:54:40 Consuming response from bolt stream
[BOLT][INFO]2018/10/30 06:54:40 Consuming response from bolt stream
[BOLT][INFO]2018/10/30 06:54:40 Got success message: messages.SuccessMessage{Metadata:map[string]interface {}{"type":"r", "result_consumed_after":0}}
[BOLT][INFO]2018/10/30 06:54:40 Got run success message: messages.SuccessMessage{Metadata:map[string]interface {}{"fields":[]interface {}{"1"}, "result_available_after":0}}
[BOLT][INFO]2018/10/30 06:54:40 Got discard all success message: messages.SuccessMessage{Metadata:map[string]interface {}{"result_consumed_after":0, "type":"r"}}
INFO[32400] healthy ping
[BOLT][INFO]2018/10/30 06:54:40 Sending RUN message: query return 1 (args: map[string]interface {}(nil))
[BOLT][INFO]2018/10/30 06:54:40 Consuming all responses until success/failure
[BOLT][INFO]2018/10/30 06:54:40 Consuming all responses until success/failure
[BOLT][INFO]2018/10/30 06:54:40 Consuming response from bolt stream
[BOLT][INFO]2018/10/30 06:54:40 Consuming response from bolt stream
[BOLT][INFO]2018/10/30 06:54:40 Consuming all responses until success/failure
[BOLT][INFO]2018/10/30 06:54:40 Consuming response from bolt stream
[BOLT][INFO]2018/10/30 06:54:40 Consuming response from bolt stream
[BOLT][INFO]2018/10/30 06:54:40 Consuming response from bolt stream
[BOLT][INFO]2018/10/30 06:54:40 Got success message: messages.SuccessMessage{Metadata:map[string]interface {}{"result_consumed_after":0, "type":"r"}}
[BOLT][INFO]2018/10/30 06:54:40 Consuming response from bolt stream
[BOLT][INFO]2018/10/30 06:54:40 Got run success message: messages.SuccessMessage{Metadata:map[string]interface {}{"result_available_after":0, "fields":[]interface {}{"1"}}}
[BOLT][INFO]2018/10/30 06:54:40 Got success message: messages.SuccessMessage{Metadata:map[string]interface {}{"result_consumed_after":0, "type":"r"}}
[BOLT][INFO]2018/10/30 06:54:40 Got discard all success message: messages.SuccessMessage{Metadata:map[string]interface {}{"result_consumed_after":0, "type":"r"}}
INFO[32400] healthy ping
[BOLT][INFO]2018/10/30 06:54:40 Got run success message: messages.SuccessMessage{Metadata:map[string]interface {}{"result_available_after":0, "fields":[]interface {}{"1"}}}
[BOLT][INFO]2018/10/30 06:54:40 Sending RUN message: query return 1 (args: map[string]interface {}(nil))
[BOLT][INFO]2018/10/30 06:54:40 Got success message: messages.SuccessMessage{Metadata:map[string]interface {}{"result_consumed_after":0, "type":"r"}}
[BOLT][INFO]2018/10/30 06:54:40 Got run success message: messages.SuccessMessage{Metadata:map[string]interface {}{"result_available_after":0, "fields":[]interface {}{"1"}}}
[BOLT][INFO]2018/10/30 06:54:40 Got discard all success message: messages.SuccessMessage{Metadata:map[string]interface {}{"result_consumed_after":0, "type":"r"}}
[BOLT][INFO]2018/10/30 06:54:40 Got discard all success message: messages.SuccessMessage{Metadata:map[string]interface {}{"result_consumed_after":0, "type":"r"}}
INFO[32400] healthy ping
[BOLT][INFO]2018/10/30 06:54:40 Sending PULL_ALL message
INFO[32400] healthy ping
[BOLT][INFO]2018/10/30 06:54:40 Sending PULL_ALL message
[BOLT][INFO]2018/10/30 06:54:40 Consuming response from bolt stream
[BOLT][INFO]2018/10/30 06:54:40 Consuming response from bolt stream
[BOLT][INFO]2018/10/30 06:54:40 Consuming all responses until success/failure
[BOLT][INFO]2018/10/30 06:54:40 Consuming response from bolt stream
[BOLT][INFO]2018/10/30 06:54:40 Consuming all responses until success/failure
[BOLT][INFO]2018/10/30 06:54:40 Consuming response from bolt stream
[BOLT][INFO]2018/10/30 06:54:40 Consuming response from bolt stream
[BOLT][INFO]2018/10/30 06:54:40 Consuming response from bolt stream
[BOLT][INFO]2018/10/30 06:54:40 Got success message: messages.SuccessMessage{Metadata:map[string]interface {}{"type":"r", "result_consumed_after":0}}
[BOLT][INFO]2018/10/30 06:54:40 Got run success message: messages.SuccessMessage{Metadata:map[string]interface {}{"result_available_after":1, "fields":[]interface {}{"1"}}}
[BOLT][INFO]2018/10/30 06:54:40 Got success message: messages.SuccessMessage{Metadata:map[string]interface {}{"result_consumed_after":1, "type":"r"}}
[BOLT][INFO]2018/10/30 06:54:40 Got discard all success message: messages.SuccessMessage{Metadata:map[string]interface {}{"type":"r", "result_consumed_after":0}}
[BOLT][INFO]2018/10/30 06:54:40 Got run success message: messages.SuccessMessage{Metadata:map[string]interface {}{"result_available_after":0, "fields":[]interface {}{"1"}}}
[BOLT][INFO]2018/10/30 06:54:40 Got discard all success message: messages.SuccessMessage{Metadata:map[string]interface {}{"result_consumed_after":1, "type":"r"}}
INFO[32400] healthy ping
INFO[32400] healthy ping
ERRO[32421] error with health ping
error="Couldn't read expected bytes for message length. Read: 0 Expected: 2.\nInternal Error(*errors.errorString):driver: bad connection\n\n Stack Trace:\n\ngoroutine 26 [running]:\nruntime/debug.Stack(0x12fc078, 0x46, 0xc4201e3c20)\n\t/usr/local/opt/go/libexec/src/runtime/debug/stack.go:24 +0xa7\ngithub.com/johnnadratowski/golang-neo4j-bolt-driver/errors.Wrap(0x1322b80, 0xc420096b00, 0x12fc078, 0x46, 0xc4201e3c20, 0x1, 0x1, 0xc420096b00)\n\t/Users/z0031yj/go/src/github.com/johnnadratowski/golang-neo4j-bolt-driver/errors/errors.go:38 +0x188\ngithub.com/johnnadratowski/golang-neo4j-bolt-driver/encoding.Decoder.read(0x1322ba0, 0xc4202a6b60, 0xc420378690, 0x1012008, 0x70, 0x12bca60)\n\t/Users/z0031yj/go/src/github.com/johnnadratowski/golang-neo4j-bolt-driver/encoding/decoder.go:45 +0x422\ngithub.com/johnnadratowski/golang-neo4j-bolt-driver/encoding.Decoder.Decode(0x1322ba0, 0xc4202a6b60, 0xc420378690, 0xc4202a6b60, 0xc420378620, 0xffff, 0x12a09e0)\n\t/Users/z0031yj/go/src/github.com/johnnadratowski/golang-neo4j-bolt-driver/encoding/decoder.go:93 +0x43\ngithub.com/johnnadratowski/golang-neo4j-bolt-driver.(*boltConn).consume(0xc4202a6b60, 0x12ebe79, 0x8, 0x0, 0x0)\n\t/Users/z0031yj/go/src/github.com/johnnadratowski/golang-neo4j-bolt-driver/conn.go:552 +0xb2\ngithub.com/johnnadratowski/golang-neo4j-bolt-driver.(*boltConn).sendRunPullAllConsumeAll(0xc4202a6b60, 0x12ebe79, 0x8, 0x0, 0xc42019a028, 0xc4201e3e30, 0x1029dc4, 0x13015d0, 0xc4201e3e40, 0xc420077080, ...)\n\t/Users/z0031yj/go/src/github.com/johnnadratowski/golang-neo4j-bolt-driver/conn.go:699 +0x6e\ngithub.com/johnnadratowski/golang-neo4j-bolt-driver.(*boltStmt).ExecNeo(0xc4201e3eb8, 0x0, 0xc4201e3eb8, 0xc4201e3ef8, 0x124c235, 0xc4200b0320)\n\t/Users/z0031yj/go/src/github.com/johnnadratowski/golang-neo4j-bolt-driver/stmt.go:99 +0x7c\ngithub.com/johnnadratowski/golang-neo4j-bolt-driver.(*boltConn).ExecNeo(0xc4202a6b60, 0x12ebe79, 0x8, 0x0, 0x0, 0x0, 0x0, 0x0)\n\t/Users/z0031yj/go/src/github.com/johnnadratowski/golang-neo4j-bolt-driver/conn.go:842 +0xcc\nmain.ping(0x13243c0, 0xc4200b0320)\n\t<scrubbed>/neo4j-test/main.go:57 +0x7b\nmain.aliveTask(0x13243c0, 0xc4200b0320)\n\t/<scrubbed>/neo4j-test/main.go:50 +0x7a\ncreated by main.main\n\t<scrubbed>/neo4j-test/main.go:41 +0xde\n"
You'll also see successful interleaved queries at times too; I'm not sure what underlying conditions are causing the issue.
Reading over the documentation, it looks like you are meant to close the drivers from the pool. I saw there is another CloseableDriverPool which is not thread-safe. Once I stopped closing the pool connections it stopped giving me issues, but I am also using a different server.
I've been using the driver in an application for about 5 months and have had very few issues. However, recently I've noticed I've been getting bad connection errors. They always seem to pop up from the decoder read() call. The error and stack trace look like this:
Error Message
Couldn't read expected bytes for message length. Read: 0 Expected: 2. Internal Error(*errors.errorString):driver: bad connection
Stack Trace
I am running in pooled connection mode with 5 connections currently. The code below is how I am building the pool and retrieving a connection object from the pool.
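(The original snippet wasn't preserved in this thread; the following is only a representative sketch of pooled usage, assuming bolt aliases github.com/johnnadratowski/golang-neo4j-bolt-driver and its NewDriverPool/OpenPool API.)

// Representative sketch, not the code from the original post.
pool, err := bolt.NewDriverPool("bolt://user:password@host:7687", 5) // pool of 5 connections
if err != nil {
	log.Fatal(err)
}

conn, err := pool.OpenPool() // borrow a connection from the pool
if err != nil {
	log.Fatal(err)
}
defer conn.Close()

if _, err := conn.ExecNeo("MERGE (n:Node {id: {id}})", map[string]interface{}{"id": 1}); err != nil {
	log.Fatal(err)
}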
The errors seem to happen in very short bursts and consistently come in groups of 5, always very close together (usually the same millisecond). I'm guessing this is all 5 of my connections in the pool being closed/reset? I don't see any errors on the Neo4j server side about closing connections, connection reset by peer, etc.
Any ideas as to what's going on here? Are there any configuration changes I could make that might help with this?