Sometimes failed cluster causes sql driver panic

axllent commented 7 months ago

Hi @otoolep! I've been doing some cluster testing in my local network using three computers running rqlited, and a single instance of Mailpit. While ingesting a constant stream of incoming emails via Mailpit (thus writing as fast as rqlite can handle) I have been randomly dropping out one of the three rqlite nodes to test fault tolerance / recovery etc. rqlited seems to handle this very elegantly, but every now and then I get a panic from the gorqlite sql driver panic: runtime error: index out of range [0] with length 0:

ERRO[2024/04/14 16:04:21] [db] error storing message: tried all peers unsuccessfully. here are the results:
   peer #0: http://192.168.7.2:4001/db/execute?timings&level=weak&transaction failed due to Post "http://192.168.7.2:4001/db/execute?timings&level=weak&transaction": dial tcp 192.168.7.2:4001: connect: connection refused
   peer #1: http://192.168.7.50:4001/db/execute?timings&level=weak&transaction failed, got: 503 Service Unavailable, message: leader not found

   peer #2: http://192.168.7.1:4001/db/execute?timings&level=weak&transaction failed, got: 503 Service Unavailable, message: leader not found

panic: runtime error: index out of range [0] with length 0

goroutine 376180 [running]:
github.com/rqlite/gorqlite.(*Connection).WriteOneParameterizedContext(...)
        /home/ralph/.cache/go/pkg/mod/github.com/rqlite/gorqlite@v0.0.0-20240227123050-397b03f02418/write.go:79
github.com/rqlite/gorqlite/stdlib.(*Stmt).ExecContext(0xc008ce1410, {0x108d1b8, 0x1745480}, {0xc000e79860, 0xa, 0x1083460?})
        /home/ralph/.cache/go/pkg/mod/github.com/rqlite/gorqlite@v0.0.0-20240227123050-397b03f02418/stdlib/sql.go:96 +0x3f8
database/sql.ctxDriverStmtExec({0x108d1b8, 0x1745480}, {0x108d730, 0xc008ce1410}, {0xc000e79860, 0xa, 0x108cb80?})
        /usr/lib/go-1.22/src/database/sql/ctxutil.go:65 +0xa3
database/sql.resultFromStatement({0x108d1b8, 0x1745480}, {0x108cb80, 0xc00011c518}, 0xc000cf5438, {0xc000cf58b0, 0xa, 0xa})
        /usr/lib/go-1.22/src/database/sql/sql.go:2670 +0x13a
database/sql.(*DB).execDC(0x0?, {0x108d1b8, 0x1745480}, 0xc0001cc750, 0x10?, {0xc0043c1ae0, 0x99}, {0xc000cf58b0, 0xa, 0xa})
        /usr/lib/go-1.22/src/database/sql/sql.go:1722 +0x42c
database/sql.(*Tx).ExecContext(0xc007d56a80, {0x108d1b8, 0x1745480}, {0xc0043c1ae0, 0x99}, {0xc000cf58b0, 0xa, 0xa})
        /usr/lib/go-1.22/src/database/sql/sql.go:2506 +0xad
database/sql.(*Tx).Exec(...)
        /usr/lib/go-1.22/src/database/sql/sql.go:2515
...

exit status 2

I've tried to find the cause of the panic and I can clearly see where it is happening, just not why it is happening. WriteOne(), WriteOneContext(), WriteOneParameterized() and WriteOneParameterizedContext() are all potentially affected by this too as they all return wra[0], err (eg: here). If wra[0] does not exist (ie: empty slice) you will get a panic. I just can't work out what is causing wra to be empty as your error handling appears to append the error as a result. The only thing I can think of is if the final:

 else {
    return results, nil
}

... if results has no results... Hopefully this means more to you than me?

Ensuring the wra[0] exists before trying to return it is obviously the safest solution (though maybe not so elegant), and I do not know what the consequences of that are "down the food chain" (you're way more familiar with the code than I am).

otoolep commented 7 months ago

What I would guess Is happening is that this client library is not handling a leader election properly. When you remove a node randomly, and that node is the Leader, a new leader will be elected. That takes only a second or so, but doing a write during that time may result in the node responding to the kind of response you see above ("Leader not found").

What do you think the client library should do? Retry the write is probably the right thing to do, since it's a transient issue.

axllent commented 7 months ago

It sounds like you may be on to it - that brief time while nodes are electing a new leader. It seems to be there are two separate (but related) things here, how to deal with queries during the brief time during a leader election, and the fact that a populated return slice (wra[0]) is seemingly not guaranteed under such (and possibly other?) conditions.

I'm probably not qualified to answer the first - but a retry would be the most elegant (as long as it does not result in an endless loop of retries). As for the panic though, that should never happen as it literally kills the application through no fault of its own.

otoolep commented 7 months ago

OK, let me see if I can fix up the library to handle this situation better.

rqlite / gorqlite

Sometimes failed cluster causes sql driver panic #41