Open axllent opened 7 months ago
What I would guess is happening is that this client library is not handling a leader election properly. When you remove a node randomly, and that node is the Leader, a new leader will be elected. That takes only a second or so, but doing a write during that time may result in the node responding with the kind of response you see above ("Leader not found").
What do you think the client library should do? Retrying the write is probably the right thing to do, since it's a transient issue.
It sounds like you may be on to it - that brief time while nodes are electing a new leader. It seems to me there are two separate (but related) things here: how to deal with queries during the brief time during a leader election, and the fact that a populated return slice (wra[0]) is seemingly not guaranteed under such (and possibly other?) conditions.
I'm probably not qualified to answer the first - but a retry would be the most elegant (as long as it does not result in an endless loop of retries). As for the panic though, that should never happen as it literally kills the application through no fault of its own.
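A bounded retry of the kind discussed here could look roughly like the sketch below. It is only an illustration, not gorqlite code: errLeaderNotFound, retryWrite and the write callback are hypothetical names, and the sketch assumes the transient election failure can be recognised as a distinct error value. The attempt cap is what prevents an endless loop of retries.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errLeaderNotFound is a hypothetical stand-in for the transient
// "Leader not found" response seen during a leader election.
var errLeaderNotFound = errors.New("leader not found")

// retryWrite calls write at most maxAttempts times. Only errLeaderNotFound
// is treated as retryable; any other error (or success) returns immediately.
// The delay between attempts doubles each time.
func retryWrite(maxAttempts int, delay time.Duration, write func() error) error {
	if maxAttempts < 1 {
		maxAttempts = 1
	}
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		err = write()
		if err == nil || !errors.Is(err, errLeaderNotFound) {
			return err
		}
		if attempt < maxAttempts {
			time.Sleep(delay)
			delay *= 2
		}
	}
	// All attempts hit the transient error: give up instead of looping forever.
	return fmt.Errorf("write failed after %d attempts: %w", maxAttempts, err)
}

func main() {
	calls := 0
	// Simulated write that fails twice while an election is in progress.
	err := retryWrite(5, 10*time.Millisecond, func() error {
		calls++
		if calls < 3 {
			return errLeaderNotFound
		}
		return nil
	})
	fmt.Println("calls:", calls, "err:", err)
}
```

Since an election reportedly takes only a second or so, a handful of attempts with a short, growing delay would probably cover it; the exact numbers here are guesses.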
OK, let me see if I can fix up the library to handle this situation better.
Hi @otoolep! I've been doing some cluster testing in my local network using three computers running rqlited, and a single instance of Mailpit. While ingesting a constant stream of incoming emails via Mailpit (thus writing as fast as rqlite can handle) I have been randomly dropping out one of the three rqlite nodes to test fault tolerance / recovery etc. rqlited seems to handle this very elegantly, but every now and then I get a panic from the gorqlite sql driver:
panic: runtime error: index out of range [0] with length 0
I've tried to find the cause of the panic and I can clearly see where it is happening, just not why it is happening.
WriteOne(), WriteOneContext(), WriteOneParameterized() and WriteOneParameterizedContext() are all potentially affected by this too as they all return wra[0], err (eg: here). If wra[0] does not exist (ie: empty slice) you will get a panic. I just can't work out what is causing wra to be empty as your error handling appears to append the error as a result. The only thing I can think of is if the final: ... if results has no results... Hopefully this means more to you than me?
Ensuring the wra[0] exists before trying to return it is obviously the safest solution (though maybe not so elegant), and I do not know what the consequences of that are "down the food chain" (you're way more familiar with the code than I am).