Open nickelization opened 1 year ago
{quote}I don’t think Gautam was suggesting we reorder the response packets, but rather to buffer the prepare response until the execute response has been generated, so that if they don’t match up we have a chance to fix things before we start to respond to the client.{quote}
That is accurate, but the suggestion does not work for the reasons that Nick and Griffin mention.
{quote}Given that, we might be able to get away with just returning an error to the client here.{quote}
Agreed. It’s not perfect, but it might be okay.
Interestingly, if you run the following while connected directly to Postgres:
SELECT * FROM t;
ALTER TABLE t ADD COLUMN y INT;
you get:
ERROR: cached plan must not change result type
Given that, we might be able to get away with just returning an error to the client here.
{quote}That said, still not sure if that suggestion is feasible, since I would assume the client would not send the execute request to us at all until after we’ve sent the prepare response.{quote}
Well, more importantly, prepares and executes are not 1:1: a single prepared statement can be executed any number of times, and the ALTER TABLE can be replicated at any time.
An execute succeeding once gives no guarantee that it will succeed later (specifically, once the ALTER TABLE is in fact replicated, the view will get dropped and the execute will fail).
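To make the "not 1:1" point concrete, here is a minimal Python sketch of the "just return an error" option. It is only a toy model, not ReadySet's or Postgres's actual code: `Table`, `PreparedSelectStar`, and `ResultTypeChanged` are hypothetical names, and the error text is borrowed from the Postgres message quoted above. The statement remembers the columns it reported at prepare time and refuses to execute once the table no longer matches.

```python
class ResultTypeChanged(Exception):
    """Raised when the result columns no longer match the prepare response."""


class Table:
    # Hypothetical stand-in for the current (replicated) schema of a table.
    def __init__(self, columns):
        self.columns = list(columns)
        self.rows = []

    def add_column(self, name, default=None):
        # Models ALTER TABLE ... ADD COLUMN reaching this copy of the table.
        self.columns.append(name)
        self.rows = [row + (default,) for row in self.rows]


class PreparedSelectStar:
    # Models a prepared `SELECT * FROM t`: columns are fixed at prepare time.
    def __init__(self, table):
        self.table = table
        self.columns = tuple(table.columns)  # what the prepare response said

    def execute(self):
        # Compare the live schema with what the client was already told.
        if tuple(self.table.columns) != self.columns:
            raise ResultTypeChanged("cached plan must not change result type")
        return list(self.table.rows)


t = Table(["x"])
t.rows.append((1,))
stmt = PreparedSelectStar(t)

first = stmt.execute()   # schema still matches, so this execute succeeds
t.add_column("y")        # the ALTER TABLE lands at some later, arbitrary time

try:
    stmt.execute()
    second = "ok"
except ResultTypeChanged as err:
    second = str(err)    # the same statement now fails
```

The first execute succeeding says nothing about the second one, which is why buffering the prepare response until one execute has completed would not close the race.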
Can we run our own auto-execute as a test after the prepare?
I don’t think Gautam was suggesting we reorder the response packets, but rather to buffer the prepare response until the execute response has been generated, so that if they don’t match up we have a chance to fix things before we start to respond to the client.
That said, still not sure if that suggestion is feasible, since I would assume the client would not send the execute request to us at all until after we’ve sent the prepare response. But, then again, I didn’t think I noticed any client requests coming in between us sending the RowDescription and DataRow packets, so my current mental model feels a little fuzzy to me right now.
We can’t change anything about the ordering of messages, since they’re specified by the client protocol used by the databases.
<\~accountid:609236b4b9ac3a007151a40b> can we treat prepares and executes as a transaction of sorts, so that the prepare response can only be sent after the execute has also completed?
prepares can't be “redone” - we've already sent the response to the client
After an ALTER TABLE has been replicated to ReadySet, can we redo any prior prepare statements that we had done and then, of course, do the subsequent execution of the prepare result?
The issue here is not that we’re returning a stale value, but that the actual schema of our results differs from the schema of the prepare response, which breaks all clients. That’d be the case, to varying degrees, in all of my proposed approaches
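As a rough illustration of why a schema mismatch breaks clients, here is a small Python sketch. It assumes a hypothetical client that decodes each row positionally against the column list from the prepare response; `decode_row` and the column names are made up for the example, and this is not a claim about how any particular driver or the test harness is written.

```python
def decode_row(prepared_columns, data_row):
    # Pair each value in the row with the column descriptor at the same
    # position, as reported in the prepare response.
    decoded = {}
    for i, value in enumerate(data_row):
        name = prepared_columns[i]  # IndexError if the row is wider
        decoded[name] = value
    return decoded


# Columns reported at prepare time, before the new column is visible.
prepared_columns = ["id", "x"]

# A row that matches the prepare response decodes fine.
ok = decode_row(prepared_columns, (1, 10))

# A row produced after the column was added has one value too many.
try:
    decode_row(prepared_columns, (1, 10, None))
    outcome = "decoded"
except IndexError:
    outcome = "IndexError"
```

With two prepared columns and a three-value row, the lookup at index 2 fails, which has the same shape as the `the len is 2 but the index is 2` panic in the original report.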
So, in number 3, is that covered by our notion of eventual consistency? We would return a stale value, but that would be ok?
Ok, here's what's going on here:
SELECT * from that table
It's not clear what the right thing to do is here - ideally, we'd like to be able to detect that the upstream and readyset prepare responses are different, and make that particular statement upstream only - but since we frequently return different column names and types than upstream (which is a large, but very difficult to solve issue) we can't actually reliably do that with total correctness. So we're left with a few options:
ALTER TABLE, before the ALTER TABLE has had a chance to be replicated, will fail
cc <accountid:62a73f87ddc560006e8a7baf> for product tradeoffs, <accountid:60ae9f5411a545006914db05> for extra eyes on this in case there’s something obvious I’m missing here.
Also note, I maybe should’ve made it extra clear, this just seems to be a temporary race condition – after several seconds have passed, ReadySet no longer returns the bad responses.
I’m not sure if this issue is worth addressing right now, so for the moment I’m just going to file this bug, and also create a CL with a unit test (marked #[ignore]) that reproduces the failure. Then I will temporarily disable the AddColumn operation in the DDL vertical tests (or otherwise work around this bug) and continue looking for more issues.
Found a failing test case via the DDL Vertical test suite, which appears to be caused by a bug in ReadySet:
thread 'run_cases' panicked at 'Test failed: index out of bounds: the len is 2 but the index is 2;
We then get a minimal failing case that looks like:
From SyncLinear.com | REA-2216