palantir / atlasdb

Transactional Distributed Database Layer
https://palantir.github.io/atlasdb/
Apache License 2.0

Paxos Timelock Server can reset to 0; seems to be on leader failure after all followers fail, lose logs and restart #1504

Closed: jeremyk-91 closed this issue 7 years ago

jeremyk-91 commented 7 years ago

Jepsen tests for the Paxos timelock server, with a nemesis that stops the server, wipes the logs, and restarts the server, failed 5 of 20 runs last night.

Copy-paste from Slack: the logs are consistent with the hypothesis that the leader does not reissue lost logs until a new bound is agreed upon (which, from my knowledge of the code, is the case).

All failures are consistent with this pattern: (1) a leader L agreed on bound B, (2) all nodes apart from L were killed and restarted in some order without a new bound being agreed on, (3) L was killed.

I'll rerun the Jepsen tests, flushing the logs every time (i.e. requesting 1M timestamps instead of 1). This is not the fix, but is intended to flush out the cause (if this hypothesis is correct, that case should never fail).

Assuming this hypothesis is correct, a way to deal with this is to poll the learners on startup for the latest knowledge they have, and learn the latest state.
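For illustration, a minimal sketch of that startup step, assuming simplified stand-ins for the learner interfaces (PaxosLearner, PaxosValue and their methods here are hypothetical, not the real AtlasDB signatures):

```java
import java.util.List;
import java.util.Optional;

// Simplified stand-ins for the real Paxos interfaces.
interface PaxosValue {
    long getSeq();
}

interface PaxosLearner {
    Optional<PaxosValue> getGreatestLearnedValue();

    void learn(PaxosValue value);
}

final class StartupLearning {
    private StartupLearning() {}

    /** On startup, poll every remote learner and adopt the latest value any of them knows. */
    static void learnLatestFromPeers(PaxosLearner local, List<PaxosLearner> remotes) {
        for (PaxosLearner remote : remotes) {
            try {
                remote.getGreatestLearnedValue().ifPresent(remoteValue -> {
                    boolean remoteIsNewer = local.getGreatestLearnedValue()
                            .map(localValue -> localValue.getSeq() < remoteValue.getSeq())
                            .orElse(true);
                    if (remoteIsNewer) {
                        local.learn(remoteValue);
                    }
                });
            } catch (RuntimeException e) {
                // Best effort: peers that are down or partitioned are simply skipped.
            }
        }
    }
}
```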

tpetracca commented 7 years ago

@jeremyk-91 I think this may be related to what I was saying the other day about us needing to start thinking about recovery from loss of Paxos logs. The way I understand it, there are two scenarios here (quorum arithmetic sketched after the list):

  1. We have single-node log loss (or log loss on fewer than a quorum of nodes). In this case there is a chance that the node that lost its logs had agreed to something it no longer remembers. I believe the best way to handle this situation is something that resembles https://github.com/palantir/atlasdb/blob/2a7f4be7741bc252ae94c52a613baa18ec39bb0e/atlasdb-cli/src/main/java/com/palantir/atlasdb/cli/command/TruncatePaxosLog.java
  2. We have log loss on a quorum or more of nodes. In this case I assume we've lost the maximum timestamp value and would have to do a fast-forward, but I'm not positive given my limited knowledge of how this all works.
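To make the quorum threshold concrete (my own illustration, not code from the thread): with a majority quorum of floor(N/2) + 1, a 3-node cluster only tolerates log loss on a single node before we are in scenario 2.

```java
// Hypothetical helper, not AtlasDB code: classifies a log-loss event as scenario 1
// (fewer than a quorum of nodes lost logs) or scenario 2 (a quorum or more lost logs).
final class LogLossClassifier {
    private LogLossClassifier() {}

    static int quorumSize(int clusterSize) {
        return clusterSize / 2 + 1; // majority quorum
    }

    static String classify(int clusterSize, int nodesThatLostLogs) {
        return nodesThatLostLogs >= quorumSize(clusterSize)
                ? "scenario 2: the agreed bound itself may be gone; some fast-forward is needed"
                : "scenario 1: individual nodes may have forgotten promises/accepts";
    }

    public static void main(String[] args) {
        // In a 3-node cluster the quorum is 2, so losing logs on 2 nodes already means scenario 2.
        System.out.println(classify(3, 1));
        System.out.println(classify(3, 2));
    }
}
```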

CC @carrino @rjullman

carrino commented 7 years ago

How are we losing so many logs? Our Paxos is designed to handle fail-stop nodes, but not Byzantine failures.

Having a node not respect its promises or accepts is considered a Byzantine failure. This is the node-restore case. We rely on the fact that n is agreed upon before we talk about n+1 in our truncate-new-node code. This makes it so that the new node can take the place of the old one by respecting its promises (by rejecting all requests to these sequences).

If we want to do fancier membership we have to look at overlapping quorums. If you want to be able to just wipe a node and bring it back up (with no truncate) you actually need 4 nodes with 3 agreeing.

In the three-node case you have A, B, C, and C dies and comes back as D. You can see that we don't have overlapping quorums if {A, C} is one quorum and {B, D} is another.

In the four-node case A, B, C, D: if D -> E, then any old quorum containing D still overlaps any new quorum containing E in at least one other node. For example, for old quorum {A, B, D} and new quorum {C, E, X}, X must be A or B.
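A quick way to sanity-check this overlap argument (my own sketch, not anything from the repo) is to brute-force the quorums before and after a node is wiped and replaced, and verify that every old quorum containing the wiped node still intersects every new quorum containing its replacement:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch: brute-force check of the overlapping-quorums argument above.
// With 3 nodes and quorum 2, replacing C with a wiped D breaks overlap ({A,C} vs {B,D});
// with 4 nodes and quorum 3, replacing D with E always leaves a surviving common node.
final class QuorumOverlapCheck {
    private QuorumOverlapCheck() {}

    static List<Set<String>> quorums(List<String> nodes, int quorumSize) {
        List<Set<String>> result = new ArrayList<>();
        int n = nodes.size();
        for (int mask = 0; mask < (1 << n); mask++) {
            if (Integer.bitCount(mask) != quorumSize) {
                continue;
            }
            Set<String> quorum = new HashSet<>();
            for (int i = 0; i < n; i++) {
                if ((mask & (1 << i)) != 0) {
                    quorum.add(nodes.get(i));
                }
            }
            result.add(quorum);
        }
        return result;
    }

    /** True iff every old quorum containing oldNode shares at least one node with every
     *  new quorum containing newNode. */
    static boolean overlapSurvivesReplacement(List<String> oldNodes, String oldNode,
            List<String> newNodes, String newNode, int quorumSize) {
        for (Set<String> oldQuorum : quorums(oldNodes, quorumSize)) {
            if (!oldQuorum.contains(oldNode)) {
                continue;
            }
            for (Set<String> newQuorum : quorums(newNodes, quorumSize)) {
                if (!newQuorum.contains(newNode)) {
                    continue;
                }
                Set<String> common = new HashSet<>(oldQuorum);
                common.retainAll(newQuorum);
                if (common.isEmpty()) {
                    return false;
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // 3 nodes, quorum 2: C wiped and reborn as D -> overlap can be lost.
        System.out.println(overlapSurvivesReplacement(
                List.of("A", "B", "C"), "C", List.of("A", "B", "D"), "D", 2)); // false
        // 4 nodes, quorum 3: D wiped and reborn as E -> overlap always survives.
        System.out.println(overlapSurvivesReplacement(
                List.of("A", "B", "C", "D"), "D", List.of("A", "B", "C", "E"), "E", 3)); // true
    }
}
```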

jeremyk-91 commented 7 years ago

@carrino How timestamps were lost: this is a part of our Jepsen (HA) testing, which is intentionally super-harsh. Thanks for the info that we don't currently handle this - we should eventually do so (though not sure this is an immediate priority).

I think the problem we found during the Jepsen testing was that because we only agree on anything when we reach one million timestamps, as far as the Paxos logs are concerned the tests result in complete cluster wipes.
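For context, a heavily simplified sketch of the batching behaviour being described (my own types and names; the real service differs): a new upper bound is only agreed on when the previously agreed buffer of a million timestamps runs out, so tests that request a single timestamp per run almost never trigger a fresh agreement.

```java
// Illustrative sketch, not the real AtlasDB timestamp service.
interface TimestampBoundStore {
    long getUpperLimit();

    void storeUpperLimit(long limit); // assumed to go through Paxos in the TimeLock setup
}

final class BufferedTimestampAllocator {
    private static final long ALLOCATION_BUFFER_SIZE = 1_000_000;

    private final TimestampBoundStore store;
    private long lastReturned;
    private long upperLimit;

    BufferedTimestampAllocator(TimestampBoundStore store) {
        this.store = store;
        this.upperLimit = store.getUpperLimit();
        this.lastReturned = upperLimit; // never hand out anything at or below the persisted bound
    }

    synchronized long getFreshTimestamp() {
        if (lastReturned + 1 > upperLimit) {
            // Only this branch causes a new bound to be stored, i.e. agreed on;
            // everything inside the buffer is served from memory.
            upperLimit = lastReturned + ALLOCATION_BUFFER_SIZE;
            store.storeUpperLimit(upperLimit);
        }
        return ++lastReturned;
    }
}
```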

@tpetracca Long-run, agree with the truncation process for node recovery. I'm not too sure you can completely safely handle quorum+ failure at all, since your surviving nodes could be a minority partition that never got any news about subsequent agreements; an empirical engineering approach could involve fast-forwarding to the highest known bound + a large constant.
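A minimal sketch of that empirical fast-forward idea, with an arbitrary constant of my own choosing:

```java
import java.util.Collection;

// Sketch of "fast-forward to the highest known bound + a large constant" for quorum-or-more
// loss. SAFETY_BUFFER is an arbitrary illustrative value; as noted above, a surviving
// minority may never have heard of later agreements, so this is empirical, not provably safe.
final class FastForward {
    private static final long SAFETY_BUFFER = 1_000_000_000L;

    private FastForward() {}

    static long recoveredLowerBound(Collection<Long> survivingNodeBounds) {
        long highestKnown = survivingNodeBounds.stream()
                .mapToLong(Long::longValue)
                .max()
                .orElse(0L);
        return highestKnown + SAFETY_BUFFER;
    }
}
```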

As discussed above, this specific issue with Jepsen deals with Paxos seeing a total cluster failure, even though this happened across multiple distinct partial cluster failures that should in theory be recoverable. (But the overall issue is something we should consider.)

I agree that this is something that should be done, but I'm not sure we will have the bandwidth to address it right now. While possible, it seems like an edge case, as we'd need a chain of events like this to happen in, say, a 3-node cluster A, B, C where we suppose A > B > C for breaking ties in proposals:

  • A is the leader, sends prepare(1A) to C
  • C sends promise(1A) to A
  • A sends accept!(1A) to C
  • C crashes
  • C restarts, losing all logs
  • A loses leadership to B
  • B sends prepare(1B) to C
  • C sends promise(1B) to B // bad!
  • B and C finish the protocol, agree on 1B

In the short run, repair the Jepsen tests for the most part (these still leave a very small probability of strobing):

  • Have Timelock nodes' learners learn the latest learned value on startup, or
  • Make Jepsen less aggressive, so that the situation I described in the issue is impossible (not sure if this is easy to do).

We can remove all probability of strobing from this specific issue by combining one of the above solutions with demanding a quorum of 3/4 (or 4/5) as @carrino explained. [Edit: Probably will still strobe, since you need another sequence number to be agreed upon before the next wipe.]

Longer-term:

  • Implement truncate, and
  • Add a truncate command to Jepsen's node start-stopper, and
  • Learners learning the latest value on startup will still be required for liveness in Jepsen, if we go with the first solution above.

carrino commented 7 years ago

If a node loses all its logs, it isn't allowed to come back up again. If you want that node to come back up, you need to run truncate, which will force it to ignore messages for sequences that it may have participated in before.

3/4 will work for the first log-delete failure, but it is important that at least one seq is agreed on after the wipe before another node may be wiped.
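For illustration, the truncate invariant could look roughly like the following (a hypothetical simplified acceptor, not the TruncatePaxosLog CLI or the real acceptor types): a wiped node rejects every prepare/accept at or below the sequence range it may have participated in before, so it can never contradict a promise it has forgotten.

```java
// Sketch of the truncate invariant: after a wipe, reject everything at or below the
// truncation point, because the node may have promised or accepted there before the wipe.
final class TruncatedAcceptor {
    private final long truncationPoint; // highest sequence this node may have touched pre-wipe
    // ... per-sequence promise/accept state for sequences above truncationPoint would live here

    TruncatedAcceptor(long truncationPoint) {
        this.truncationPoint = truncationPoint;
    }

    boolean prepare(long seq) {
        if (seq <= truncationPoint) {
            return false; // reject: we might have made a conflicting promise we no longer remember
        }
        // ... normal Paxos promise handling for seq
        return true;
    }

    boolean accept(long seq) {
        if (seq <= truncationPoint) {
            return false; // reject for the same reason
        }
        // ... normal Paxos accept handling for seq
        return true;
    }
}
```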

jeremyk-91 commented 7 years ago

Yes, good point on it requiring another number to be agreed upon before wiping another node. Will discuss this when I can sync up with the rest of the team.

jeremyk-91 commented 7 years ago

Edit: Deprioritised for the moment, since this (1) is an edge case, (2) is not a regression and (3) we have other higher priorities to deal with. Nonetheless this is a correctness bug and should be worked on at some point in the future.

harrybiddle commented 7 years ago

We merged #1506 today. In that PR, when a node starts up it makes a best-effort attempt to learn the highest value from the other nodes. This covers some more probable failure cases, but not the rarer edge case in #1508.

We have no protection against losing the entire cluster. We're discussing some possible things we can do client-side over on #1505 (Best-effort client-side validation of timestamp correctness).