orchestracities / ngsi-timeseries-api

QuantumLeap: a FIWARE Generic Enabler to support the usage of NGSIv2 (and NGSI-LD experimentally) data in time-series databases
https://quantumleap.rtfd.io/
MIT License

Clarification on CrateDB resiliency and wait_for_active_shards = 1 #409

Closed · seut closed this issue 3 years ago

seut commented 3 years ago

@chicco785 As the related PR #384 is already closed and locked (sorry for the delay), I'd like to answer your (still open?) questions about CrateDB's resiliency, in particular about wait_for_active_shards = 1, here.

one thing we'd like to understand though is if wait_for_active_shards = 1 could result in a situation where Crate won't accept any further writes to a table and the only way out of that is to manually alter the table settings.

The wait_for_active_shards value only affects table (or partition) creation and does not (directly) affect writes. For a partitioned table, new partitions are created on the fly with this setting applied, so in that case it does partly affect writes: if not all of the required N shard copies of the new partition are active, the write stalls until the missing replica becomes active or an internal timeout of 30s is reached. Writes therefore become slow if, for example, the setting is > 1 and replica shards are unassigned/unreachable because a node is missing.
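For illustration, this is roughly where the setting lives on a CrateDB table; the table and column names below are made up, only `number_of_replicas` and `"write.wait_for_active_shards"` matter for this discussion:

```sql
-- Hypothetical partitioned time-series table; the WITH clause carries the
-- settings discussed above.
CREATE TABLE IF NOT EXISTS doc.sensor_readings (
    entity_id   TEXT,
    time_index  TIMESTAMP WITH TIME ZONE,
    temperature DOUBLE PRECISION,
    month       TIMESTAMP WITH TIME ZONE GENERATED ALWAYS AS date_trunc('month', time_index)
) CLUSTERED INTO 4 SHARDS
  PARTITIONED BY (month)
  WITH (
    number_of_replicas = 1,
    -- Only 1 active shard copy (the primary) is required when a new
    -- partition is created; replicas may still be initializing.
    "write.wait_for_active_shards" = 1
  );

-- The setting can also be changed later, e.g. to require all copies:
ALTER TABLE doc.sensor_readings SET ("write.wait_for_active_shards" = 'all');
```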

As @amotl already pointed out: to avoid such slow writes (and possible data loss due to missing replicas) when wait_for_active_shards is set to > 1 while doing a rolling upgrade, cluster.graceful_stop.min_availability should be set to full and nodes must be shut down gracefully. This ensures that primary and replica shards are moved away from the node that is about to shut down before it actually stops.
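A minimal sketch of that configuration, assuming the settings are applied cluster-wide via SQL (the timeout value is only an example):

```sql
-- Require all shards (primaries and replicas) to be relocated away
-- from a node before it stops.
SET GLOBAL PERSISTENT "cluster.graceful_stop.min_availability" = 'full';
-- Give the relocation enough time to complete before shutdown is forced.
SET GLOBAL PERSISTENT "cluster.graceful_stop.timeout" = '2h';
```

Note that these settings only take effect if the node is stopped through CrateDB's graceful shutdown (decommission) mechanism rather than killed outright.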

Here's the scenario:

  1. A node N1 holds a primary shard S with records r[1] to r[m + n].
  2. Another node N2 holds S's replica shard, R, with records r[1] to r[m], i.e. n records haven't been replicated yet.
  3. N1 goes down.
  4. Crate won't promote N2 as primary since it knows R is stale w/r/t S.

The only way out of the impasse would be to manually force replica promotion.

As described above, this scenario is not related to wait_for_active_shards but to how write resiliency is achieved in general. On a write operation, the response is only sent (acknowledged) to the client once all active replica shards have responded (or timed out), so a write is either successful (all shards succeeded) or failed (one or more shards failed). That said, your assumption is correct.
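If it helps, the shard copies that currently count as active (and therefore participate in acknowledgement) can be inspected through CrateDB's sys tables; a sketch, with a hypothetical table name:

```sql
-- List primary and replica copies per shard and which node holds them.
SELECT table_name,
       id           AS shard_id,
       "primary",
       state,                    -- e.g. STARTED, INITIALIZING, UNASSIGNED
       node['name'] AS node_name
FROM sys.shards
WHERE table_name = 'sensor_readings'
ORDER BY id, "primary" DESC;
```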

Updated: In case N1 goes down before the replication request was sent to the replica shard on N2, the cluster will not promote the (stale) replica to primary and thus won't process any new writes, resulting in a red table health. After the primary shard comes back (yellow/green health, writes possible again), the missing operations are synced to the replica. If the primary cannot be started (e.g. due to disk corruption), the replica can be forcibly promoted to the new primary. Of course, the missing operations are then lost.
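A sketch of what that recovery path looks like in SQL, reusing the hypothetical table name from above; the shard id and node name are examples and would have to be looked up first (e.g. in sys.shards or sys.allocations):

```sql
-- Check table health: 'RED' means at least one primary shard is missing.
SELECT table_name, health, missing_shards, underreplicated_shards
FROM sys.health
WHERE table_name = 'sensor_readings';

-- Last resort: force-promote a (possibly stale) replica to primary,
-- explicitly accepting that operations not yet replicated are lost.
ALTER TABLE sensor_readings
  REROUTE PROMOTE REPLICA SHARD 0 ON 'node-2'
  WITH (accept_data_loss = TRUE);
```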

If N1 goes down after the replication request was sent, the replica may have processed the operation and can afterwards be promoted to the new primary.

See also https://crate.io/docs/crate/reference/en/4.3/concepts/storage-consistency.html, https://crate.io/docs/crate/reference/en/4.3/concepts/resiliency.html and https://crate.io/docs/crate/reference/en/4.3/appendices/resiliency.html

c0c0n3 commented 3 years ago

@seut wow, thank you so much for the detailed explanation, you're a superstar :-)

Just a couple of questions, if you still have time to spend on this. I haven't had the time to eyeball the reference docs you mentioned; if the answer to my questions is there, don't bother replying with a detailed answer, I'll look it up myself :-)

If N1 comes back online, it will recognise that N2 now holds a valid (newer, more recently promoted) primary shard with r[1] to r[m]

What's the fate of records r[m+1], ..., r[m+n] in this case? Will they be lost or still recoverable?

The other question I have is: what kind of sequence of events could result in Crate freezing a table so that you would have to manually force replica promotion? Is this how I could recover records r[m+1], ..., r[m+n]?

seut commented 3 years ago

@c0c0n3 I was wrong in my description 🤦, apologies. I've updated my original comment instead of adding a correction, to avoid confusion. The updated description should clarify your open questions and also confirms that your initial assumption about when a forced promotion is needed was indeed right.

c0c0n3 commented 3 years ago

@seut no need to apologise! and thank you a stack for the update, that answers all my questions!

chicco785 commented 3 years ago

@seut thanks, we will add this information to the FAQs