@seut wow, thank you so much for the detailed explanation, you're a superstar :-)
Just a couple of questions if you still have time to spend on this. I haven't had the time to eyeball the reference docs you mentioned; if the answer to my questions is there, don't bother replying w/ a detailed answer, I'll look it up myself :-)
If N1 comes back online, it will recognize that N2 has a valid (newer, more recently promoted) primary shard of `r[1]` to `r[m]`. What's the fate of records `r[m+1], ..., r[n]` in this case? Will they be lost or still recoverable?

The other question I have is: what kind of sequence of events could result in Crate freezing a table so that you would have to manually force replica promotion? Is this how I could recover records `r[m+1], ..., r[n]`?
@c0c0n3 I was wrong in my description 🤦, apologies. I've updated my original comment instead of adding a correction, to avoid confusion. The updated description should clarify your open questions and also confirms that your initial assumption about when a forced promotion is needed was indeed right.
@seut no need to apologise! and thank you a stack for the update, that answers all my questions!
@seut thanks, we will add this information to the FAQs
@chicco785 As the related PR #384 is already closed and locked (sorry for the delay), I'd like to answer your (still open?) questions about CrateDB's resiliency, in particular about `wait_for_active_shards = 1`, here.

The `wait_for_active_shards` value only affects table (or partition) creation and does not (directly) affect writes. In the case of a partitioned table, new partitions are created on the fly taking this setting into account, so the setting does partly affect writes there: if not all of the N shard copies required for the new partition are active, the write stalls until the replicas become active or an internal timeout of `30s` is reached. Writes therefore become slow if, for example, the setting is set to `> 1` and replica shards are unassigned/unreachable due to a missing node.
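To make this concrete, here is a minimal sketch of how the setting is typically declared at table-creation time. The table and column names are made up for illustration, and I'm assuming the table-level parameter spelling `"write.wait_for_active_shards"` used by recent CrateDB versions:

```sql
-- Hypothetical partitioned table; all identifiers are illustrative only.
CREATE TABLE metrics (
    ts  TIMESTAMP WITH TIME ZONE,
    day TIMESTAMP WITH TIME ZONE GENERATED ALWAYS AS date_trunc('day', ts),
    val DOUBLE PRECISION
) PARTITIONED BY (day)
  WITH (
      number_of_replicas = 1,
      -- Only table/partition creation waits for this many active shard copies;
      -- a write that triggers creation of a new partition can stall on it.
      "write.wait_for_active_shards" = 1
  );
```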
As @amotl already pointed out: to avoid such slow writes (and possible data loss due to missing replicas) when `wait_for_active_shards` is set to `> 1` while doing a rolling upgrade, `cluster.graceful_stop.min_availability` should be set to `full` and nodes must be shut down gracefully. By doing so, it is ensured that primary and replica shards are moved away from the node to be shut down before the node stops.
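For reference, a sketch of the graceful-stop configuration mentioned above, applied as runtime cluster settings (whether `full` is the right choice depends on your replica configuration; the timeout value below is just an example):

```sql
-- Require all (primary and replica) shards to be relocated away from a node
-- before a graceful shutdown lets it stop.
SET GLOBAL PERSISTENT "cluster.graceful_stop.min_availability" = 'full';

-- Optionally allow more time for the shard relocation to finish.
SET GLOBAL PERSISTENT "cluster.graceful_stop.timeout" = '4h';
```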
As described above, this scenario is not related to `wait_for_active_shards` but to how resiliency of writes is achieved in general. On a write operation, the response to the client is only sent (acknowledged) once all active replica shards have responded (or timed out), so writes either succeed (all shards succeeded) or fail (one or multiple shards failed). Apart from that, your assumption is correct.

**Updated:** In case N1 goes down before the operation request was sent to the replica shard on N2, the cluster will not promote the (stale) replica to a new primary and thus won't process any new writes, resulting in a red table health. After the primary shard comes back (yellow/green health, writes are possible again), the missing operations are synced to the replica. If the primary cannot be started (e.g. due to disk corruption), the replica can be forced to be promoted as the new primary; of course the missing operations are then lost.
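The table health states mentioned above (RED while the primary is missing and the stale replica is not promoted, YELLOW/GREEN once shards are allocated again) can be observed via the `sys.health` table, for example:

```sql
-- One row per table/partition: GREEN = fully allocated, YELLOW = replicas
-- missing or under-replicated, RED = at least one primary shard is missing.
SELECT table_schema, table_name, partition_ident, health,
       missing_shards, underreplicated_shards
FROM sys.health
ORDER BY severity DESC;
```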
If N1 goes down after the replication request was sent, the replica may process the operation and afterwards can be promoted as a new primary.
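If the primary really cannot be recovered, the forced promotion is done per shard with `ALTER TABLE ... REROUTE PROMOTE REPLICA SHARD ...` (available in CrateDB 4.x, as far as I know). The table name, shard id and node name below are placeholders; `accept_data_loss = TRUE` explicitly acknowledges that operations missing on that replica copy are lost:

```sql
-- Promote the (possibly stale) replica copy of shard 0 on node 'node-2' to
-- primary, accepting that any writes it never received are gone for good.
ALTER TABLE metrics
    REROUTE PROMOTE REPLICA SHARD 0 ON 'node-2'
    WITH (accept_data_loss = TRUE);
```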
See also https://crate.io/docs/crate/reference/en/4.3/concepts/storage-consistency.html, https://crate.io/docs/crate/reference/en/4.3/concepts/resiliency.html and https://crate.io/docs/crate/reference/en/4.3/appendices/resiliency.html