nats-io / nats-server

High-Performance server for NATS.io, the cloud and edge native messaging system.
https://nats.io
Apache License 2.0
16k stars 1.41k forks source link

[FIXED] Don't `InstallSnapshot` during shutdown, would race with `monitorStream`/`monitorConsumer` #6153

Closed MauriceVanVeen closed 1 day ago

MauriceVanVeen commented 1 day ago

When stopping a stream or consumer, we would attempt to install a snapshot. However, this would race with what's happening in monitorStream/monitorConsumer at that time.

For example:

  1. In applyStreamEntries we call into mset.processJetStreamMsg to persist one or multiple messages.
  2. We call mset.stop(..) either before or during the above.
  3. In mset.stop(..) we'd wait for mset.processJetStreamMsg to release the lock so we can enter mset.stateSnapshotLocked(). We create a snapshot with new state here!
  4. Now we call into InstallSnapshot to persist above snapshot, but n.applied does not contain the right value, the value will be lower.
  5. Then applyStreamEntries finishes and we end with calling n.Applied(..).

This would be a race condition depending on if 4 happened before or after 5.

It's essential that the snapshot we make is aligned with the n.applied value. If we don't that means we'll replay and need to increase mset.clfs which will snowball into stream desync due to this shift.

The only place where we can guarantee that the snapshot and applied are aligned is in doSnapshot of monitorStream and monitorConsumer (and monitorCluster), so we must not attempt installing snapshots outside of those.

Signed-off-by: Maurice van Veen github@mauricevanveen.com