nats-io / nats-streaming-server

NATS Streaming System Server
https://nats.io
Apache License 2.0

nats streaming follower is not restoring channels from leader #1283

Open · fowlerp-qlik opened this issue 1 year ago

fowlerp-qlik commented 1 year ago

Hi. On January 16th we performed a Kubernetes upgrade. This involved rolling the Kubernetes nodes in a given order so that the NATS nodes/pods rolled first, then the NATS Streaming pods rolled, with the two followers first and the leader last.

I see the following sequence:

1) nats-streaming-0 becomes the new leader at 17:41:12.
2) nats-streaming-2 restores two channels at 17:43:26.
3) Immediately after the restoration, nats-streaming-2 hits a runtime error: `panic: runtime error: invalid memory address or nil pointer dereference`.
4) nats-streaming-2 restarts but very quickly shuts down due to "STREAM: Failed to start: log not found".
5) nats-streaming-2 restarts again but shuts down for the same reason.
6) And one more time.
7) nats-streaming-2 starts up again 1.75 hours later (no idea why such a delay). No restoring is performed, even though some channels were restored in steps 4, 5 and 6.
8) Yesterday we rolled the nats-streaming-2 pod, but again no channels were restored.

Is there a way to mitigate/correct this so that nats-streaming-2 restores the channels? I worry that if there is a subsequent leader election, nats-streaming-2 may become the leader, with potentially bad results.

Log file from nats-streaming-2:

nats-streaming-not-attempting-restore.csv

kozlovic commented 1 year ago

@fowlerp-qlik I think that nats-streaming-2's state (both streaming and raft) should be removed and this server restarted.

The original panic (the last one in the file, since the lines appear to be timestamped from latest to oldest) seems to indicate that this server (a follower) received a snapshot from the leader containing a channel whose ID is lower than its own, which should NEVER happen.

Channels have an ID to handle the situation where a channel "foo" is created, then deleted due to max inactivity, then recreated later. Without a unique ID assigned at creation, a restarting server could recover its own snapshot file containing information about a channel "foo" that, at the time of the snapshot, had messages 1 to 10. If that channel was later expired and new messages were added to a recreated "foo", say 1 to 3, then on restart the server would believe "foo" should have messages 1 to 10 (from the snapshot) while its own streaming state only has 1 to 3. The reverse can also happen: if the local storage has messages 1 to 20, the server would think it has 10 more messages than it should and would try to delete them. This is why channels are assigned an ID when created, and why that ID is part of the snapshot, etc.
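To make that concrete, here is a minimal sketch (hypothetical type and function names, not the actual nats-streaming-server code) of how a per-creation channel ID lets a restarting server tell whether snapshot data about "foo" still applies to its local copy of "foo":

```go
package main

import "fmt"

// channel is a stand-in for a channel's locally recovered state.
type channel struct {
	name     string
	id       uint64 // assigned at creation; a re-created "foo" gets a new ID
	firstSeq uint64
	lastSeq  uint64
}

// snapshotEntry is a stand-in for what a snapshot records about a channel.
type snapshotEntry struct {
	name     string
	id       uint64
	firstSeq uint64
	lastSeq  uint64
}

// applySnapshotEntry only applies snapshot data to the local channel when
// the IDs match, i.e. when both refer to the same incarnation of the name.
func applySnapshotEntry(local *channel, snap snapshotEntry) {
	if local.id != snap.id {
		// The snapshot describes a different incarnation of this channel
		// name; applying its sequence range would corrupt local state.
		fmt.Printf("skip %q: snapshot id=%d, local id=%d\n", snap.name, snap.id, local.id)
		return
	}
	local.firstSeq, local.lastSeq = snap.firstSeq, snap.lastSeq
	fmt.Printf("restored %q to [%d,%d]\n", local.name, local.firstSeq, local.lastSeq)
}

func main() {
	// "foo" was deleted for inactivity and re-created, so the local ID (20)
	// is newer than the one recorded in the old snapshot (10).
	foo := &channel{name: "foo", id: 20, firstSeq: 1, lastSeq: 3}
	applySnapshotEntry(foo, snapshotEntry{name: "foo", id: 10, firstSeq: 1, lastSeq: 10})
}
```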

With that being said, it looks like when nats-streaming-2 received a snapshot from the leader about a channel, say "foo", the snapshot carried an ID of, say, 10, while its own version of that same channel name had ID 20, which would indicate that it has a newer version of this channel. The function that looks up a channel with a given ID notices the mismatch and returns nil. When processing its own snapshot during the "init" phase, we simply ignore such a channel and move on to the other(s), but at "runtime" we do not, and this causes the panic. It would be easy to make the code skip a nil channel in both cases, but I believe that would be wrong: again, this situation should never happen, and skipping it would likely cause an even bigger mess.
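A rough sketch of the failure mode described above (again with made-up names, not the real code paths): the lookup refuses to return a channel whose local ID does not match the ID in the leader's snapshot, the init-phase path tolerates the nil result, and the runtime path dereferences it:

```go
package main

import "fmt"

type channel struct {
	name string
	id   uint64
}

// channels stands in for the follower's locally recovered channels; the local
// incarnation of "foo" (ID 20) is newer than what the leader's snapshot
// refers to (ID 10).
var channels = map[string]*channel{
	"foo": {name: "foo", id: 20},
}

// lookupChannelWithID returns nil when no channel with that name and exact
// ID exists locally.
func lookupChannelWithID(name string, id uint64) *channel {
	c := channels[name]
	if c == nil || c.id != id {
		return nil
	}
	return c
}

// restoreChannelFromSnapshot mimics the two code paths: the init phase skips
// a channel that cannot be looked up, while the runtime path assumes the
// lookup cannot fail.
func restoreChannelFromSnapshot(name string, id uint64, initPhase bool) {
	c := lookupChannelWithID(name, id)
	if c == nil && initPhase {
		fmt.Printf("init: skipping %q (snapshot id %d)\n", name, id)
		return
	}
	// No nil check here: this "should never happen" at runtime, so a
	// mismatched ID ends in a nil pointer dereference, i.e. the kind of
	// panic seen in the nats-streaming-2 log.
	fmt.Printf("restoring %q (id %d)\n", c.name, c.id)
}

func main() {
	restoreChannelFromSnapshot("foo", 10, false) // panics: nil dereference
}
```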

Could it be that this server was stopped but its state not cleared, while the two other servers had their state cleared at some point, restarted, and ran for a while before nats-streaming-2 was restarted with its previous state?

fowlerp-qlik commented 1 year ago

The NATS cluster is part of a large, live production environment, so a server restart would be disruptive. The Raft log is file based, but our Kubernetes chart specifies a RAM disk (for speed), so if we shut all three NATS Streaming pods down and restart them, there will be message loss.

Currently the follower seemingly doesn't even try to restore its channels from the leader on a pod restart. Is there any way to prod it to do so?

kozlovic commented 1 year ago

@fowlerp-qlik I did not suggest restarting the whole cluster, just the one server that had the panic and does not seem to recover channels. Again, I was saying to clear that server's state (both the RAFT and datastore directories); it should then recover the whole state from the leader.
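A minimal sketch of that remediation, assuming placeholder paths for where the chart mounts the streaming file store and the RAFT log (often configured via `--dir` and `--cluster_log_path`; check your own deployment). In Kubernetes this would typically be run via `kubectl exec` or an init step before restarting only the affected pod:

```go
package main

import (
	"log"
	"os"
)

func main() {
	// Placeholder paths: substitute whatever your chart mounts for the
	// streaming datastore and the RAFT log on nats-streaming-2 only.
	stateDirs := []string{
		"/data/stan/store", // streaming datastore
		"/data/stan/log",   // RAFT log
	}
	for _, dir := range stateDirs {
		if err := os.RemoveAll(dir); err != nil {
			log.Fatalf("failed to clear %s: %v", dir, err)
		}
		log.Printf("cleared %s", dir)
	}
	// After this, restart only the nats-streaming-2 pod; it should rejoin
	// as a follower and resync its channels from the current leader.
}
```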

I have tried to figure out what conditions could lead to the panic and could not find any, unless (now that I know the state is ephemeral due to the RAM disk) that server (nats-streaming-2) was separated from the network but not stopped, so it kept its state, while the two others had at some point been restarted with a clean state. But you would likely have noticed that.