nats-io / nats-streaming-server

NATS Streaming System Server
https://nats.io
Apache License 2.0

Raft log sync between leader and follower fails #1291

Closed: fowlerp-qlik closed this issue 1 year ago

fowlerp-qlik commented 1 year ago

Hi, sorry to bother you one last time. We have integrated the latest nats-streaming version (0.25.5); the previous version we were running is 0.25.2.

In our staging Kubernetes clusters we rolled a non-leader nats-streaming pod to pick up the latest Docker image. It tried to get its Raft log from the leader; the logs below seem to show that was successful, but we are not sure. Then there was an error involving the unmarshaling of a time value (due to a mismatch in Go versions?).

We rolled the second non-leader and the same error occurred. We then rolled the leader; a new leader was elected and the NATS Streaming clients were happy.

So overall, what would be the impact of the unmarshaling error? Is there a workaround?

[1] 2023/06/28 12:11:01.857543 [ERR] STREAM: raft: failed to appendEntries to: peer="{Voter messaging-nats-streaming-1 messaging-nats-streaming-cluster.messaging-nats-streaming-1.messaging-nats-streaming-cluster}" error=EOF

[1] 2023/06/28 12:11:01.856595 [ERR] STREAM: raft-nats: failed to decode incoming command: error="Time.UnmarshalBinary: unsupported version"

[1] 2023/06/28 12:11:01.856433 [DBG] STREAM: raft-nats: accepted connection: local-address=messaging-nats-streaming-cluster.messaging-nats-streaming-1.messaging-nats-streaming-cluster remote-address=messaging-nats-streaming-cluster.messaging-nats-streaming-0.messaging-nats-streaming-cluster

[1] 2023/06/28 12:11:01.848489 [INF] STREAM: raft: Installed remote snapshot

[1] 2023/06/28 12:11:01.848460 [INF] STREAM: raft: snapshot restore progress: id=300-5857522740-1687954261835 last-index=5857522740 last-term=300 size-in-bytes=55693 read-bytes=55693 percent-complete="100.00%"

[1] 2023/06/28 12:11:01.848437 [INF] STREAM: done restoring from snapshot

kozlovic commented 1 year ago

It looks like the issue is that the streaming server upgraded the github.com/hashicorp/go-msgpack module to v2.1.0 at one point, while github.com/hashicorp/raft is still using v0.5.5. We would need the raft library to update its dependency to v2.1.0 to handle the time unmarshaling.
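To make the mismatch concrete, here is a minimal sketch (not code from either repository, and not a confirmed reproduction): it encodes a struct containing a time.Time with the go-msgpack v2 codec, as the upgraded server would, and decodes it with the old v0.5.5 codec, as a peer whose raft library still links the old module would. The `entry` struct and its fields are made up for illustration, and it assumes both go-msgpack major versions are listed in go.mod.

```go
package main

import (
	"fmt"
	"time"

	oldcodec "github.com/hashicorp/go-msgpack/codec"    // the version hashicorp/raft still depends on (v0.5.5)
	newcodec "github.com/hashicorp/go-msgpack/v2/codec" // the version the upgraded streaming server links (v2.1.0)
)

// entry is a stand-in for a Raft payload that carries a timestamp.
type entry struct {
	Index      uint64
	AppendedAt time.Time
}

func main() {
	in := entry{Index: 1, AppendedAt: time.Now()}

	// Encode with the v2 handle, as a node running the newer server would.
	var buf []byte
	if err := newcodec.NewEncoderBytes(&buf, &newcodec.MsgpackHandle{}).Encode(&in); err != nil {
		fmt.Println("encode:", err)
		return
	}

	// Decode with the old handle, as a peer still on go-msgpack v0.5.5 would.
	var out entry
	if err := oldcodec.NewDecoderBytes(buf, &oldcodec.MsgpackHandle{}).Decode(&out); err != nil {
		// With mismatched codecs the time field may fail to decode,
		// e.g. the "Time.UnmarshalBinary: unsupported version" seen above.
		fmt.Println("decode error:", err)
		return
	}
	fmt.Println("decoded:", out)
}
```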

I followed the steps you described above, but after the 2nd follower was restarted with the newer version (and was getting the time unmarshal error), clients would fail to connect/send/receive. However, after restarting the leader with the new version, everything was fine. So once the cluster has been fully upgraded, I think it is fine, but if the upgrade process takes time, then after the 2nd follower is upgraded there will likely be issues with the clients (until the last node is upgraded, or perhaps if a leader is re-elected among the 2 new followers).
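For anyone timing such a rollout, a small client-side probe can show whether the cluster is still in that degraded mixed-version window. This is only an illustrative sketch using the stan.go client; the cluster ID, client ID, and NATS URL below are placeholders for your own deployment.

```go
package main

import (
	"log"
	"time"

	stan "github.com/nats-io/stan.go"
)

func main() {
	// Placeholders: substitute your own cluster ID, client ID, and NATS URL.
	sc, err := stan.Connect(
		"messaging-cluster",
		"upgrade-probe",
		stan.NatsURL("nats://messaging-nats:4222"),
		stan.ConnectWait(5*time.Second), // bound the wait so a degraded cluster fails fast
	)
	if err != nil {
		log.Fatalf("connect failed (cluster may still be in the mixed-version window): %v", err)
	}
	defer sc.Close()

	// A publish exercises the replication path through the Raft leader.
	if err := sc.Publish("upgrade.probe", []byte("ping")); err != nil {
		log.Fatalf("publish failed: %v", err)
	}
	log.Println("client connect/publish OK")
}
```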

As I said, there is not much we can do at the level of this project; we need github.com/hashicorp/raft to have its github.com/hashicorp/go-msgpack dependency updated.