
Bug: Tile38 followers can get out of sync with leaders #740

Open danwit-at-lytx opened 1 month ago

danwit-at-lytx commented 1 month ago

As originally reported in the Tile38 Slack Channel

Describe the bug
I noticed an issue recently with Tile38 leaders and followers. We have a use case with one leader and two followers, all running in AWS ECS Fargate instances and connected via DNS entries. Here is the issue: AWS periodically stops and redeploys the running instances for maintenance. Since the leader and followers are separate containers, they are not redeployed at the same time. When the leader is redeployed, the existing followers connect to the new leader and start storing data from it. They do not, however, remove the data replicated from the old leader first, and therefore end up with two different copies of the data, one old and one new. This creates a lot of weird behavior and breaks the customer experience.

Note: Our data is ephemeral as it's highly time sensitive, so when a new leader is deployed it starts from scratch; it is not restoring from some existing external copy of the AOF file.

To Reproduce
I've had this happen multiple times now. Best I can tell, when the follower is connected to the leader via a DNS host name and a new instance of the leader is stood up at the same host name (i.e. a new, empty database), the follower starts following the new leader but does not clear out its existing data first, leaving it out of sync with the leader. The follower reports that it is caught up to the leader, but it is not in sync.
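
For anyone trying to confirm the divergence, here is a minimal detection sketch, not an official tool: it assumes a Go Redis-protocol client (redigo), the default Tile38 port 9851, and placeholder hostnames, and simply compares the key sets returned by KEYS * on the leader and on a follower. Any key present on the follower but missing on the new leader is leftover data from the previous leader.

```go
// Minimal sketch: compare the key sets of a leader and a follower to spot
// stale data on the follower. Hostnames below are placeholders for your
// ECS service names; 9851 is the default Tile38 port.
package main

import (
	"fmt"
	"log"

	"github.com/gomodule/redigo/redis"
)

// keySet returns the set of collection keys a Tile38 server reports via KEYS *.
func keySet(addr string) (map[string]bool, error) {
	conn, err := redis.Dial("tcp", addr)
	if err != nil {
		return nil, err
	}
	defer conn.Close()

	keys, err := redis.Strings(conn.Do("KEYS", "*"))
	if err != nil {
		return nil, err
	}
	set := make(map[string]bool, len(keys))
	for _, k := range keys {
		set[k] = true
	}
	return set, nil
}

func main() {
	leaderKeys, err := keySet("tile38-leader.example.internal:9851")
	if err != nil {
		log.Fatal(err)
	}
	followerKeys, err := keySet("tile38-follower.example.internal:9851")
	if err != nil {
		log.Fatal(err)
	}

	// Any key that exists on the follower but not on the (new) leader is
	// leftover data from the previous leader generation.
	for k := range followerKeys {
		if !leaderKeys[k] {
			fmt.Println("stale key on follower:", k)
		}
	}
}
```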

Expected behavior
When a follower connects to a leader, it should clear out anything already in its own DB so that it contains only a copy of the leader's DB.

Operating System (please complete the following information):

iwpnd commented 1 month ago

If your data is ephemeral, tie the health of your followers to the health of your leader. That way, if your leader restarts, so do your followers. That's an option if your architecture supports a downtime of a couple of seconds. As your leader loses its AOF file, so should your followers.
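
One possible way to wire that up, as a rough sketch rather than a recommendation: run a small healthcheck binary in each follower container that records the id reported by the leader's SERVER command and exits non-zero when the leader is unreachable or that id changes, so the orchestrator recycles the follower together with its ephemeral data. The leader address and state file path below are placeholders, and it assumes a fresh leader starting with an empty data directory reports a new id.

```go
// Sketch of a follower-side healthcheck that ties follower health to the
// leader's identity. Exit code 0 = healthy, non-zero = unhealthy (the
// orchestrator should then replace the follower container).
package main

import (
	"encoding/json"
	"fmt"
	"os"

	"github.com/gomodule/redigo/redis"
)

const (
	leaderAddr = "tile38-leader.example.internal:9851" // placeholder
	stateFile  = "/tmp/last-leader-id"                 // survives between checks, not container restarts
)

func main() {
	conn, err := redis.Dial("tcp", leaderAddr)
	if err != nil {
		fmt.Fprintln(os.Stderr, "leader unreachable:", err)
		os.Exit(1)
	}
	defer conn.Close()

	// Switch the connection to JSON replies, then read the SERVER stats.
	if _, err := conn.Do("OUTPUT", "json"); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	raw, err := redis.Bytes(conn.Do("SERVER"))
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	var resp struct {
		Stats struct {
			ID string `json:"id"`
		} `json:"stats"`
	}
	if err := json.Unmarshal(raw, &resp); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}

	// Assumption: a fresh leader started with an empty data directory reports
	// a new id. If the id changed since the last check, a new leader replaced
	// the old one at the same hostname, so force the follower to be recycled.
	prev, _ := os.ReadFile(stateFile)
	if len(prev) > 0 && string(prev) != resp.Stats.ID {
		fmt.Fprintln(os.Stderr, "leader id changed, marking follower unhealthy")
		os.Exit(1)
	}
	_ = os.WriteFile(stateFile, []byte(resp.Stats.ID), 0o644)
}
```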

That is not to say that I would consider this expected behavior.

We host our Tile38 instances on a dedicated node pool to avoid unnecessary restarts like that. Also, while our leader has its own private volume, followers use the node volume and therefore have ephemeral storage.

edit: I tried to take a look and tracked it down to this:

https://github.com/tidwall/tile38/blob/51e686279795c608f94f8e8f57443713701694db/internal/server/checksum.go#L196

There is no complete checksum between leader and follower, yet I am wondering why, in your case, the follower AOF is not recreated, as there is unlikely to be even a partial match between the old leader AOF (now the follower's current AOF) and the new leader AOF.
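
For illustration only, here is a rough sketch of the kind of prefix comparison that code path relies on, done externally with the AOFMD5 command: ask both servers for the MD5 of the same byte range of their AOFs and compare. The addresses and the follower AOF size are placeholders, and it assumes the RESP reply to AOFMD5 is the bare MD5 string.

```go
// Sketch: check whether a follower's AOF is a matching prefix of the leader's
// AOF by comparing AOFMD5 over the same byte range. Addresses and the
// follower AOF size are placeholders (the size could be taken from the
// follower's SERVER stats).
package main

import (
	"fmt"
	"log"

	"github.com/gomodule/redigo/redis"
)

// aofmd5 asks a Tile38 server for the MD5 of size bytes of its AOF starting
// at pos.
func aofmd5(addr string, pos, size int64) (string, error) {
	conn, err := redis.Dial("tcp", addr)
	if err != nil {
		return "", err
	}
	defer conn.Close()
	return redis.String(conn.Do("AOFMD5", pos, size))
}

func main() {
	const (
		leader          = "tile38-leader.example.internal:9851"
		follower        = "tile38-follower.example.internal:9851"
		followerAOFSize = int64(4096) // placeholder; size of the follower's current AOF
	)

	lsum, err := aofmd5(leader, 0, followerAOFSize)
	if err != nil {
		// An out-of-range error here also means the follower's AOF is larger
		// than the new leader's, i.e. the histories have diverged.
		log.Fatal(err)
	}
	fsum, err := aofmd5(follower, 0, followerAOFSize)
	if err != nil {
		log.Fatal(err)
	}
	if lsum != fsum {
		fmt.Println("follower AOF is not a prefix of the leader AOF; a full resync is needed")
	} else {
		fmt.Println("follower AOF matches the leader AOF prefix")
	}
}
```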

Can you please try to replicate this behaviour with Tile38 1.32.2? @danwit-at-lytx