
Bug: Tile38 followers can get out of sync with leaders #740

Open danwit-at-lytx opened 1 month ago

danwit-at-lytx commented 1 month ago

As originally reported in the Tile38 Slack Channel

Describe the bug
I noticed an issue recently with Tile38 leaders and followers. We have a use case with one leader and two followers, all running in AWS ECS Fargate instances and connected via DNS entries. Here is the issue: AWS periodically stops and redeploys the running instances for maintenance. Since the leader and followers are separate containers, they are not redeployed at the same time. When the leader is redeployed, the existing followers connect to the new leader and start storing data from it. They do not, however, remove the data replicated from the old leader first, and therefore end up with two different copies of the data, one old and one new. This creates a lot of weird behavior and breaks the customer experience.

Note: Our data is ephemeral as it's highly time sensitive, so when a new leader is deployed it starts from scratch; it is not restoring from some existing external copy of the AOF file.

To Reproduce
I've had this happen multiple times now. Best I can tell, when the follower is connected to the leader via a DNS host name and a new instance of the leader is stood up at the same host name (i.e. a new, empty database), the follower starts following the new leader but does not clear out its existing data first, leaving it out of sync with the leader. The follower reports that it is caught up to the leader, but it is not in sync.
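
For anyone trying to confirm the divergence, here is a minimal detection sketch, not an official tool: it assumes a Go Redis-protocol client (redigo), the default Tile38 port 9851, and placeholder hostnames, and simply compares the key sets returned by KEYS * on the leader and on a follower. Any key present on the follower but missing on the new leader is leftover data from the previous leader.

```go
// Minimal sketch: compare the key sets of a leader and a follower to spot
// stale data on the follower. Hostnames below are placeholders for your
// ECS service names; 9851 is the default Tile38 port.
package main

import (
	"fmt"
	"log"

	"github.com/gomodule/redigo/redis"
)

// keySet returns the set of collection keys a Tile38 server reports via KEYS *.
func keySet(addr string) (map[string]bool, error) {
	conn, err := redis.Dial("tcp", addr)
	if err != nil {
		return nil, err
	}
	defer conn.Close()

	keys, err := redis.Strings(conn.Do("KEYS", "*"))
	if err != nil {
		return nil, err
	}
	set := make(map[string]bool, len(keys))
	for _, k := range keys {
		set[k] = true
	}
	return set, nil
}

func main() {
	leaderKeys, err := keySet("tile38-leader.example.internal:9851")
	if err != nil {
		log.Fatal(err)
	}
	followerKeys, err := keySet("tile38-follower.example.internal:9851")
	if err != nil {
		log.Fatal(err)
	}

	// Any key that exists on the follower but not on the (new) leader is
	// leftover data from the previous leader generation.
	for k := range followerKeys {
		if !leaderKeys[k] {
			fmt.Println("stale key on follower:", k)
		}
	}
}
```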

Expected behavior
When a follower connects to a leader, it should clear out anything already in its own DB so that it contains only a copy of the leader's DB.

Operating System (please complete the following information):

iwpnd commented 1 month ago

If your data is ephemeral, tie the health of your followers to the health of your leader. That way, if your leader restarts, so do your followers. That's an option if your architecture supports a downtime of a couple of seconds. As your leader loses its AOF file, so should your followers.
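
One possible way to wire that up, as a rough sketch rather than a recommendation: run a small healthcheck binary in each follower container that records the id reported by the leader's SERVER command and exits non-zero when the leader is unreachable or that id changes, so the orchestrator recycles the follower together with its ephemeral data. The leader address and state file path below are placeholders, and it assumes a fresh leader starting with an empty data directory reports a new id.

```go
// Sketch of a follower-side healthcheck that ties follower health to the
// leader's identity. Exit code 0 = healthy, non-zero = unhealthy (the
// orchestrator should then replace the follower container).
package main

import (
	"encoding/json"
	"fmt"
	"os"

	"github.com/gomodule/redigo/redis"
)

const (
	leaderAddr = "tile38-leader.example.internal:9851" // placeholder
	stateFile  = "/tmp/last-leader-id"                 // survives between checks, not container restarts
)

func main() {
	conn, err := redis.Dial("tcp", leaderAddr)
	if err != nil {
		fmt.Fprintln(os.Stderr, "leader unreachable:", err)
		os.Exit(1)
	}
	defer conn.Close()

	// Switch the connection to JSON replies, then read the SERVER stats.
	if _, err := conn.Do("OUTPUT", "json"); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	raw, err := redis.Bytes(conn.Do("SERVER"))
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	var resp struct {
		Stats struct {
			ID string `json:"id"`
		} `json:"stats"`
	}
	if err := json.Unmarshal(raw, &resp); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}

	// Assumption: a fresh leader started with an empty data directory reports
	// a new id. If the id changed since the last check, a new leader replaced
	// the old one at the same hostname, so force the follower to be recycled.
	prev, _ := os.ReadFile(stateFile)
	if len(prev) > 0 && string(prev) != resp.Stats.ID {
		fmt.Fprintln(os.Stderr, "leader id changed, marking follower unhealthy")
		os.Exit(1)
	}
	_ = os.WriteFile(stateFile, []byte(resp.Stats.ID), 0o644)
}
```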

That is not to say that I would consider this expected behavior.

We host our Tile38 instances on a dedicated node pool to avoid unnecessary restarts like that. Also, while our leader has its own private volume, followers use the node volume and therefore have ephemeral storage.

edit: I tried to take a look and tracked it down to this:

https://github.com/tidwall/tile38/blob/51e686279795c608f94f8e8f57443713701694db/internal/server/checksum.go#L196

There is no complete checksum between leader and follower, yet I am wondering why, in your case, the follower AOF is not recreated, as there is unlikely to be even a partial match between the old leader AOF (now the follower's current AOF) and the new leader AOF.
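
For illustration only, here is a rough sketch of the kind of prefix comparison that code path relies on, done externally with the AOFMD5 command: ask both servers for the MD5 of the same byte range of their AOFs and compare. The addresses and the follower AOF size are placeholders, and it assumes the RESP reply to AOFMD5 is the bare MD5 string.

```go
// Sketch: check whether a follower's AOF is a matching prefix of the leader's
// AOF by comparing AOFMD5 over the same byte range. Addresses and the
// follower AOF size are placeholders (the size could be taken from the
// follower's SERVER stats).
package main

import (
	"fmt"
	"log"

	"github.com/gomodule/redigo/redis"
)

// aofmd5 asks a Tile38 server for the MD5 of size bytes of its AOF starting
// at pos.
func aofmd5(addr string, pos, size int64) (string, error) {
	conn, err := redis.Dial("tcp", addr)
	if err != nil {
		return "", err
	}
	defer conn.Close()
	return redis.String(conn.Do("AOFMD5", pos, size))
}

func main() {
	const (
		leader          = "tile38-leader.example.internal:9851"
		follower        = "tile38-follower.example.internal:9851"
		followerAOFSize = int64(4096) // placeholder; size of the follower's current AOF
	)

	lsum, err := aofmd5(leader, 0, followerAOFSize)
	if err != nil {
		// An out-of-range error here also means the follower's AOF is larger
		// than the new leader's, i.e. the histories have diverged.
		log.Fatal(err)
	}
	fsum, err := aofmd5(follower, 0, followerAOFSize)
	if err != nil {
		log.Fatal(err)
	}
	if lsum != fsum {
		fmt.Println("follower AOF is not a prefix of the leader AOF; a full resync is needed")
	} else {
		fmt.Println("follower AOF matches the leader AOF prefix")
	}
}
```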

Can you please try to replicate this behaviour with Tile38 1.32.2? @danwit-at-lytx