How does a node know whether it is the primary or "just a replica that is not connected to the primary"?

ccll commented 1 year ago

According to the docs, only replica nodes that are connected to a primary have the '.primary' file.

If the .primary file is unavailable then the local LiteFS node is either:

Currently the primary and can accept writes, or

Unable to determine or connect to the primary.

So my question is: how can a node tell which of these two cases it is in?
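A minimal sketch of the ambiguity, assuming the '.primary' file (when present) contains the primary's hostname and that the LiteFS mount directory is passed in by the caller. The function name read_primary_hint is made up for illustration.

```python
from pathlib import Path
from typing import Optional


def read_primary_hint(mount_dir: str) -> Optional[str]:
    """Return the primary's hostname if this node is a connected replica.

    Returns None when the '.primary' file is absent. Per the docs quoted
    above, None means EITHER this node is the primary OR it cannot
    determine/connect to the primary; the file alone cannot tell which.
    """
    primary_file = Path(mount_dir) / ".primary"
    try:
        # Assumption: the file holds the primary's hostname as plain text.
        return primary_file.read_text().strip()
    except FileNotFoundError:
        return None
```

A None result is therefore not enough to decide whether it is safe to run a migration on the local node.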

The scenario is database schema migrations, which I guess should run on the primary node, right? A migration is not a network request, so I can't simply redirect it to the primary node. I need a way to check on every node whether it is the primary and run the migrations only there.

For now I can curl the Consul API to see if the current node is primary, but that's not as convenient as a local '.primary' file. How about having this file on every node, including the primary?
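A rough sketch of that Consul workaround. The Consul KV endpoint (/v1/kv/<key>) and its base64-encoded "Value" field are standard Consul behavior. Everything else is an assumption to check against your own setup: the agent address, the lease key name, and that LiteFS stores a JSON object with a "hostname" field under that key.

```python
import base64
import json
import socket
import urllib.request
from typing import Optional

CONSUL_URL = "http://127.0.0.1:8500"  # assumption: local Consul agent
LEASE_KEY = "litefs/primary"          # assumption: must match your LiteFS lease key


def primary_hostname_from_kv(body: bytes) -> Optional[str]:
    """Extract the primary's hostname from a Consul KV GET response body.

    Consul returns a JSON array of entries whose "Value" is base64-encoded.
    Assumption: the decoded value is a JSON object with a "hostname" field.
    """
    entries = json.loads(body)
    if not entries or entries[0].get("Value") is None:
        return None
    value = json.loads(base64.b64decode(entries[0]["Value"]))
    return value.get("hostname")


def is_local_node_primary() -> bool:
    """Ask Consul who holds the lease and compare with this machine's hostname."""
    url = f"{CONSUL_URL}/v1/kv/{LEASE_KEY}"
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            body = resp.read()
    except OSError:  # covers HTTP 404 (key missing) and connection errors
        return False
    return primary_hostname_from_kv(body) == socket.gethostname()
```

If the hostname configured for LiteFS differs from the machine's hostname, compare against the configured value instead of socket.gethostname().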

ccll commented 1 year ago

Sorry, my bad. I just found out there is already a solution to the DB migration problem: https://github.com/superfly/litefs/issues/56

Thanks for the great work! I'll dig into that solution later.

ccll commented 1 year ago

Also I'd like to contribute some of my naive thoughts on the problem.

If we could force some node to be elected, or manually specify it as primary, then we could run the DB migration on the proper node.

One method that came to mind is to use the candidate setting. If LiteFS could live-reload its config file on SIGHUP, then we could transfer the primary role during rolling updates.

For example, say we have 5 v1 app nodes running and 1 of them is primary:

1. Start a v2 app node with candidate = true; it should join the cluster as a replica.
2. Mark all 5 v1 nodes candidate = false.
3. Send SIGHUP (kill -SIGHUP) to the v1 nodes so they live-reload the config (while still serving read requests).
4. The cluster elects the new v2 node as primary, as it is the only option.
5. Run the migration.
6. Roll out updates to the 5 v1 nodes gradually (with candidate = true for future failover).
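The first four steps can be modeled with a toy simulation. This is only an illustration of the proposal, not LiteFS behavior: the Node class and both functions are hypothetical, and a real election races for a lease rather than picking the first candidate.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    name: str
    version: str
    candidate: bool


def elect(nodes: List[Node]) -> Optional[Node]:
    """Toy election: the first candidate in the list wins."""
    return next((n for n in nodes if n.candidate), None)


def hand_off_to_new_version(nodes: List[Node], new_node: Node) -> Optional[Node]:
    """Add the v2 node, make every other node a non-candidate, then re-elect."""
    nodes.append(new_node)           # the v2 node joins as a replica
    for n in nodes:
        if n is not new_node:
            n.candidate = False      # config flipped and "live-reloaded"
    return elect(nodes)              # only the v2 node is still eligible


cluster = [Node(f"v1-{i}", "v1", True) for i in range(5)]
new_primary = hand_off_to_new_version(cluster, Node("v2-0", "v2", True))
```

After the handoff, new_primary is the v2 node, which is where the migration would run before the v1 nodes are replaced.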

Another method could be to let the user manually update the Consul KV to force a node to be primary, and force all LiteFS nodes to reconnect to this new primary. (I'm wondering if this already works, as I haven't dug into the code yet :)

Both methods would cause some write downtime while the primary switches, so this is just my 2 cents.

benbjohnson commented 1 year ago

@ccll I agree that forcing the promotion of a node would be ideal for migrations. There's one issue related to handing off the primary (#11) that's sorta related but I added another one (#299) around using the new litefs run command to force a promotion on candidate nodes so you can run your migration script.

Ideally, you'd need to deploy to one of your candidate nodes first so they can apply the migration changes immediately. However, it's best if your migrations can work with both the prior version of your application and the new version of your application. Distributed databases are a pain. :)
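As a small illustration of a migration that works with both application versions, here is a sketch using Python's built-in sqlite3 module. The users table and plan column are hypothetical. The idea is that an additive change with a default value keeps the old version's INSERT statements valid.

```python
import sqlite3


def migrate(conn: sqlite3.Connection) -> None:
    """Additive migration: a new column with a default keeps old INSERTs valid."""
    conn.execute("ALTER TABLE users ADD COLUMN plan TEXT NOT NULL DEFAULT 'free'")


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute("INSERT INTO users (name) VALUES ('alice')")  # written by v1 before migrating

migrate(conn)

# A v1-style write that does not know about the new column still succeeds.
conn.execute("INSERT INTO users (name) VALUES ('bob')")
# A v2-style write uses the new column.
conn.execute("INSERT INTO users (name, plan) VALUES ('carol', 'pro')")

rows = conn.execute("SELECT name, plan FROM users ORDER BY id").fetchall()
```

Dropping or renaming a column, by contrast, would break the old version as soon as the migration is applied.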

superfly/litefs issue #297