unbit / uwsgi

uWSGI application server container
http://projects.unbit.it/uwsgi
Other
3.45k stars 691 forks source link

Improve Legion Lord election algorithm #100

Closed unbit closed 11 years ago

unbit commented 11 years ago

The first purpose of the legion subsystem was ip-takeover. Infact the protocol is very similar to the carp one.

During the "future of the uWSGI clustering" discussions/rfc emerged that a better master ('lord' in legion subsystem gergo) election system is needed.

Inspirations are the paxos algorithm and the redis-sentinel implementation. Objectives are reducing split-brain cases and increase reliability.

unbit commented 11 years ago

Fix situations where nodes/members have the same valor:

Initial legion subsystem specs leave the case of nodes with the same valor undefined.

The new implementation will set a UUID (if uuid support is available) too. When 2 (ore more) nodes with the same valor pop-up the one with the higher UUID (using uuid_compare()) wins.

unbit commented 11 years ago

Quorum:

it is a pretty decent way to reduce split brain problems.

Each legion will expect (by default) a quorum of 1, that means node will not vote for election of a lord. A quorum value higher than 1 will enable the vote mode. Each node will send an additional announce with the choosen lord. When a potential new lord receive at least Q (quorum) announces for itself it will became the new lord.

unbit commented 11 years ago

Multiple lords ?

Are multiple lords useful in some specific context ?

prymitive commented 11 years ago

Quorum You described means that I need to have:

N >= Quorum nodes with --quorum --lord=nodeA

??

There is one great feature in uWSGI that I use - statelessness - only FastRouter nodes are "static", all backends can be moved around, added and removed without touching single line of configuration (and it should be easy to use multicast for subscriptions so I could have those made more dynamic as well). So I am not a fan of putting --lord=nodeA anywhere in my config files, I would prefer just to run some number of nodes and let them pick a lord. I understand that some clustering tools require me to input all nodes, but current legion can talk over multicast just fine and I really like that.

Maybe quorum could be implemented as the minimal number of joined nodes (like with cman and other tools)? So that if I use --quorum 3 than legion would not elect lord until there are at least 3 nodes sending announces.

prymitive commented 11 years ago

BTW - please add legion API so that plugins could interact with it easily. Right now if I don't want to reimplement (which is fancy term for copy&paste) all uwsgi_opt_legion(), uwsgi_opt_legion_hook() and uwsgi_opt_legion() logic I need to format a string with options for legion parameters.

unbit commented 11 years ago

Quorum means that a lord is not elected until Q nodes agree on it. This is required for avoiding (reducing) split brain, and it will be optional (by default no quorum is needed). Regarding stateless the relevant "new part" is the one allowing you to NOT specify the valor of a node. The valor will be somewhat randomic so you do not need to worry about that, and you will be sure only one node per-legion will run "things".

unbit commented 11 years ago

Regarding the api you only need to add the legion name in your options. In my mind a --legion-cron would be something like that:

--legion-cron foobar -1 -1 -1 -1 -1 my_command

that means, run the cron only if the node is the lord of the 'foobar' legion.

In your plugin you will have something like that:

run_cron(mycron) { if (uwsgi_i_am_the_lord(mycron->legion)) { uwsgi_run_command(mycron->command); } }

prymitive commented 11 years ago

Great, everything is clear now.

run_cron(mycron) {
if (uwsgi_i_am_the_lord(mycron->legion)) {
uwsgi_run_command(mycron->command);
}
}

This is pretty much what I was toying with in https://github.com/prymitive/uwsgi/blob/clustered_crons/plugins/clustered_crons/clustered_crons.c

unbit commented 11 years ago

Yes, i think you will be able to extremely simplify it as soon as the api is ready

prymitive commented 11 years ago

Could legion nodes also advertise ip:port of the uWSGI sockets they have to other nodes? This way new node would join legion cluster and once it's joined and the cluster is quorate, it could ask some other node (preferably master) for cache-sync. This would allow for cache syncing without the need to use --cache-sync with fixed node address.

unbit commented 11 years ago

this is part of the "legion scroll" system. You will find reference to it in the current code. Basically you can add a "blob" to each node with custom values.

--legion-scroll = mylegion "foobar=test,youraddr=127.0.0.1:4040"

it is still incomplete but will be ready for sure in time for 1.5 as it is the base for the upcoming clustered-emperor support

prymitive commented 11 years ago

Ok, scroll will be great addition.

Could we also consider allowing user/plugin to provide some code that will dynamically alter nodes valor? Either every N second or/and when legion cluster is altered (node joins or leaves legion). First one would allow master to jump around from time to time (when load on a node is too high we can move master to other, less loaded node) Second one would allow to spread cron tasks around during startup.

Without the ability to alter the valor, we might end up with all tasks on single node. If those tasks are memory or cpu heavy than we might kill that node. Minimum version would be using valor value computed by $(user provided valor) - $(number of legions I am master of).

unbit commented 11 years ago

altering the valor is pretty easy (all of the math is done every time from scratch in the legion subsystem so you can trick it in various way)

something like uwsgi_legion_set/inc/dec_valor(struct uwsgi_legion *legion, uint64_t new_valor);

should be good (and without the need of locking as we are changing a single numeric value)

prymitive commented 11 years ago

Is "legion scroll" going to be in 1.9 or later? Once scroll is ready and there is a way to sync uWSGI cache from legion master node, than we will have basic legion crons plugin. I'm not rushing or anything, I currently use redis script for locking jobs and I won't move from it anytime soon, but others might find that useful and start if so, we will get few extra legion testers ;)

unbit commented 11 years ago

it should be (eventually will be in 1.9.1)

prymitive commented 11 years ago

I don't think there are any issues with current legion implementation, per run uid feature is present and working, scroll is ready and working. Closing as resolved

prymitive commented 11 years ago

There are 2 things that qualify for this issue that I would like to discuss, so I'll reopen it to have discussion in one place.

I'm thinking about replacing keepalived with legion, for virtual ip management on 2 of my fastrouter nodes. I was thinking about what could go wrong and as always split brain can occur (as with all clustering tools and only 2 nodes), so maybe we could add legion node(s) that can't become The Lord and use quorum=2 (or bigger, with multiple arbiters).

What this would give us - if one node can't communicate with the other currently there is no way to check if this is due to issue one local or remote node. But if local node can talk to arbiter nodes and quorum is reached, then we can make a decision who should become a lord.

Second thing that could help - health checks - run something, if it fails lower the valor of local node or prevent it from becoming The Lord, example:



if we can't ping router than network might be down on local node, don't become the lord.

`--legion-valor-check="50 killall 0 redis-server`

if redis-server is not running (and we use it) lower the valor value by 50 "points".
unbit commented 11 years ago

+1 for the arbiter, i will work on this soon after 1.9.7

Regarding the dynamic valor, i am thinking about changing it directly from the apps, so you can have a mule making all the checks you need without spawning a process every time.

unbit commented 11 years ago

The support for arbiters has been added. Just add node with a valor of 0 (they will be reported in the logs as "arbiter" instead of "node")

prymitive commented 11 years ago

Changing valor for mules is ok, one can add checks there with uwsgi timer or cron API, but I think there is use case for more direct manipulation of nodes valor.

Example:

I need to do some maintenance on lord node and I would prefer to move the lord to another node before stopping lords uWSGI instance (in mongodb I would call rs.stepDown() to do so), so it would be nice to have command line option for setting valor of any running legion instance. uwsgi --legion-set-valor mylegion 30 -s localhost:uwsgi_port

I would need to talk to local instance, probably over uWSGI socket (?), it would then set valor to 30, that would trigger new election (or we need to force update in legion cluster).

This would allow me to more gracefully stop lord node.

prymitive commented 11 years ago

We could also add something like stepDown() into uwsgi_reload(), when I need to stop the lord, I would like it to first pass its lord status to other node in as much graceful way as possible, so in case of virtual IP there is as little downtime as it can be.

I'm not sure what happens when uWSGI instance joined to legion dies/reloads, does it send any message to other nodes that it should be marked as dead? Or does it dies and other nodes thinks it's up until legion-tolerance is exceeded?

It looks like the second case, it has pros and cons:

we can quickly reload instance without other nodes in the legion noticing it (in case valor is different and runtime uuids are not used for lord election) - in such case lord title stays on the reloaded node

if node will not reload correctly (upgrade that went wrong) node will not get up and so if it was the lord we need to wait until other nodes will notice that it is down

I think it's just a matter of (optionally) sending announce with valor = 0 when we need to reload, we can make it an option so user will pick desired behavior.

(?)

unbit commented 11 years ago

Ok, i think it is time to deal with it. What to do when an instance is reloaded (or stopped) ? i suppose we need to allow users to choose what to do for each legion. In some case it could be good to leave asap, in others it would be better to wait for the whole cluster to choose the best lord

unbit commented 11 years ago

ok, i think we can close it. Latest commit send a "death announce" whenever the server exit or it is reloaded. There is no need to allow the user to choose it, as the uuid of the node is changed on startup, so effectively the old node is always lost