Flock is a lightweight process registry and forgiving supervisor library for Erlang/Elixir applications.
This project is part of the SpawnFest 2017 contest, a 48hs competition so it could contain unimplemented features, surprises and/or bugs.
If available in Hex, the package can be installed
by adding flock
to your list of dependencies in mix.exs
:
def deps do
[
{:flock, "~> 0.1.0"}
]
end
Documentation can be generated with ExDoc and published on HexDocs. Once published, the docs can be found at https://hexdocs.pm/flock.
A flock is a group of birds and birds live in nests
Flock lets you spawn processes (birds) and decides on which node (nest) to run them based on a consistent hash mechanism (a hash ring), spreading the load over the cluster nodes and respawning those processes on the right node in case of node failures.
When resources are scarce birds tend to migrate
When a node joins or leaves the cluster the hash ring is rebalanced and processes are migrated to the corresponding available nodes. As a consistent hash is used, only some processes will be moved from the node. This is, the existing process are killed and restarted somewhere else in the cluster. No handoff mechanism is implemented.
If you love them set them free...
The spawned processes are NOT linked to their fathers. You can communicate with them by a required name. This name is only valid inside Flock (it is not registered on the BEAM).
Can your son distinguish his pet canary from another?
Flock processes are supervised locally on the node. If a process ends abnormally, the supervisor will restart it. The newly started processed will have the same name but it will not have the same state than it's predecessor. This issue may be solved by recreating the process state from an external source of truth.
Dig a hole in your back yard
When a process ends gracefully (with a :normal
exit status) it is
removed from the set of alive processes.
If you know the name of a bird you can call it
Flock provides call
and cast
much like the one in the
GenServer
. The request is routed to the correct node and then to
your process.
If the canary is gone, get out of the mine
You can tell Flock to deliberately kill a process with stop
.
Two flocks form a bigger flock
A big issue in distributed systems is how to react to network partitions. Flock favors availability over consistency, i.e. it stands on the AP side. If the cluster is under a split-brain situation Flock will run one instance of each process on each side of the partition. Processes created on one side of the partition will not be started on the other side. The same goes for stopping a process.
When the cluster is healed the processes from each side are merged. Afterwards, only one instance of each process is run.
A partition or a re-grouping adds/removes nodes to the cluster, so at this point the processes are rebalanced with the new cluster state.
Yes, it is a very hard problem to solve, but we have relaxed some guarantees:
Processes are not remotely monitored (cross node) but locally.
Starting a process only guarantees that the local node is aware of your request. The process is NOT started after the call. Eventually the new process will be known to all nodes and the node responsible for that process will start it.
Stopping processes has the same behavior than starting them, the process is NOT stopped after the call but it will eventually be stopped.
You never talk directly to your process but thorough Flock. This may add some overhead.
Processes may resurrect or out-live a stop. As changes in the cluster are transmitted with eventual consistency, some information may be lost. For example if you start a process and your node goes down, it may have not communicated those changes to other nodes, so the process won't be started in any node. The same goes for stopping or after a graceful exit.
Processes are assigned to nodes using a consistent hash ring. We are using libring library to this end.
Flock is uses disterl for cross-node communication. Node discovery is provided by the libcluster library also by bitwalker so double thanks to him.
A CRDT is used to keep track of alive processes. The CRDT allows to locally merge the alive processes information of a partition and converge to a consistent result avoiding the need for consensus. Flock uses an Add-Wins Observed/Removed set. The current implementation is a Home-baked less-than-ideal state-based AWORSet.
Testing Flock is easy. You need to have Elixir 1.5.2 installed on your system.
We provide an example Makefile
to test it.
You will have to open as many terminal sessions as nodes you want to test.
In our case we will try it running:
make node1
on one terminalmake node2
on another terminalmake node3
on a third onemake run
on a fourth terminal. This last command will spawn 100 processes
(then number can be changed by running make run num=10
) and balance them
on the cluster made up by those 4 nodes.For calling those processes you can run make call
which will call the local
and remote processes and tell where they are running.
And example output for num=10
is:
19:21:33.165 [debug] bird bird:1 replied :"node1@127.0.0.1"
19:21:33.170 [debug] bird bird:2 replied :"node2@127.0.0.1"
19:21:33.175 [debug] bird bird:3 replied :"test@127.0.0.1"
19:21:33.175 [debug] bird bird:4 replied :"node1@127.0.0.1"
19:21:33.175 [debug] bird bird:5 replied :"node2@127.0.0.1"
19:21:33.179 [debug] bird bird:6 replied :"node3@127.0.0.1"
19:21:33.179 [debug] bird bird:7 replied :"node2@127.0.0.1"
19:21:33.179 [debug] bird bird:8 replied :"test@127.0.0.1"
19:21:33.179 [debug] bird bird:9 replied :"node1@127.0.0.1"
19:21:33.184 [debug] bird bird:10 replied :"call@127.0.0.1"
You can close any of the terminals and you should see the processes that were running there being spawned on some other node.
The node reports basic statistics like node node3@127.0.0.1 has 20.0 % of the load (2 out of 10)
.
The example processes (MyBird
) chirp every 10 seconds showing a message to
keep track of where they are running like:
19:26:01.561 [info] bird bird:6 running on node node3@127.0.0.1
Flock is GenServer
-friendly, you can start/call/cast/stop any GenServer without
any change. Supose you have a GenServer
:
defmodule MyBird do
use GenServer
require Logger
def start_link(args),
do: GenServer.start_link(__MODULE__, args, [])
def init(name) do
Process.send_after(self(), :chirp, 10_000)
{:ok, name}
end
def handle_call(:ping, _from, s) do
{:reply, :pong, s}
end
def handle_cast({:please_reply_me, pid, msg}, s) do
send(pid, msg)
{:noreply, s}
end
def handle_cast(:byebye, s) do
{:stop, :normal, s}
end
def handle_info(:chirp, name) do
Logger.info("bird #{name} running on node #{node()}")
Process.send_after(self(), :chirp, 10_000)
{:noreply, name}
end
end
you can spawn on process by doing:
:ok = Flock.start({MyBird, ["Tweety"], "Tweety"})
This will start that process on some node after some time (eventually consistency rocks).
Then you can call that process by name like:
:pong = Flock.call("Tweety", :ping)
If the process ends abnormally it will be restarted by the local or remote supervisor. If the process ends normally it will be removed from the set of alive processes.
Is Flock the same than using a Supervisor? NO, we do not provide the same guarantees, processes are not linked to their fathers therefore spawned processes can be re-started (due to balancing or errors) without the father even knowing about it. This implies that if the processes are stateful, they must re-create that state from an external source.
Our use case for Flock is the following: We have users connecting and disconnecting to our system. When a user connects, a new process is spawned that is in charge of streaming information to that user (a ticker for example). Those processes are pretty much independet of each other and from the process that spawned them.
First, we want to balance those processes over a (small) number of nodes. This is provided by the hash ring.
Second if a node goes down we want to keep streaming data to connected users from another node. The process state is re-built from an external database.
Third, having Flock lets us dinamically add or remove nodes depending on the load of the system and also to do software upgrade node by node.
Complete (not really) documentation of the code can be viewed on [https://spawnfest.github.io/flock/]
Improve the CRDT [https://github.com/spawnfest/flock/issues/9]
Include multi-node tests using ExUnit
.
Set the timeout for the anti-entropy as a config.
Do some benchmarking comparing remote calls against local calls and against direct erlang messaging.
Brought to you by: