nats-io / nats-server

High-Performance server for NATS.io, the cloud and edge native messaging system.
https://nats.io
Apache License 2.0

Support durable messages for disaster recovery #73

Closed thedrow closed 9 years ago

thedrow commented 9 years ago

Most message queues support durable messages for a good reason: if something happens to a machine, you can turn one of the replicated nodes into a primary node (the process is known as failover and, if I understand correctly, is already implemented in gnatsd). I suggest that a node in a cluster can be in one of two modes: primary or slave. Slaves are write-only: clients are unable to connect to them, which allows the node to dedicate its CPU and I/O resources completely to writing as many messages to disk as possible. Primary nodes are not durable and cannot be made durable, which allows them to dedicate all of their resources to receiving and sending messages.

When a primary node fails, one of the slave nodes that replicates it is promoted to primary. This means that messages will no longer be durable, which is desired for the reasons stated above. When the old primary comes back to life, it becomes the slave of the newly promoted node.

The election of the primary node can be done using a consensus algorithm like Raft.

If a cluster does not have any slave nodes, a big red warning should appear, which can be disabled using a configuration option.
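
To make the proposed role swap concrete, here is a minimal Go sketch of the failover flow described above. Everything in it is hypothetical: `Role`, `Node`, `Cluster`, `Failover`, `Rejoin`, and `HasSlave` are names invented for illustration and are not part of gnatsd. Picking the first available slave is only a stand-in for a real election, which would use a consensus algorithm such as Raft.

```go
package main

import (
	"errors"
	"fmt"
)

// Role is a hypothetical node role; the names follow the wording of the proposal.
type Role int

const (
	Primary Role = iota // serves clients, does not persist messages
	Slave               // persists replicated messages, accepts no clients
	Down                // failed or unreachable
)

// Node and Cluster are illustrative types, not gnatsd types.
type Node struct {
	Name string
	Role Role
}

type Cluster struct {
	Nodes []*Node
}

// Failover marks the current primary as down and promotes a slave.
// Assumption: the first slave found wins; a real system would elect
// the new primary through consensus instead.
func (c *Cluster) Failover() (*Node, error) {
	var promoted *Node
	for _, n := range c.Nodes {
		if n.Role == Slave {
			promoted = n
			break
		}
	}
	if promoted == nil {
		return nil, errors.New("no slave available to promote")
	}
	for _, n := range c.Nodes {
		if n.Role == Primary {
			n.Role = Down
		}
	}
	promoted.Role = Primary
	return promoted, nil
}

// Rejoin brings a recovered node back as a slave of the current primary.
func (c *Cluster) Rejoin(n *Node) {
	if n.Role == Down {
		n.Role = Slave
	}
}

// HasSlave reports whether any node is still persisting messages.
// A false result is the condition that would trigger the proposed warning.
func (c *Cluster) HasSlave() bool {
	for _, n := range c.Nodes {
		if n.Role == Slave {
			return true
		}
	}
	return false
}

func main() {
	a := &Node{Name: "a", Role: Primary}
	b := &Node{Name: "b", Role: Slave}
	c := &Cluster{Nodes: []*Node{a, b}}

	p, err := c.Failover()
	if err != nil {
		fmt.Println("failover failed:", err)
		return
	}
	fmt.Println("new primary:", p.Name, "durable:", c.HasSlave())

	c.Rejoin(a)
	fmt.Println("after rejoin, durable:", c.HasSlave())
}
```

In this sketch the two-node cluster loses durability between the failover and the rejoin, which is exactly the window in which the proposed warning would fire.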

I hope I did not miss anything.

krobertson commented 9 years ago

The design of NATS is specifically the opposite of things like guaranteed delivery and durable messages. With NATS, those kinds of needs are placed upon the publisher and consumer rather than handled by the transport.
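
As a rough illustration of what placing those needs on the publisher and consumer can look like, the Go sketch below has the publisher retry until it receives an acknowledgement and the consumer ignore duplicates by message ID. It deliberately does not use the NATS client library: `LossyTransport`, `Consumer`, `Message`, and `PublishWithRetry` are hypothetical in-memory stand-ins for a fire-and-forget transport, assumed here only to show the pattern.

```go
package main

import (
	"errors"
	"fmt"
)

// Message carries an application-level ID so the consumer can deduplicate.
type Message struct {
	ID   int
	Body string
}

// Consumer processes each message ID at most once and acknowledges every delivery.
type Consumer struct {
	seen      map[int]bool
	Processed []string
}

func NewConsumer() *Consumer {
	return &Consumer{seen: make(map[int]bool)}
}

// Handle returns true as the acknowledgement; duplicates are acked but not reprocessed.
func (c *Consumer) Handle(m Message) bool {
	if !c.seen[m.ID] {
		c.seen[m.ID] = true
		c.Processed = append(c.Processed, m.Body)
	}
	return true
}

// LossyTransport is a hypothetical stand-in for a transport with no delivery
// guarantee: it silently drops the next DropNext messages.
type LossyTransport struct {
	DropNext int
	Consumer *Consumer
}

// Send reports whether an acknowledgement came back.
func (t *LossyTransport) Send(m Message) bool {
	if t.DropNext > 0 {
		t.DropNext--
		return false
	}
	return t.Consumer.Handle(m)
}

// PublishWithRetry resends until acknowledged or maxAttempts is reached.
// It returns the number of attempts made.
func PublishWithRetry(t *LossyTransport, m Message, maxAttempts int) (int, error) {
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if t.Send(m) {
			return attempt, nil
		}
	}
	return maxAttempts, errors.New("no acknowledgement received")
}

func main() {
	consumer := NewConsumer()
	transport := &LossyTransport{DropNext: 2, Consumer: consumer}

	attempts, err := PublishWithRetry(transport, Message{ID: 1, Body: "hello"}, 5)
	fmt.Println("attempts:", attempts, "err:", err)

	// A redelivery of the same ID is acknowledged but not processed again.
	transport.Send(Message{ID: 1, Body: "hello"})
	fmt.Println("processed:", consumer.Processed)
}
```

The transport stays simple and fast, while the retry budget and the deduplication policy are decisions each application makes for itself.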