How to avoid SPoF? - Githubissues

SWSAmor commented 4 years ago

Is there anyone use queue in production? If yes please help me how to do that! One queue instance is a single point of failure. So we need 3 or more. But with Tarantool master-master replication maybe they won't be in sync - based on what i read about. Inconsistent tubes sounds very bad to me. I read about vshard and Avitos autovshard but even if it's good for failover than I have to fork this repo and replace all of the DMLs to the vshard.router.call. ? I'm very uncertain about that. I appreciate any guidance.

Totktonada commented 3 years ago

Sorry for the long delay.

The way to go is to setup replica that will do just box.cfg(<...>) and fetch data. When the old master is going to be a replica, it should be restarted to stop processing tasks. When the old replica is promoted to master it should wait for all data from the past master (https://github.com/tarantool/tarantool/issues/3808) and call require('queue') to start processing tasks on this instance.

I know, this looks weird. I hope we should implement #60 instead.

LeonidVas commented 3 years ago

IMHO, this only be done with qsync(AFAIU this is Alexander's proposal), but in this case you will degrade the performance of the "queue". It seems like this implementation of "queue" is not designed for this purpose.

Totktonada commented 3 years ago

IMHO, this only be done with qsync(AFAIU this is Alexander's proposal), but in this case you will degrade the performance of the "queue". It seems like this implementation of "queue" is not designed for this purpose.

Nope. And, as you correctly stated, it would lead to significant performance degradation (at least in latency).

A project, where I participate in the past, uses queues for different purposes and each microservice has a replica. In fact, we don't even have automated wait_lsn() and doing it manually. An instance is always started either as replica (with bare minimum of Lua code) and as a master (with full application) and becomes in this state until restart.

It was okay for us, because even with manual switchover we have enough availability to be under required SLA.

The project lands to production with tarantool-1.7.6.47 first time. There was no synchronous replication in tarantool at the moment.

SWSAmor commented 3 years ago

Thank you the answers. It's a financial institution so the manual switchover isn't enough. Async also not. We mustn't lost any transaction. We have service mesh with Consul. Two sites and every service is equal and duplicated per site except queue. Queue is the SPoF. That is what need to be solved. By the way the queue is working flawlessly for months so that is rather theoretic, auditing or SLA problem. But everything can fail. :-)

Mons commented 2 years ago

Exceptionally durable queue may be prepared with synchronous replication and quorum failover (raft). But that's requires at least 2.8.2. @Totktonada, possible an example of such config should be attached

Totktonada commented 2 years ago

@Mons I would start from the manual switchover (without instance restart). I think about this a bit and, it seems, there are quite large amount of tricky details. Even without full automation.

Let's consider scope of this issue as defined below.

Provide a way to opt out of immediate tasks processing after require('queue'), so it may be called on a replica. Or even always require explicit queue.start().
Provide a way to gracefully stop processing of tasks: queue.stop().

It should be enough to switch master-replica manually when there is no parallel processing of tasks on two or more instances ('master-replica' case[^1]) — without restart. Let's consider #60 as a task about 'true master-master'.

[^1]: Technically the replication may be configured as bidirectional, but there is a property (usually box.info.ro), which defines a leader, and only one leader at max may exists in a replicaset.

The queue.stop() should forbid to take tasks. TBD: Think re (un)burrying, delaying and other operations. I guess everything except ack and release should be forbidden, but may be there are valid use cases I don't aware of.

I think that queue.stop() should give a change for taken tasks to be acked (wait till some timeout) and forcibly return the tasks into the 'ready' state if the timeout expired. I think we can lean on task's ttl / ttr and global ttr here. We can also provide a timeout option: queue.stop({timeout = <...>}): it is possible that someone, who didn't bother about timeouts before, will stuck on queue.stop() at the moment when (s)he needs to switch master and replica.

Take care to the following case: someone calls queue.stop(), found that it stucks, run another console and calls queue.stop({timeout = 1}). Both calls should return after one second. More generally: all currently executed queue.stop(<...>) calls should return, when one of them (with minimal timeout) returns.

Or... a client that uses queue's session mechanism may ack a task that was taken on a previous master. Should we expect from a client that it'll reconnect to the new master (possibly after attempt to ack on the past master) and ack the task there? If so, we should allow to distinguish 'I'm not a leader' error from all others and should not autorelease tasks. Don't know, need to think here. Maybe we should support both flows. If we'll support both, whether to choose 'whether to autorelease' automatically based on knowledge about authenicated and non-authenticated task owners or add an option to queue.stop() to release all tasks before return, including ones from authenticated owners? Or both?

The manual switchover scenario will be the following:

Call queue.stop() on master and wait until it'll return.
Wait until all data will land to a replica (all replicas): monitor box.info.vclock and box.info.replication.
Demote the old master to replica: box.cfg{read_only = true}.
Promote the old replica (one of replicas) to master: box.cfg{read_only = false}.
Call queue.start() on the new master.

If we'll change the queue module to don't start tasks processing automatically after require('queue'), it will be incompatible change (all current users should add explicit queue.start() call into its code). I would think, whether we can keep backward compatibility here.

Should queue.start() wait until full box loading if it is called before first box.cfg()? Should it wait until the instance will become writeable if it is the read only state now? I guess that both answers should be 'no', because it looks error-prone.

Questions to think around:

Can we automate the switchover using the RAFT leader election algorithm for asynchronously replicated queues?
How about synchronously replicated queues with RAFT?

I have no clear vision, how it should be implemented to ensure correct automatic switchover.

Totktonada commented 2 years ago

However, maybe, just automatic that leans on ro/rw status would be even simpler and will cover most of cases. I need to think here.

0x501D commented 2 years ago

Proposal:

Create internal queue states: [S]tartup | [R]eady | [E]nding | [W]aiting

S: (state changed to RW) wait for maximum upstream lag * 2 (quiet for clients) start driver (extend drivers api required).
R: run as usual
E: (state changed to RO) stop driver.
W: do nothing (monitor RO/RW status)

other notes: implement per-task ttl.

Totktonada commented 2 years ago

The proposal sounds ok. I would like the following points to be clarified as result of working on the patch (it should be part of the documentation):

How tasks taken in a shared session change its state, when we're going to RW, at startup? Wait until a global TTL and released then? Are there other ways aside take() to associate a task with a shared session? And the opposite: how tasks that are NOT associated with a shared session change its state. Released immediately at going into RW (because there is no way to release them, the old session is gone)?
Which task states are 'immutable' for RW->RO->RW and which are not.

If you can present such information in some friendly form (some diagrams, maybe), it would be nice.

If you'll find more points around user visible behaviour (which should not be vague), let's add more questions here. If you'll define answers for given questions before the patch will be ready, let's share here as well.

LeonidVas commented 2 years ago

Proposal:

Create internal queue states: [S]tartup | [R]eady | [E]nding | [W]aiting

S: (state changed to RW) wait for maximum upstream lag * 2 (quiet for clients) start driver (extend drivers api required).

R: run as usual

E: (state changed to RO) stop driver.

W: do nothing (monitor RO/RW status)

other notes: implement per-task ttl.

Hi! The proposal is ok to me.

General clarification: As far as I understand the author of the task wanted something like synchronous spaces (not to lose a single task in the event of an instance failure), but we implementing the possibility of asynchronous replication (and switching to a replica) with the possible loss of some information when switching.

Several comments:

Maybe [R] - Running
I think status can be saved in the queue.cfg
ttl - ttr https://github.com/tarantool/queue#set-queue-settings
I not see the reasons to do the ttr per tube/task in the scope of this task (I remember that this is my proposal). It looks like we can do it later if necessary. I think this functionality if usually either used or not.
I propose at the first stage to release all tasks before the switching.
Does a driver will have some indicator of "useable when switching"? I mean, some customer "drivers" may not be prepared for such operations.
The "ttl" fibers and "queue_expiration_fiber" must be stopped before the switching.

LeonidVas commented 2 years ago

If you can present such information in some friendly form (some diagrams, maybe), it would be nice.

I think it is better to use plantuml as it is already used in the queue documentation.

0x501D commented 2 years ago

stateDiagram-v2
direction LR
w: Waiting
note right of w
  monitor rw status
end note
s: Startup
note left of s
  wait for maximum
  upstream lag * 2
  release all tasks
  start driver
end note
Init --> s
r: Running
note right of r
  queue is ready
end note
e: Ending
note left of e
  stop driver
end note
w --> s: ro -> rw
s --> r
r --> e: rw -> ro
e --> w

0x501D commented 2 years ago

Limitations on this solution:

queue will not serve tasks actions till Startup state finishes.
task that have been taken and no acked/deleted can be retaken from replica that became master.

tarantool / queue

How to avoid SPoF? #120