Mechanism to choose leader that does not involve ops / modifying file

microsoft / FluidFramework

Library for building distributed, real-time collaborative web applications

https://fluidframework.com

MIT License

4.67k stars 523 forks source link

Mechanism to choose leader that does not involve ops / modifying file #6544

Closed vladsud closed 2 years ago

vladsud commented 3 years ago

There is a need to have some kind of structure / mechanism that could be used to do leader election kind of activity (maybe generalized to do more, but there is currently no need for that) without changing a file, i.e. without generating ops.

Scenario - application need to allocate single agent that can do some work on behalf of all users in a document (or run document centric tasks) and potentially commit certain changes to document if task results in changes to document. However if such task does not result in changes to document, no changes are done, thus not causing user who is only viewing document become the one who last edited the document.

Due to these requirements, such mechanism can't rely on quorum, as quorum works only for "write" connections, and thus it requires changes to a file (direct or indirect) do to join op generated on behalf of client. This might be a good point in time to reconsider join/leave/noop being attributed to a user and reasons for "read" connection mode existence (things changed from introduction of "read" connection mode, i.e. frequency of noops is much lower nowadays than it used to be, and all these ops could be made not to be attributed to a user).

But assuming we are not modifying other areas, such mechanism would need to be based solely on signals, and likely - based on Audience.

Related teams thread

prerna21 commented 3 years ago

Some more details around the scenario-

All the members present(in Audience) may or may not be equally eligible to become leader. And this eligibility can be revoked/granted from/to any client in between the session. All eligible candidates should try to become leader and one should be able to pick the task. If the leader's eligibility is revoked during session, he must be able to release the task so that some other eligible candidate can pick it. As soon as the leader is disconnected from runtime, others should get to know that leader is disconnected so that rest of the audience can try to become leader.

anthony-murphy commented 3 years ago

merging dup #6556

With in the fluid framework there are a number of mechanisms to create consensus for tasks, leader election, and summarization like dds, agent schedule, oldest client, etc.

However, all these methods require clients to be in the quorum, or more generally require them to be on a write WebSocket connection. This creates a problem, as by default the runtime connects via a read connection. Read connected clients are only in the audience, and not the quorum. Currently it is not feasible to create reliable consensus via the audience, as the signals sent to manage the audience are fully decoupled from the op stream so no reliable ordering can be created.

This creates numerous problems

Hard to write code to elect leader, or pick tasks without forcing clients into write mode.
Trailing ops do not get reliably summarized
Possible solutions

Order the Audience

A naive solution could be to order alphabetically by audience client id, but this is not reliable, as new client could join at any moment with a higher id. However, if audience had some connection to the op stream, like a ref seq added at connection time, we could create a reliable ordering based on that and using alphabetical sort as a tie breaker, as new clients will have a higher ref seq.

Client connect write by default

The default read connection was added to reduce server and storage cost, primarily around no-op, and join/leaves. There may also be other server related optimizations that further reduce the cost of read clients.

Server enforced "write" client

Similar to ordering the audience, but have the server pick an eligible client and promote it to write mode, and ensure there is always at least one client in write mode if there is a client eligible for write.

vladsud commented 3 years ago

Some additional thoughts on the subject: Given task selection example (which requires selection of qualified client, not just any client), it's better to break problem into two:

selection of the leader, where leader is not the one who does actual work, but the one who is responsible for coordination and assignment of work.
Actual algorithm to distribute work, which will likely be domain specific.
- Such algorithm needs to be stable, i.e. when it comes to tasks distribution, we need to ensure that changes in leadership do not impact (most of the time) node that was chosen to perform a task.

This makes system more generic and allows flexibility. This is also similar to how other algorithms approach it, like ZooKeeper algorithm. And it also allows selection of different algorithms for # 1. For example scenarios where only "write" clients are involved, we can chose oldest in quorum solution for # 1, while keeping generic version of # 2.

# 1 (for audience) can be achieved via sorting clientIds in audience and picking one (lowest / highest). It would require some sort of confirmation, as there are no guarantees RE quorum ordering updates (i.e. two clients can see different order of clients appearing or disappearing from audience), so two clients may believe they are leaders at the same time. ZooKeeper relies on quorum and fixed number of nodes, whereas in our case nodes come come and go, so achieving correctness might be challenging

# 2 can be achieved in similar way as ZooKeeper learns about committed changes - by asking all clients if they are already running tasks, and either keeping current assignments or reassigning tasks. This is much easier problem as in the worst case task would migrate to another node in case of conflicts.

vladsud commented 3 years ago

Putting it in July to make progress on design.

vladsud commented 3 years ago

Poking a bit around guarantees that server provides RE signal ordering (or rather lack of any guarantees), I believe there is no way to arrive at consensus RE leader election. In other words, any solution we will go with will end up with multiple leaders being elected at the same time when races happen. Even adding sequential numbers to clients joining audience (i.e. each join results in atomic increment of a counter and communication of that counter to client) does not help as there will be gaps and we do not know what was the number of last client that left audience. Even adding sequential IDs to all signals does not help, as there is no serialized state to work with. I.e. this sequence is possible:

No clients in audience.
Client A joins - receives empty initial audience.
Client B joins - receives empty initial audience.

For the time frame while clients A & B do not know about each other, they will assume that each of them is a leader. Sending signals after join does not help, as there is no guarantee on order of them arriving to other clients or hearing ack back.

So if we do build a solution here, it should be very clear that leader election is probabilistic. We could use delays to decrease races (i.e. potential leader would wait some time before it acts, thus reducing number of cases where there are multiple leaders to a problem - there is no leader), but races are not going away.

This on the surface is not that much different from promises we have today with leader election for "write" clients given that we have lack of transactions, i.e. a client may submit a bunch of ops (changes to a file) and then lean that it was disconnected (and thus it was not a leader), but such ops would make it through on reconnect. This essentially means that we could have multiple leaders at once. The difference, however, is that we could add transactions (in some form) to address that problem, while signals do not have any solution to guarantee single leader election.

Once leader is selected, it can get data on clients currently performing tasks and assign any unassigned tasks based on info it has (and keep doing it as time goes by).

I think the biggest question here - do we want to be in business of delivering and owning such primitive.

anthony-murphy commented 3 years ago

I think the only way it works is by adding min seq on the signals, and not picking a leader until it moves forward, either naturally, or by a server generated no-op. That gets us to the same place as a leader based on oldest client in the quorum, as the no-op works as a barrier on signal join. I don't think we need a barrier on leave. It is also batched.

Another related problem we have is that read client don't summarizer, so the above can lead to a large op tail of just no-ops.

i agree it may not be something we want to do.

In general, i think i prefer write by default for clients that support it, but i know that has server implications. I view server enforced write client(s) as a subset of this. They both have the advantage that we don't have to deal with consensus amongst read clients

DLehenbauer commented 3 years ago

I think we can relax the original 'read-only' requirement a bit since to do useful/observable work, the leader will generally need to be a writer. If the clients can (eventually) predict the leader using local information, a scheme like the following might work:

The client who believes itself to be leader switches to write mode and uses a consensus-style DDS to "claim" leadership.
- Occasionally a stale client might also attempt to claim leadership, but will discover its mistake as it waits for ACK, making this benign.
Read-only clients signal the client they believe to be leader/writer to queue work items on their behalf.
After signaling the (probable) leader/writer, read-only clients wait for confirmation by:
- Observing their work item was successfully queued, in which case they are done.
- Observing that leadership has changed, in which case the read-only client retries with the new leader.

Since a sufficiently stale client might observe multiple leadership changes, enqueuing work items needs to be idempotent to handle the following sequence:

A read-only client signals 'A' to enqueue a work item.
While awaiting confirmation:
- The read-only client sees 'A' has become ineligible. The client retries with the new leader 'B'.
- The read-only client sees 'B' has become ineligible and that leadership has moved back to 'A'.

At this point it is unclear if the original signal to 'A' was sent before or after leadership moved back to 'A'. The simplest solution is to always retry and make enqueue an idempotent operation, at least within the collab window. This also makes it possible to have a pragmatic backstop of retrying based on a timer.

Some possible optimizations:

Clients who happen to be writers can queue work items directly.
Client who frequently queue a lot of work items (based on a heuristic) might be promoted to writers.
Leaders can defer reconnecting as writers until there is work to process.
A writer who is no longer leader, but mistakenly receives a request to queue a work item, could just go ahead and queue the work item anyway to shorten the loop.
We could (optionally) support multiple leaders if the work queue is parallelizable.

vladsud commented 3 years ago

Thanks @DLehenbauer & @anthony-murphy !

@prerna21, I think we would want your team to take it from here, for starters - resolve requirements, possibilities, and possibly - build actual solutions. I see couple directions (some are not exclusive, i.e. # 1 & 2 & 4 can be done together):

Some users connected to document need to switch to "write" mode, and thus unavoidably changing last edited / who last edited file.
- we can optimize this via various means (including discussion above) on how many users in session need to do it, but given most common case of one user in collab session, we should assume any user might get into such case and be surprised that he / she "modified" a file.
We can make join/leave/noop ops not be attributed to a user, allowing calculated leader (via quorum) to do task assignment (using signals), i.e. not generate extra ops and thus not register this user as the one modifying a file. "Sharepoint app" would be the one changing file, and its modified time would change as well. User might be less surprised by this than in # 1.
A probabilistic tool is built to select a leader using signals. It will not be able to guarantee single leader election at 100% probability, more like 99.99% probability (number can be adjusted by waiting to have any number of 9s). This might be used to drive the rest of the pipeline (and task assignment).
- most likely we would prefer such capability to be built outside of Fluid Framework repo, at least for starters (maybe as experimental package in Fluid repo). This kind of tool is a bit dangerous to have as readily available tool as it's hard to ensure developers would fully understand limitations before jumping and using it.
There are solutions outside of Fluid, like a dialog (with checkbox "remember and do not show it me again" asking user if it's Ok to sync tasks and have byproduct of modifying a file).

All but 4 would also require building task assignment logic using signals. I.e. leader is only coordinating task assignment, but task assignment process can be used to select the right client for the job (this assumes that no every client can do jobs). This capability does not exist today. Task assignment can be done predictably (i.e. 100% correct) if leader election process guarantees no more than one leader at a time (this excludes # 3). Otherwise 2+ clients can believe they are assigned given task.

kenluck2001 commented 3 years ago

Race condition when multiple leaders are inadvertently selected at once are rare. We can have multiple leaders, if we have tie for sequence number of the agent. We can prevent some clients from participating in the election, if they are in "read" permission only, and have to only include with write permissions or who have write after some time.

On a theoretical level, we can reduce round trip by preparing once and pipeline accepts as seen in this lecture slide Lecture_11_Leader-based_Sequence_Paxos.pdf. I am not familiar with practical distributed systems, just a few theory here and there. @vladsud

vladsud commented 2 years ago

Closing old issues. That said, with ODSP recent changes (of ordering join & leave signals), we can make it work now! Just need FRS to follow the suit!

https://dev.azure.com/fluidframework/internal/_queries/edit/750/?triage=true tracks same work (though specific usage).

microsoft / FluidFramework

Mechanism to choose leader that does not involve ops / modifying file #6544

Possible solutions

Order the Audience

Client connect write by default

Server enforced "write" client