Closed vladsud closed 2 years ago
Some more details around the scenario-
All the members present(in Audience) may or may not be equally eligible to become leader. And this eligibility can be revoked/granted from/to any client in between the session. All eligible candidates should try to become leader and one should be able to pick the task. If the leader's eligibility is revoked during session, he must be able to release the task so that some other eligible candidate can pick it. As soon as the leader is disconnected from runtime, others should get to know that leader is disconnected so that rest of the audience can try to become leader.
merging dup #6556
With in the fluid framework there are a number of mechanisms to create consensus for tasks, leader election, and summarization like dds, agent schedule, oldest client, etc.
However, all these methods require clients to be in the quorum, or more generally require them to be on a write WebSocket connection. This creates a problem, as by default the runtime connects via a read connection. Read connected clients are only in the audience, and not the quorum. Currently it is not feasible to create reliable consensus via the audience, as the signals sent to manage the audience are fully decoupled from the op stream so no reliable ordering can be created.
This creates numerous problems
A naive solution could be to order alphabetically by audience client id, but this is not reliable, as new client could join at any moment with a higher id. However, if audience had some connection to the op stream, like a ref seq added at connection time, we could create a reliable ordering based on that and using alphabetical sort as a tie breaker, as new clients will have a higher ref seq.
The default read connection was added to reduce server and storage cost, primarily around no-op, and join/leaves. There may also be other server related optimizations that further reduce the cost of read clients.
Similar to ordering the audience, but have the server pick an eligible client and promote it to write mode, and ensure there is always at least one client in write mode if there is a client eligible for write.
Some additional thoughts on the subject: Given task selection example (which requires selection of qualified client, not just any client), it's better to break problem into two:
This makes system more generic and allows flexibility. This is also similar to how other algorithms approach it, like ZooKeeper algorithm. And it also allows selection of different algorithms for # 1. For example scenarios where only "write" clients are involved, we can chose oldest in quorum solution for # 1, while keeping generic version of # 2.
# 1 (for audience) can be achieved via sorting clientIds in audience and picking one (lowest / highest). It would require some sort of confirmation, as there are no guarantees RE quorum ordering updates (i.e. two clients can see different order of clients appearing or disappearing from audience), so two clients may believe they are leaders at the same time. ZooKeeper relies on quorum and fixed number of nodes, whereas in our case nodes come come and go, so achieving correctness might be challenging
# 2 can be achieved in similar way as ZooKeeper learns about committed changes - by asking all clients if they are already running tasks, and either keeping current assignments or reassigning tasks. This is much easier problem as in the worst case task would migrate to another node in case of conflicts.
Putting it in July to make progress on design.
Poking a bit around guarantees that server provides RE signal ordering (or rather lack of any guarantees), I believe there is no way to arrive at consensus RE leader election. In other words, any solution we will go with will end up with multiple leaders being elected at the same time when races happen. Even adding sequential numbers to clients joining audience (i.e. each join results in atomic increment of a counter and communication of that counter to client) does not help as there will be gaps and we do not know what was the number of last client that left audience. Even adding sequential IDs to all signals does not help, as there is no serialized state to work with. I.e. this sequence is possible:
For the time frame while clients A & B do not know about each other, they will assume that each of them is a leader. Sending signals after join does not help, as there is no guarantee on order of them arriving to other clients or hearing ack back.
So if we do build a solution here, it should be very clear that leader election is probabilistic. We could use delays to decrease races (i.e. potential leader would wait some time before it acts, thus reducing number of cases where there are multiple leaders to a problem - there is no leader), but races are not going away.
This on the surface is not that much different from promises we have today with leader election for "write" clients given that we have lack of transactions, i.e. a client may submit a bunch of ops (changes to a file) and then lean that it was disconnected (and thus it was not a leader), but such ops would make it through on reconnect. This essentially means that we could have multiple leaders at once. The difference, however, is that we could add transactions (in some form) to address that problem, while signals do not have any solution to guarantee single leader election.
Once leader is selected, it can get data on clients currently performing tasks and assign any unassigned tasks based on info it has (and keep doing it as time goes by).
I think the biggest question here - do we want to be in business of delivering and owning such primitive.
I think the only way it works is by adding min seq on the signals, and not picking a leader until it moves forward, either naturally, or by a server generated no-op. That gets us to the same place as a leader based on oldest client in the quorum, as the no-op works as a barrier on signal join. I don't think we need a barrier on leave. It is also batched.
Another related problem we have is that read client don't summarizer, so the above can lead to a large op tail of just no-ops.
i agree it may not be something we want to do.
In general, i think i prefer write by default for clients that support it, but i know that has server implications. I view server enforced write client(s) as a subset of this. They both have the advantage that we don't have to deal with consensus amongst read clients
I think we can relax the original 'read-only' requirement a bit since to do useful/observable work, the leader will generally need to be a writer. If the clients can (eventually) predict the leader using local information, a scheme like the following might work:
Since a sufficiently stale client might observe multiple leadership changes, enqueuing work items needs to be idempotent to handle the following sequence:
At this point it is unclear if the original signal to 'A' was sent before or after leadership moved back to 'A'. The simplest solution is to always retry and make enqueue an idempotent operation, at least within the collab window. This also makes it possible to have a pragmatic backstop of retrying based on a timer.
Some possible optimizations:
Thanks @DLehenbauer & @anthony-murphy !
@prerna21, I think we would want your team to take it from here, for starters - resolve requirements, possibilities, and possibly - build actual solutions. I see couple directions (some are not exclusive, i.e. # 1 & 2 & 4 can be done together):
All but 4 would also require building task assignment logic using signals. I.e. leader is only coordinating task assignment, but task assignment process can be used to select the right client for the job (this assumes that no every client can do jobs). This capability does not exist today. Task assignment can be done predictably (i.e. 100% correct) if leader election process guarantees no more than one leader at a time (this excludes # 3). Otherwise 2+ clients can believe they are assigned given task.
Race condition when multiple leaders are inadvertently selected at once are rare. We can have multiple leaders, if we have tie for sequence number of the agent. We can prevent some clients from participating in the election, if they are in "read" permission only, and have to only include with write permissions or who have write after some time.
On a theoretical level, we can reduce round trip by preparing once and pipeline accepts as seen in this lecture slide Lecture_11_Leader-based_Sequence_Paxos.pdf. I am not familiar with practical distributed systems, just a few theory here and there. @vladsud
Closing old issues. That said, with ODSP recent changes (of ordering join & leave signals), we can make it work now! Just need FRS to follow the suit!
https://dev.azure.com/fluidframework/internal/_queries/edit/750/?triage=true tracks same work (though specific usage).
There is a need to have some kind of structure / mechanism that could be used to do leader election kind of activity (maybe generalized to do more, but there is currently no need for that) without changing a file, i.e. without generating ops.
Scenario - application need to allocate single agent that can do some work on behalf of all users in a document (or run document centric tasks) and potentially commit certain changes to document if task results in changes to document. However if such task does not result in changes to document, no changes are done, thus not causing user who is only viewing document become the one who last edited the document.
Due to these requirements, such mechanism can't rely on quorum, as quorum works only for "write" connections, and thus it requires changes to a file (direct or indirect) do to join op generated on behalf of client. This might be a good point in time to reconsider join/leave/noop being attributed to a user and reasons for "read" connection mode existence (things changed from introduction of "read" connection mode, i.e. frequency of noops is much lower nowadays than it used to be, and all these ops could be made not to be attributed to a user).
But assuming we are not modifying other areas, such mechanism would need to be based solely on signals, and likely - based on Audience.
Related teams thread