loucash / riak_governor

What's an ensemble of leaders? A riak_governor.
BSD 2-Clause "Simplified" License
10 stars 4 forks source link

API is meaningless or even harmful #4

Closed sumerman closed 2 years ago

sumerman commented 9 years ago

I cannot think of a case when the single is_leader(Key) returning boolean may be useful. Its result may easily obsolete even before a process who called the function get a chance to analyze the result. At the very least it should return the amount of leader-lease time remaining and the time when this reminder has been taken.

loucash commented 9 years ago

Hey @sumerman,

well, riak_governor is not meant to give any guarantees that leader won't change while some code is being performed based on is_leader results.

However:

I hope that helps, Lukasz

RomanShestakov commented 9 years ago

I actually find the idea of having API to detect a primary replica for a set of keys to be very useful. The project I am working on now is a distributed task processing scheduler. Jobs which should be executed on remote hosts are distributed across riak_core ring and each physical node in the cluster needs to process the tasks from the local to this host vnodes.

The issue is that I want to have a readability - which means I need to keep multiple replicas of each task in the different vnodes but on the other hand I want to make sure that each task is processed only once.

so, having ability to take a task from the vnode, and check if it is the primary or secondary replica is very useful.

I have to admit I would prefer if the default consensus protocol was Paxos (via riak_ensemble). I see that there are a way to switch between riak_ensemble and rafter but I don't understand why rafter is specified as pre-requisite to start riak_leader even if riak_ensemple is used.

in riak_governor/src/riak_governor.app.src {applications, [stdlib, kernel, riak_core, rafter]},

in any case, on my opition - this is cool and useful project. I am looking forward to full riak_ensemble integration and example how to use it.

thank your guys to building this.

Regards, Roman

darach commented 9 years ago

Hi Roman,

The riak_ensemble provider still needs a few adjustments/changes but one of my use case(s) is not dissimilar to yours so I'm fairly certain this will work for you too. I was going to use a locks/locks_leader based approach but there were issues with healing under certain conditions.

Cheers,

Darach.

sumerman commented 9 years ago

Hi @RomanShestakov

The issue is that I want to have a readability - which means I need to keep multiple replicas of each task in the different vnodes but on the other hand I want to make sure that each task is processed only once. so, having ability to take a task from the vnode, and check if it is the primary or secondary replica is very useful.

It's not possible with a current API because is_leader may return true to several nodes (almost) simultaneously. E.g. Given that [A, B, C] are replicas for the key K1 and a leader process P_Al for this key resides on the node A :

  1. Node A becomes isolated from the others.
  2. During a last couple of milliseconds before the netsplit can be detected by the underlying consensus algorithm a process P1 on the node A calls is_leader(K1).
  3. P1 ends up in a receive statement and goes off the scheduler.
  4. Local leader P_Al processes the request from P1 and sends true back.
  5. On each node a timeout in a consensus peer fires and a new election begins.
  6. P1 returns back to the scheduler and proceeds as if it was on a leader node.
  7. A process P_Bl on the node B becomes a new leader.
  8. Process P2 on the node B calls is_leader(K1).
  9. P2 ends up in a receive statement and goes off the scheduler.
  10. Local leader P_Bl processes the request from P2 and sends true back.
  11. P2 returns back to the scheduler and proceeds as it should on a leader node.
  12. The network heals. Despite a sequence of events like this seems unlikely, it can easily occur in a local network where latencies are relatively low. Voilà a task is being processed by two nodes simultaneously.
RomanShestakov commented 9 years ago

thanks Valery, this is interesting example. sign, nothing is ever easier with distributed systems. thanks for pointing this out.

RomanShestakov commented 9 years ago

Darach - would you be able to share your use case if it is possible?

sumerman commented 9 years ago

Hi @loucash

If, riak_governor is not meant to give such guarantees, this should be stated in bold in the first line of README. Can you give me an example where such API can be useful?

it is based on raft (rafter backend), so if there is no network partitions or node failures, leader won't change; there is no lease time. (@darach, how it behaves with riak_ensemble?)

In Raft (and rafter in particular) a peer will start a new election only after some timeout from the last AppendEntries received. This timeout technically is a leader lease. riak_ensemble uses leader leases for the same purpose. Unfortunately both libraries does not expose these leases.

I see two solutions here:

  • make actions commutative (it is what I'm doing in my project at the moment)
  • we consider another api to subscribe to leader change event
    1. Why do you need a leader election at all if actions in your system are commutative?
    2. This won't solve the problem. A process may have a pile of leader/not_leader in it's message box when it gets scheduled with no way to figure out which statement is true.

I would like to see an API similar to the wooga/locker or some disclaimer on right and wrong use-cases in the README.

Valery.

loucash commented 9 years ago

Hi @sumerman, riak_governor is still under development, you are right, we will update README.

Actions differs from data by having a duration. Nodes can agree on a leader who will perform an action, but it still may fail (network partitioned) and the same action will be performed elsewhere. The one thing I can advise is to make your action fast.

Keeping above in mind leader lease is useless too. You will also end up having two actions performed by two nodes. Consider an example: you have nodes n1, n2, n3. Leader node n1 has a lease for 2000ms, starts an action that takes 3000ms (or whatever, lease ends up in 300ms, your action takes 500ms ...), after 1000ms n1 is separated from n2, n3, node n2 becomes a leader and starts an action. Voilà.

  1. Why do you need a leader election at all if actions in your system are commutative?

Because those actions are (potentially) expensive, if possible I want to perform them once (best effort).

  1. This won't solve the problem. A process may have a pile of leader/not_leader in it's message box when it gets scheduled with no way to figure out which statement is true.

Messages in message box are ordered, right? So it is some solution to notify your action about leader change. Probably the same you would do with the lease: "send me a message after lease time ms to check if I am still a leader". Am I right?

I would like to see an API similar to the wooga/locker or some disclaimer on right and wrong use-cases in the README.

We will work on use-cases. Stay tuned.

Cheers, Lukasz

darach commented 9 years ago

Hi guys,

There isn't a single consensus framework I've looked at that doesn't have issues. Most do not document the conditions under which they occur or indicate appropriate actions w.r.t. recovery when catastrophic. There are nuances, tradeoffs and features that all deserve consideration.

All riak_governor does is make these libraries reasonably pluggable. The caveat emptor for any given scenario is that certain uses MAY not be a good fit for the riak_governor API as it currently stands. Clearly, an is_leader predicate is good enough for a lot of use cases - mine, Lukasz's, Heinz's. Clearly, it isn't for others - Yours or any use case where leasing is nice to have.

Valery, if the steps you outlined are an important use case for you then you can contribute that as a test:

https://github.com/Licenser/governor_test

I have a use case that would benefit from wooga/locker in the future as leasing would be advantageous so I appreciate the concerns you have with an is_leader predicate. I wouldn't use an is_leader predicate there either. But, it is more reasonable to recognise that the shape of problem our use case brings to the table is different from the API offered. That's ok as riak_governor is a work in progress.

Adding wooga/locker as a provider would be a valued contribution as would an API that is more lease oriented. If this is important to you it just means you're likely to get to it first. So, feel free to submit a PR here... We all benefit!

Cheers,

Darach.

darach commented 9 years ago

Oh. #riak_governor on freenode if you want to irc.