nassim-git / project-voldemort

Automatically exported from code.google.com/p/project-voldemort
Apache License 2.0

Online partition rebalance mechanism. #30

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
Voldemort needs to have a mechanism to add/delete nodes online.

Task list:

cluster.xml update with the new node list (with minimal partition delta)
partition transfer in the background
swap to the new cluster.xml by all nodes

Original issue reported on code.google.com by bbansal....@gmail.com on 11 Feb 2009 at 7:11

GoogleCodeExporter commented 9 years ago
We should fill this out with more of a design once we have one. I think the mechanics are tricky, so writing it down will help us think through all the cases.

Original comment by jay.kr...@gmail.com on 11 Feb 2009 at 11:16

GoogleCodeExporter commented 9 years ago
Ok, for the sake of writing things down and thinking through everything:

With my limited knowledge of Voldemort internals, I think it's going to be very hard to do an absolutely live update with no interruption whatsoever. Just thinking out loud, this might be a way to do it:

1. announce the new cluster configuration to all nodes
2. possibly "prepare" for the new cluster by making additional replications to the nodes that will hold object copies according to the new configuration, and everything that comes with it
3. if everything is prepared for the new cluster configuration, announce that we're going to swap
4. at this point, do not accept any new writes
5. if every node agreed that we're going to swap, perform the swap
6. accept new writes again
7. nirvana

Obviously 3-7 are some form of consensus protocol, and I think Paxos might be the best fit here. Things can get VERY tricky if nodes fail during any of these phases, and I think the best way is to simply reject any nodes trying to "join the pool" while a cluster upgrade is being performed (I'm not sure whether this already exists, but it will require an authentication mechanism to be in place every time a connection is (re-)established).

This means that the cluster configuration should be versioned too in some way: when a node joins a pool (or comes back online after a temporary failure), it should be able to recognize that its configuration is different from the other nodes', and/or whether an upgrade is in progress.

This also means that the cluster configuration on disk should rather be considered a bootstrap configuration: a list of known servers a node can ask the cluster configuration from. Since this can essentially be considered a private key/value entry in itself, perhaps existing mechanisms can be re-used for this.
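As a very rough sketch of that bootstrap flow (all class and method names here are hypothetical, and a plain counter stands in for whatever real versioning scheme gets chosen):

```java
import java.util.List;

// Hypothetical types for illustration only -- not the real Voldemort API.
interface MetadataClient {
    VersionedClusterConfig fetchClusterConfig(String bootstrapServerUrl);
}

class VersionedClusterConfig {
    final long version;              // stand-in for a real vector clock
    final boolean upgradeInProgress;
    final String clusterXml;

    VersionedClusterConfig(long version, boolean upgradeInProgress, String clusterXml) {
        this.version = version;
        this.upgradeInProgress = upgradeInProgress;
        this.clusterXml = clusterXml;
    }
}

class BootstrapSketch {
    /**
     * On (re)join, treat the on-disk config purely as a list of servers to ask,
     * and adopt whatever newer configuration the cluster hands back.
     */
    static VersionedClusterConfig bootstrap(MetadataClient client,
                                            List<String> knownServers,
                                            VersionedClusterConfig onDisk) {
        for (String url : knownServers) {
            VersionedClusterConfig remote;
            try {
                remote = client.fetchClusterConfig(url);
            } catch (RuntimeException unreachable) {
                continue; // try the next known server
            }
            if (remote.upgradeInProgress)
                throw new IllegalStateException("Cluster upgrade in progress, refusing to join");
            // the on-disk copy is only a hint; prefer the newer configuration
            return remote.version > onDisk.version ? remote : onDisk;
        }
        throw new IllegalStateException("No bootstrap server reachable");
    }
}
```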

Original comment by barthrn7@gmail.com on 12 Feb 2009 at 2:43

GoogleCodeExporter commented 9 years ago
Hey Leon, yes I think you have outlined the essentials of our plan. The hope is to work through the various failure scenarios using the existing consistency mechanism and not have to go to Paxos since that will be hard to implement (and embedding Zookeeper would greatly increase the operational burden--since then you have to configure and maintain two storage systems). So for this model, the metadata would be stored in voldemort and versioned using vector clocks. This is half true today, in that clients bootstrap by doing a GET request for cluster.xml or stores.xml from a metadata store; but today we don't allow you to write to the metadata store.
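For illustration, a minimal sketch of what a writable, vector-clock-versioned metadata entry could look like; these are simplified stand-ins, not the actual Voldemort Versioned/VectorClock classes:

```java
import java.util.HashMap;
import java.util.Map;

// Simplified stand-ins: the real thing would use Voldemort's Versioned values
// and proper VectorClock comparisons rather than a bare map of counters.
class VersionedXml {
    final Map<Integer, Long> vectorClock; // nodeId -> update counter
    final String xml;

    VersionedXml(Map<Integer, Long> vectorClock, String xml) {
        this.vectorClock = vectorClock;
        this.xml = xml;
    }
}

class WritableMetadataStoreSketch {
    private final Map<String, VersionedXml> entries = new HashMap<>();

    /** Clients already bootstrap by reading keys like "cluster.xml" this way. */
    synchronized VersionedXml get(String key) {
        return entries.get(key);
    }

    /**
     * The proposed addition: allow writes to the metadata store, bumping the
     * writer's slot in the vector clock so concurrent updates can be detected.
     */
    synchronized void put(String key, String newXml, int writerNodeId) {
        Map<Integer, Long> clock = new HashMap<>();
        VersionedXml current = entries.get(key);
        if (current != null) {
            clock.putAll(current.vectorClock);
        }
        clock.merge(writerNodeId, 1L, Long::sum);
        entries.put(key, new VersionedXml(clock, newXml));
    }
}
```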

Of course the trouble comes in at points (2), (3) and (4), since presumably while you are transferring data to the new node you are also taking new writes. Where do those writes go? Presumably they should go to the new node, but the problem is where to do reads from. The guarantee you have is that if the new server takes all the writes, then if it has a particular key/value, the value it has is at least as current as what the old server has; but if it doesn't have the value, then it could be that the key doesn't exist in the store OR it could be that it just hasn't been transferred yet. So I think the solution is to have the new server act as a sort of proxy, taking all PUTs and answering GETs when it can, proxying the request to the old server when it cannot.
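A minimal sketch of that proxy behaviour, with made-up names and versioning ignored for brevity (real values would carry vector clocks):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch only: the stealer node takes every PUT locally and
// falls back to the donor node for GETs it cannot answer yet.
interface RemoteStore {
    byte[] get(String key); // network call to the old owner of the partition
}

class ProxyingStealerStore {
    private final Map<String, byte[]> local = new ConcurrentHashMap<>();
    private final RemoteStore oldOwner;

    ProxyingStealerStore(RemoteStore oldOwner) {
        this.oldOwner = oldOwner;
    }

    /** All writes land on the new node, so its copy is always at least as current. */
    void put(String key, byte[] value) {
        local.put(key, value);
    }

    /**
     * Answer locally when possible; otherwise the key is either not yet
     * transferred or genuinely absent, and only the old owner can tell which.
     */
    byte[] get(String key) {
        byte[] value = local.get(key);
        return value != null ? value : oldOwner.get(key);
    }
}
```

The nice property is that the stealer never has to distinguish "not yet transferred" from "does not exist"; the donor answers that for it.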

In any case, I think as you can see, we have the basic ingredients but not yet the concrete, tested recipe.

Original comment by jay.kr...@gmail.com on 12 Feb 2009 at 4:47

GoogleCodeExporter commented 9 years ago
Yep, I agree that anything integrating ZooKeeper is a bad idea -- don't know why I even mentioned it. About the question of where the reads/writes should go: obviously you have given it more thought than I have. :)

The proxy solution does indeed cover a lot of scenarios: it also gives clients the ability to immediately use the new configuration without having to worry about the old situation.

I'll make sure to monitor this issue closely. :)

Original comment by barthrn7@gmail.com on 12 Feb 2009 at 5:10

GoogleCodeExporter commented 9 years ago
Step by Step proposal 

1) New node Z reads cluster.xml, finds the partitions to steal, and sets its state to REBALANCE.
2) Z picks one partition 'P' and sends an update-metadata request to all servers.
3) Each server receives the new metadata, updates it, and starts throwing InvalidMetadata exceptions to clients that query with old metadata.
4) Z requests server 'A', which holds partition 'P', to stream the data for 'P' and starts writing the key/value tuples.
5) Client puts are written to 'Z', overwriting old values if present.
6) Client gets are proxied to 'A' if the key is NOT present on 'Z'.
7) Repeat steps 2-6 for the next partition until the steal list is exhausted.
8) Change state to NORMAL.

Other nodes remain in the NORMAL state the whole time.
NOTE: The stolen partition data is not deleted from the original nodes; for BDB this doesn't matter, as it will not be used and will just keep living on disk without any problem. A rough sketch of Z's per-partition loop follows below.
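A sketch of Z's loop over the steal list, with hypothetical interfaces standing in for the real rebalancing code:

```java
import java.util.List;

// Hypothetical interfaces for illustration only.
class KeyValue {
    final byte[] key, value;
    KeyValue(byte[] key, byte[] value) { this.key = key; this.value = value; }
}

interface RebalanceOps {
    void announceNewOwner(int partitionId);                              // step 2: metadata update to all servers
    int donorOf(int partitionId);                                        // which server holds the partition today
    Iterable<KeyValue> streamPartition(int donorNodeId, int partitionId); // step 4: bulk stream from the donor
    void writeLocally(KeyValue entry);
}

class StealerLoopSketch {
    enum State { NORMAL, REBALANCE }

    State state = State.NORMAL;

    void rebalance(RebalanceOps ops, List<Integer> partitionsToSteal) {
        state = State.REBALANCE;                  // step 1
        for (int partition : partitionsToSteal) { // steps 2-7, one partition at a time
            ops.announceNewOwner(partition);      // servers start throwing InvalidMetadata for stale clients
            int donor = ops.donorOf(partition);
            for (KeyValue entry : ops.streamPartition(donor, partition)) {
                ops.writeLocally(entry);          // client puts land here too; gets are proxied to the donor
            }
        }
        state = State.NORMAL;                     // step 8
    }
}
```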

So the dependencies seem to be on a few things:
1) updateMetadata(cluster.xml) on all servers
2) Servers throwing InvalidMetadata on get/put requests made with stale metadata
3) Clients catching InvalidMetadata and requesting new metadata from a server (sketched just after this list)
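A sketch of dependency (3) on the client side, again with placeholder names rather than the real client API:

```java
// Placeholder names; the real client would re-fetch cluster.xml via its bootstrap path.
class InvalidMetadataException extends RuntimeException {
}

interface RoutedClient {
    byte[] get(String key) throws InvalidMetadataException;
    void refreshMetadataFromAnyServer(); // re-read cluster.xml and rebuild routing
}

class MetadataAwareGet {
    static byte[] getWithRefresh(RoutedClient client, String key) {
        try {
            return client.get(key);
        } catch (InvalidMetadataException stale) {
            // Our routing information is out of date: re-bootstrap, then retry once.
            client.refreshMetadataFromAnyServer();
            return client.get(key);
        }
    }
}
```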

Failure scenarios
1) Server 'A' holding partition 'P' is down: Z continues to the next partition; when 'A' comes back online it marks the keys in 'P' as sloppy. (daily daemon ??)
2) Server 'Z' fails while updating: Z can restart from scratch without any problem.

Best
Bhupesh

Original comment by bbansal....@gmail.com on 12 Feb 2009 at 10:42

GoogleCodeExporter commented 9 years ago
The scheme described above breaks when there is more than one simultaneous node addition to the cluster: you add one node, wait for it to stabilize, and then add the next one. We can provide an API to tell when the rebalancing is complete.

Does this make sense, or is it too tight ??

Original comment by bbansal....@gmail.com on 13 Feb 2009 at 12:49

GoogleCodeExporter commented 9 years ago
What is the scenario where it breaks? Is it the case where two nodes try to steal the same partition?

Original comment by jay.kr...@gmail.com on 13 Feb 2009 at 2:11

GoogleCodeExporter commented 9 years ago
Hi, I know the code has been checked into another branch, but I have a different idea for the implementation of this feature. I hope we can have some discussion to see whether we can throw some light on it.

1) New node Z reads cluster.xml, finds the partitions to steal, and sets its state to REBALANCE.
2) Z picks one partition 'P' and sends a partition-stealing request to the owner of 'P', which is A.
3) A marks 'P' as STEALING and transfers the partition to Z. Every update and read on 'P' still goes to A directly. A uses a copy-on-write trick to keep the updates to 'P' in a temporary store (sketched after this list).
4) After several iterations, the size of the copy-on-write temporary store would be much smaller than the size of the original partition. A freezes 'P' and transfers the copy-on-write store to Z. A also redirects any update and read of 'P' to Z during this stage and stores the updates in its original db.
5) After 4) is finished, A throws InvalidMetadata for requests on 'P'. A also propagates the partition assignment change to the other servers.
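A rough sketch of the copy-on-write side store from steps 3) and 4), with made-up names and the versioning details left out:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch: while partition P is being bulk-streamed to Z, the donor A
// keeps serving reads/writes but also records every update in a side store so
// only the (small) delta has to be shipped afterwards.
class CopyOnWriteDonorStore {
    private final Map<String, byte[]> main = new ConcurrentHashMap<>();
    private final Map<String, byte[]> delta = new ConcurrentHashMap<>();
    private volatile boolean stealing = false;

    void beginStealing() { stealing = true; } // step 3: mark 'P' as STEALING

    void put(String key, byte[] value) {
        main.put(key, value);
        if (stealing) {
            delta.put(key, value); // remember what changed during the bulk transfer
        }
    }

    byte[] get(String key) {
        return main.get(key);      // reads keep going to A directly during step 3
    }

    /**
     * Step 4: ship only the keys that changed while the bulk copy ran.
     * Assumes 'P' has already been frozen for writes, so nothing new slips in.
     */
    Map<String, byte[]> drainDelta() {
        Map<String, byte[]> snapshot = new ConcurrentHashMap<>(delta);
        delta.clear();
        return snapshot;
    }
}
```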

The differences from "bbansal.usc"'s proposal are:
1) A is the arbitrator of the assignment of 'P', so it can refuse partition stealing from another server, which avoids the partition being stolen by two servers concurrently.
2) If A or Z fails before 5), the steal operation fails and A keeps the partition without updating the partition assignment.
3) The redirected requests are confined to a minimal time period. This is important from a performance perspective if A is very busy.

NOTE:
a value could have multiple versions during 4) if the old value from the copy-on-write store arrives after the newer redirected value. But that's not a big problem, because the system has had the capability of handling version conflicts built in from day one.

Original comment by schumi....@gmail.com on 26 Feb 2009 at 1:48

GoogleCodeExporter commented 9 years ago
Hey Schumi,

Interesting idea here again. A few things:

1) In one cluster configuration a partition can be assigned to only one node, so the same partition cannot be stolen by two servers at the same time from 'A'.

2) A redirected GET is not expensive; it just adds one more network hop of additional latency. Keeping a separate store on A or redirecting is the same for performance on 'A'; the proxy mode is better for performance as the put load is taken by the 'Z' node.

3) In the current implementation 'A' does not delete any keys even after successful rebalancing, so we can switch back to the original cluster configuration on any failure on Z. (The failure scenarios are not fully tested right now; thanks for reminding me, I will check these cases :)

And yes, the system handles the versioning problem transparently, so we should not worry about that.

Original comment by bbansal....@gmail.com on 26 Feb 2009 at 3:59

GoogleCodeExporter commented 9 years ago
Yes, in some ways this is a better approach. The problem would be that I think the individual stores would have to implement the copy-on-write, so it might be a bit harder to implement.

It reminds me of how virtual server migration works.

We don't really know for sure how cheap redirection is.

Original comment by jay.kr...@gmail.com on 29 Mar 2009 at 10:14

GoogleCodeExporter commented 9 years ago
hmm .. 

Copy-on-write can be avoided if we search the "stealer node" (the new node) first for any get request and only redirect the missing entries. All puts are handled by the new node anyway, so it will have the latest values.

Agreed, redirection might not be cheap, especially with a continuous stream load on both servers. Any other ideas here ??

Original comment by bbansal....@gmail.com on 30 Mar 2009 at 5:59

GoogleCodeExporter commented 9 years ago
Splitting this into multiple pieces for parallel development. 

http://code.google.com/p/project-voldemort/issues/list?can=2&q=Partition+sub+task&colspec=ID+Type+Status+Priority+Milestone+Owner+Summary&cells=tiles

Original comment by bbansal....@gmail.com on 31 Mar 2009 at 1:31

GoogleCodeExporter commented 9 years ago
Closing this as the new updates should go to sub-tasks.

Original comment by bbansal....@gmail.com on 31 Aug 2009 at 11:34