We should fill this out with more of a design as we have one. I think the mechanics are tricky, so writing it down will help us think through all the cases.
Original comment by jay.kr...@gmail.com
on 11 Feb 2009 at 11:16
Ok, for the sake of writing things down and thinking through everything: with my limited knowledge about Voldemort internals, I think it's going to be very hard to do an absolute live update with no interruption whatsoever. Just thinking out loud, this might be a way to do it:
1. announce the new cluster configuration to all nodes
2. possibly "prepare" for the new cluster by making additional replications to the nodes that will have object copies according to the new configuration, and all that comes with it
3. if everything is prepared for the new cluster configuration, announce that we're going to swap
4. at this point, do not accept any new writes
5. if every node agreed that we're going to swap, perform the swap (see the sketch below)
6. accept new writes again
7. nirvana
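Just to make the sequencing of steps 3-6 concrete, here is a very rough sketch; NodeAdmin and all the method names are invented for illustration and are not existing Voldemort APIs:

    // All names here (NodeAdmin, freezeWrites, ...) are invented for illustration;
    // this is not Voldemort's API, just the shape of steps 3-6.
    import java.util.List;

    public class ClusterSwapSketch {

        /** Minimal per-node admin operations assumed by this sketch. */
        interface NodeAdmin {
            boolean prepare(String newClusterXml);  // step 3: is this node ready to swap?
            void freezeWrites();                    // step 4: stop accepting new writes
            void activate(String newClusterXml);    // step 5: switch to the new cluster.xml
            void unfreezeWrites();                  // step 6: accept new writes again
        }

        public static void swap(List<NodeAdmin> nodes, String newClusterXml) {
            // step 3: every node must agree before we touch anything
            for (NodeAdmin node : nodes)
                if (!node.prepare(newClusterXml))
                    throw new IllegalStateException("a node vetoed the swap");
            try {
                // step 4: stop accepting new writes
                for (NodeAdmin node : nodes)
                    node.freezeWrites();
                // step 5: everyone agreed, perform the swap
                for (NodeAdmin node : nodes)
                    node.activate(newClusterXml);
            } finally {
                // step 6: accept new writes again, even if the swap failed part-way
                for (NodeAdmin node : nodes)
                    node.unfreezeWrites();
            }
        }
    }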
Obviously 3 - 7 are some form of consensus protocol, and I think Paxos might be the best fit here. Things can get VERY tricky if nodes fail during any of these phases, and I think the best way is to just reject any nodes trying to "join the pool" while performing a cluster upgrade (I'm not sure whether this already exists, but it will require an authentication mechanism to be put in place every time a connection is (re-)established).
This means that the cluster configuration should be versioned too in some way: when a node joins a pool (or comes back online after a temporary failure), it should be able to recognize that its configuration is different from the other nodes', and/or whether an upgrade is in progress.
This also means that the cluster configuration on disk should rather be considered a bootstrap configuration: a list of known servers a node can ask for the cluster configuration. Since this can essentially be considered a private key/value entry in itself, perhaps existing mechanisms can be re-used for this.
Original comment by barthrn7@gmail.com
on 12 Feb 2009 at 2:43
Hey Leon, yes I think you have outlined the essentials of our plan. The hope is to work through the various failure scenarios using the existing consistency mechanism and not have to go to Paxos, since that will be hard to implement (and embedding ZooKeeper would greatly increase the operational burden, since then you have to configure and maintain two storage systems). So in this model the metadata would be stored in Voldemort and versioned using vector clocks. This is half true today, in that clients bootstrap by doing a GET request for cluster.xml or stores.xml from a metadata store; but today we don't allow you to write to the metadata store.
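To illustrate the bootstrap idea, a toy sketch of a client fetching cluster.xml from a list of known servers; MetadataClient and the rest of the names are placeholders, not the real client classes:

    // A toy sketch only: MetadataClient is a stand-in, not the real client class.
    import java.util.List;

    public class BootstrapSketch {

        /** A GET against one server's metadata store. */
        interface MetadataClient {
            String get(String key);
        }

        public static String fetchClusterXml(List<MetadataClient> bootstrapServers) {
            for (MetadataClient server : bootstrapServers) {
                try {
                    // cluster.xml lives in the metadata store like any other value
                    // and would be versioned with a vector clock server-side
                    return server.get("cluster.xml");
                } catch (RuntimeException e) {
                    // this server is unreachable; fall through to the next one
                    // from the on-disk bootstrap list
                }
            }
            throw new IllegalStateException("no bootstrap server reachable");
        }
    }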
Of course the trouble comes in at points (2), (3) and (4), since presumably while you are transferring data to the new node you are also taking new writes. Where do those writes go? Presumably they should go to the new node, but then the problem is where to do reads from. The guarantee you have is that if the new server takes all the writes, then if it has a particular key/value, the value it has is at least as current as what the old server has; but if it doesn't have the value, it could be that the key doesn't exist in the store OR it could be that it just hasn't been transferred yet. So I think the solution is to have the new server act as a sort of proxy, taking all PUTs, answering GETs when it can, and proxying the request to the old server when it cannot.
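Roughly, the proxy behaviour on the new server could look like this; Store here is just a stand-in interface, not Voldemort's actual store API:

    // Store is a stand-in interface, not Voldemort's actual store API.
    public class ProxyStoreSketch {

        interface Store {
            byte[] get(byte[] key);             // null means "key not present here"
            void put(byte[] key, byte[] value);
        }

        private final Store local;      // the new server's own store
        private final Store oldServer;  // a client talking to the old owner of the partition

        public ProxyStoreSketch(Store local, Store oldServer) {
            this.local = local;
            this.oldServer = oldServer;
        }

        public void put(byte[] key, byte[] value) {
            local.put(key, value);              // all writes land on the new server
        }

        public byte[] get(byte[] key) {
            byte[] value = local.get(key);
            if (value != null)
                return value;                   // at least as current as the old copy
            return oldServer.get(key);          // not transferred yet, or doesn't exist at all
        }
    }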
In any case, I think as you can see, we have the basic ingredients but not yet the concrete, tested recipe.
Original comment by jay.kr...@gmail.com
on 12 Feb 2009 at 4:47
Yep, I agree that integrating something like ZooKeeper is a bad idea -- don't know why I even mentioned it. As for the question of where the reads/writes should go: obviously you have given it more thought than I have. :)
The proxy solution does cover a lot of scenarios indeed: it also gives clients the ability to immediately use the new configuration without having to worry about the old situation.
I'll make sure to monitor this issue closely. :)
Original comment by barthrn7@gmail.com
on 12 Feb 2009 at 5:10
Step-by-step proposal:
1) New node Z reads cluster.xml, finds the partitions to steal, and sets its state to REBALANCE.
2) Z picks one partition 'P' and sends an update-metadata request to all servers.
3) Each server receives the new metadata, updates it, and starts throwing InvalidMetadata exceptions to any client that queries with the old metadata.
4) Z requests the server 'A' holding partition 'P' to stream the data for 'P' and starts writing the key-data tuples.
5) Client puts are written on 'Z', overwriting old values if present.
6) Client gets are proxied to 'A' if NOT present on 'Z'.
7) Repeat steps 2-6 for the next partition until the steal list is exhausted.
8) Change state to NORMAL.
The other nodes remain in the NORMAL state the whole time (a rough sketch of this loop follows the note below).
NOTE: The stolen partition data is not deleted from the original nodes; for BDB this doesn't matter, as it will not be used and will keep living on disk without any problem.
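Here is a rough sketch of the per-partition loop in steps 1-8; all the types and method names are invented for illustration:

    // All types and method names are invented for illustration.
    import java.util.List;

    public class StealerNodeSketch {

        interface AdminClient {
            void updateMetadata(String newClusterXml);            // step 2 (step 3 is the server's reaction)
            Iterable<byte[][]> streamPartition(int partitionId);  // step 4: (key, value) pairs
        }

        interface LocalStore {
            void put(byte[] key, byte[] value);
            void setState(String state);                          // REBALANCE / NORMAL
        }

        public static void rebalance(List<Integer> partitionsToSteal,
                                     List<AdminClient> allServers,
                                     AdminClient donor,          // server 'A'
                                     LocalStore store,           // node 'Z'
                                     String newClusterXml) {
            store.setState("REBALANCE");                          // step 1
            for (int partition : partitionsToSteal) {             // step 7: one partition at a time
                for (AdminClient server : allServers)
                    server.updateMetadata(newClusterXml);         // steps 2-3
                for (byte[][] entry : donor.streamPartition(partition))
                    store.put(entry[0], entry[1]);                // step 4
                // steps 5-6 (puts land on Z, gets proxied to 'A' when missing)
                // happen on the normal request path, not in this loop
            }
            store.setState("NORMAL");                             // step 8
        }
    }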
So the dependency seems to be on a few things:
1) updateMetadata(cluster.xml) on all servers
2) servers throwing InvalidMetadata on get/put requests
3) the client catching the InvalidMetadata exception and requesting new metadata from a server (sketched below)
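Dependency 3 on the client side could look roughly like this; the exception and client types are placeholders:

    // The exception and client types are placeholders for illustration.
    public class ClientRetrySketch {

        static class InvalidMetadataException extends RuntimeException {}

        interface StoreClient {
            byte[] get(byte[] key) throws InvalidMetadataException;
            void rebootstrap();  // re-fetch cluster.xml / stores.xml from any known server
        }

        public static byte[] getWithRebootstrap(StoreClient client, byte[] key) {
            try {
                return client.get(key);
            } catch (InvalidMetadataException e) {
                client.rebootstrap();    // pick up the new cluster metadata
                return client.get(key);  // retry once against the new layout
            }
        }
    }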
Failure scenarios:
1) Server 'A' holding partition 'P' is down: Z continues to the next partition; when 'A' comes back online it marks the keys in 'P' as sloppy (daily daemon??).
2) Server 'Z' fails while updating: Z can restart from scratch without any problem.
Best
Bhupesh
Original comment by bbansal....@gmail.com
on 12 Feb 2009 at 10:42
The scheme described above breaks when there is more than one simultaneous node addition to the cluster: you add one node, wait for it to stabilize, and then add the next one. We can provide an API to tell when the rebalancing is complete.
Does this make sense, or is it too tight??
Original comment by bbansal....@gmail.com
on 13 Feb 2009 at 12:49
What is the scenario where it breaks? Is it the case where two nodes try to steal the same partition?
Original comment by jay.kr...@gmail.com
on 13 Feb 2009 at 2:11
Hi, I know the code has been checked into another branch, but I have a different idea for the implementation of this feature. Hopefully we can have some discussion to see whether we can throw some light on it.
1) New node Z reads cluster.xml, finds the partitions to steal, and sets its state to REBALANCE.
2) Z picks one partition 'P' and sends a partition-stealing request to the owner of 'P', which is A.
3) A marks 'P' as STEALING and transfers the partition to Z. Every update and read of 'P' still goes to A directly. A uses a copy-on-write trick to keep the updates to 'P' in a temporary store (see the sketch after this list).
4) After several iterations, the size of the copy-on-write temporary store would be much smaller than the size of the original partition. A freezes 'P' and transfers the copy-on-write store to Z. A also redirects any update and read of 'P' to Z during this stage and stores the updates in its origin db.
5) After 4) is finished, A throws InvalidMetadata for requests on 'P'. A also propagates the partition assignment change to the other servers.
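A toy sketch of the copy-on-write idea in step 3 (illustrative only, not actual Voldemort classes):

    // Illustrative only, not actual Voldemort classes; String keys for brevity.
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    public class CopyOnWriteDonorSketch {

        private final Map<String, byte[]> mainStore = new ConcurrentHashMap<String, byte[]>();
        private final Map<String, byte[]> sideStore = new ConcurrentHashMap<String, byte[]>();
        private volatile boolean stealing = false;

        public void startStealing() { stealing = true; }     // step 3: 'P' is marked STEALING

        public void put(String key, byte[] value) {
            if (stealing)
                sideStore.put(key, value);   // only the delta accumulates here
            else
                mainStore.put(key, value);
        }

        public byte[] get(String key) {
            byte[] value = sideStore.get(key);                // newest value wins
            return value != null ? value : mainStore.get(key);
        }

        public Map<String, byte[]> delta() {
            // step 4: A freezes 'P' and ships this (much smaller) delta to Z;
            // after that, requests for 'P' are redirected to Z rather than served here
            return sideStore;
        }
    }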
The differences from the proposal of "bbansal.usc" are:
1) A is the arbitrator of the assignment of 'P', so it can refuse a partition-stealing request from another server, avoiding the partition being stolen by two servers concurrently.
2) If A or Z fails before 5), the steal operation fails and A keeps the partition without updating the partition assignment.
3) The period during which requests are redirected is kept to a minimum. This is important from a performance perspective if A is very busy.
NOTE: the value could have multiple versions during 4), if the old value from the copy-on-write store arrives after the newer redirected update. But that's not a big problem, because the system has had the capability of handling version conflicts built in from day one.
Original comment by schumi....@gmail.com
on 26 Feb 2009 at 1:48
Hey Schumi,
Interesting idea here again. A few things:
1) In one cluster configuration a partition can be assigned to only one node, so the same partition cannot be stolen from 'A' by two servers at the same time.
2) A redirected GET is not expensive; it just adds one more network hop of additional latency. Keeping a separate store on A versus redirecting is the same for performance on 'A'; the proxy mode is better for performance, as the put load is taken by the 'Z' node.
3) In the current implementation 'A' does not delete any keys even after a successful rebalancing, so we can switch back to the original cluster configuration on any failure on Z. (The failure scenarios are not fully tested right now; thanks for reminding me, I will check these cases :)
And yes, the system handles the versioning problem transparently, so we should not worry about that.
Original comment by bbansal....@gmail.com
on 26 Feb 2009 at 3:59
Yes, in some ways this is a better approach. The problem would be that I think the individual stores would have to implement the copy-on-write, so it might be a bit harder to implement.
It reminds me of how virtual server migration works.
We don't really know for sure how cheap redirection is.
Original comment by jay.kr...@gmail.com
on 29 Mar 2009 at 10:14
Hmm, copy-on-write can be avoided if we search the "stealer node" (the new node) first for any get request and only redirect the missing entries. All puts are handled by the new node anyway, so it will have the latest values.
Agreed, redirection might not be cheap, especially with a continuous stream load on both servers. Any other ideas here??
Original comment by bbansal....@gmail.com
on 30 Mar 2009 at 5:59
Splitting this into multiple pieces for parallel development:
http://code.google.com/p/project-voldemort/issues/list?can=2&q=Partition+sub+task&colspec=ID+Type+Status+Priority+Milestone+Owner+Summary&cells=tiles
Original comment by bbansal....@gmail.com
on 31 Mar 2009 at 1:31
Closing this as the new updates should go to sub-tasks.
Original comment by bbansal....@gmail.com
on 31 Aug 2009 at 11:34
Original issue reported on code.google.com by
bbansal....@gmail.com
on 11 Feb 2009 at 7:11