prabhatbhattarai / project-voldemort

Automatically exported from code.google.com/p/project-voldemort
Apache License 2.0

Killing Voldemort on the stealer node during rebalancing stops the rebalance, and it does not start again #325

Open GoogleCodeExporter opened 8 years ago

GoogleCodeExporter commented 8 years ago
What steps will reproduce the problem?
1. Populate data
2. Start rebalancing, moving one partition
3. Kill -9 the Voldemort stealer node before it completes the rebalance
4. Restart Voldemort

What is the expected output? 
I expect the rebalance tasks to resume on the stealer node.

What do you see instead?
No rebalance task starts.

What version of the product are you using? 
Master

On what operating system?
Linux - CentOS 5.3

Please provide any additional information below.
Metadata was written correctly to the .temp/ directory:

[root@bsdhcp15773 config]# more .temp/*
::::::::::::::
.temp/node.id
::::::::::::::
3
::::::::::::::
.temp/rebalancing.steal.info.key
::::::::::::::
[{"stealerId":3, "donorId":1, "partitionList":[1], 
"unbalancedStoreList":["my_store_john10", "my_store_john0"], 
"stealMasterPartitions":[1], "deletePartitionsList":[1], 
"stealerNodeROStoreToDir":{}
, "donorNodeROStoreToDir":{}, "attempt":0}]
::::::::::::::
.temp/server.state
::::::::::::::
REBALANCING_MASTER_SERVER

The problem is that the logic should keep track of the "execution state" of
the rebalance task.  As of today, the only things recorded are the fact that
the server was doing a rebalance and the rebalancePartitionsInfo, but not the
execution status.  So when Voldemort is brought up again, the Metadata is
re-initialized with the information found in .temp/.  Later on, when the
Rebalance.java thread calls attemptRebalance(), it finds that there is a task
already and throws an exception.  The task is there because the Metadata
initialization process reads the data from the local file system BEFORE the
Rebalance thread executes run().
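
To make the failure mode concrete, here is a minimal sketch of a guard that treats "an entry exists" as "the entry is being executed". The class and method shapes (StealInfoSketch, PresenceOnlyGuardSketch, loadFromDisk) are hypothetical stand-ins for illustration, not the actual Voldemort code; only the name attemptRebalance() comes from the description above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for one persisted steal entry; not the real Voldemort class.
final class StealInfoSketch {
    final int stealerId;
    final int donorId;

    StealInfoSketch(int stealerId, int donorId) {
        this.stealerId = stealerId;
        this.donorId = donorId;
    }
}

// Hypothetical guard that treats "entry exists" as "entry is being executed".
final class PresenceOnlyGuardSketch {
    private final List<StealInfoSketch> known = new ArrayList<>();

    // Simulates metadata initialization restoring an entry from .temp/ at startup.
    void loadFromDisk(StealInfoSketch info) {
        known.add(info);
    }

    // Simulates the rebalance thread asking to run a steal from the given donor.
    void attemptRebalance(int stealerId, int donorId) {
        for (StealInfoSketch info : known) {
            if (info.stealerId == stealerId && info.donorId == donorId) {
                // Presence alone is taken as proof that someone is already running it.
                throw new IllegalStateException(
                        "rebalance " + donorId + " -> " + stealerId + " already in progress");
            }
        }
        known.add(new StealInfoSketch(stealerId, donorId));
    }
}
```

With this shape, calling loadFromDisk(new StealInfoSketch(3, 1)) followed by attemptRebalance(3, 1) throws, even though nothing is actually running the restored task.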

So... Metadata reads the information saved on disk fine (which is what we
want), but the logic fails to check whether this rebalancePartitionsInfo is
being executed or not. The logic thinks that because the
rebalancePartitionsInfo is already there, "somebody" is executing it.

Original issue reported on code.google.com by john.jav...@gmail.com on 14 Jan 2011 at 11:09

GoogleCodeExporter commented 8 years ago
2 files attached.
The main idea is to have a status for the execution of the
rebalancePartitionsInfo.
A new enum was created, along with a class that encapsulates one
rebalancePartitionsInfo and its execution status.

When the Voldemort server is restarted, Metadata is initialized as before.
This time RebalanceState includes an extra piece of information: a class named
RebalancePartitionsInfoLiveCycle, which contains the status with an initial
value of "NEW". Now when the Rebalance thread executes, it checks not only
that this rebalancePartitionsInfo exists BUT also its status.  If the status
is NEW, it executes the task (as before - RebalanceAsyncOperation) and updates
the status to "RUNNING".
Now rebalance threads have a way to see whether a task is being executed or not.
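
The attached files are not reproduced in this thread, so the following is only a rough sketch of the idea, under the assumption that the status has just the two values mentioned above (NEW and RUNNING). All names here (ExecutionStatusSketch, TrackedTaskSketch, StatusAwareGuardSketch, tryStart) are hypothetical and do not claim to match the patch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical status enum; the patch may define more states than these two.
enum ExecutionStatusSketch { NEW, RUNNING }

// Hypothetical wrapper pairing one steal entry with its execution status.
final class TrackedTaskSketch {
    final int stealerId;
    final int donorId;
    private ExecutionStatusSketch status = ExecutionStatusSketch.NEW;

    TrackedTaskSketch(int stealerId, int donorId) {
        this.stealerId = stealerId;
        this.donorId = donorId;
    }

    synchronized ExecutionStatusSketch status() {
        return status;
    }

    // Claims the task: returns true only for the caller that moves NEW -> RUNNING.
    synchronized boolean tryStart() {
        if (status != ExecutionStatusSketch.NEW) {
            return false;
        }
        status = ExecutionStatusSketch.RUNNING;
        return true;
    }
}

// Hypothetical guard that checks the status, not just the presence, of an entry.
final class StatusAwareGuardSketch {
    private final List<TrackedTaskSketch> tasks = new ArrayList<>();

    // Simulates metadata initialization; restored entries always start as NEW.
    synchronized void loadFromDisk(int stealerId, int donorId) {
        tasks.add(new TrackedTaskSketch(stealerId, donorId));
    }

    // Returns true if this call claimed a NEW entry for the donor and may run it.
    synchronized boolean attemptRebalance(int donorId) {
        for (TrackedTaskSketch task : tasks) {
            if (task.donorId == donorId) {
                return task.tryStart();
            }
        }
        return false; // no entry for this donor
    }
}
```

In this sketch an entry restored after a restart is picked up by the first attemptRebalance() call, and any later call for the same donor sees RUNNING and backs off. Keeping the check and the status update in one synchronized method is a design choice of the sketch, so that two threads cannot both claim the same entry.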

The original logic is based on checking whether there is Metadata for a
rebalance (which is the case after restarting the server following an
incompletely terminated rebalance operation). This is not enough: just from
finding previous data, it assumes that "somebody" is carrying on with the
task, but this is not the case.

Original comment by john.jav...@gmail.com on 14 Jan 2011 at 11:53

GoogleCodeExporter commented 8 years ago
One of the unit tests, RebalaceStateTest, was failing because
RebalancePartitionsInfoLiveCycle.java didn't have "equals()".

Please find attached the updated class.
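
For context, a class without its own equals() falls back to Object identity, so two instances with identical contents compare as different. That is presumably what tripped the test, though the test itself is not shown here. Below is a minimal sketch of a value-style equals()/hashCode() pair; the class name and fields (LifecycleEntrySketch, stealerId, donorId, status) are hypothetical, not the attached class.

```java
import java.util.Objects;

// Hypothetical value class; the fields are illustrative, not those of the attached class.
final class LifecycleEntrySketch {
    private final int stealerId;
    private final int donorId;
    private final String status;

    LifecycleEntrySketch(int stealerId, int donorId, String status) {
        this.stealerId = stealerId;
        this.donorId = donorId;
        this.status = status;
    }

    @Override
    public boolean equals(Object other) {
        if (this == other) {
            return true;
        }
        if (!(other instanceof LifecycleEntrySketch)) {
            return false;
        }
        LifecycleEntrySketch that = (LifecycleEntrySketch) other;
        // Compare by content so that a rebuilt copy equals the original.
        return stealerId == that.stealerId
                && donorId == that.donorId
                && Objects.equals(status, that.status);
    }

    @Override
    public int hashCode() {
        // Must agree with equals(): equal objects produce equal hash codes.
        return Objects.hash(stealerId, donorId, status);
    }
}
```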

Original comment by john.jav...@gmail.com on 15 Jan 2011 at 1:04
