leo-project / leofs

The LeoFS Storage System
https://leo-project.net/leofs/
Apache License 2.0

Objects are not relocated when removing a node, executing rebalance, and then adding the same node #108

Closed: yosukehara closed this issue 10 years ago

yosukehara commented 10 years ago

Operation flow (a console sketch follows the list):

(Running Cluster)
- detach node_0@127.0.0.1
- rebalance
- attach node_0@127.0.0.1
- rebalance
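
For reference, here is the same flow as a manager console session (a sketch: it assumes the console is reached with telnet on localhost:10010, the usual default for manager_0; adjust host and port to your setup):

$ telnet localhost 10010
detach node_0@127.0.0.1
rebalance
attach node_0@127.0.0.1
rebalance
status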
mocchira commented 10 years ago

I checked whether this issue was solved on the latest develop branch; it still remains, but the result has changed. Now, executing rebalance after attach node_0@127.0.0.1 fails with the following error:

[ERROR] "Fail rebalance"

The output of status on the manager console is:

status
[System config]
                system version : 0.16.5
                total replicas : 3
           # of successes of R : 1
           # of successes of W : 2
           # of successes of D : 1
 # of DC-awareness replicas    : 0
 # of Rack-awareness replicas  : 0
                     ring size : 2^128
              ring hash (cur)  : bb7a8bfc
              ring hash (prev) : bb7a8bfc

[Node(s) state]
------------------------------------------------------------------------------------------------------
 type node                         state       ring (cur)    ring (prev)   when                        
------------------------------------------------------------------------------------------------------
 S    intel21@192.168.200.21       running     bb7a8bfc      bb7a8bfc      2013-11-29 17:35:30 +0900
 S    intel22@192.168.200.22       running     bb7a8bfc      bb7a8bfc      2013-11-29 17:35:30 +0900
 S    intel23@192.168.200.23       running     bb7a8bfc      bb7a8bfc      2013-11-29 17:35:33 +0900
 S    intel25@192.168.200.25       attached    d74cef7f      df6222db      2013-11-29 17:35:28 +0900
 S    intel26@192.168.200.26       running     bb7a8bfc      bb7a8bfc      2013-11-29 17:35:33 +0900
 G    gateway_0@192.168.200.12     running     bb7a8bfc      bb7a8bfc      2013-11-29 17:07:52 +0900

After a while, the output changed slightly:

status
[System config]
                system version : 0.16.5
                total replicas : 3
           # of successes of R : 1
           # of successes of W : 2
           # of successes of D : 1
 # of DC-awareness replicas    : 0
 # of Rack-awareness replicas  : 0
                     ring size : 2^128
              ring hash (cur)  : bb7a8bfc
              ring hash (prev) : bb7a8bfc

[Node(s) state]
------------------------------------------------------------------------------------------------------
 type node                         state       ring (cur)    ring (prev)   when                        
------------------------------------------------------------------------------------------------------
 S    intel21@192.168.200.21       running     bb7a8bfc      bb7a8bfc      2013-11-29 17:35:30 +0900
 S    intel22@192.168.200.22       running     bb7a8bfc      bb7a8bfc      2013-11-29 17:35:30 +0900
 S    intel23@192.168.200.23       running     bb7a8bfc      bb7a8bfc      2013-11-29 17:35:33 +0900
 S    intel25@192.168.200.25       attached    bb7a8bfc      bb7a8bfc      2013-11-29 17:35:40 +0900
 S    intel26@192.168.200.26       running     bb7a8bfc      bb7a8bfc      2013-11-29 17:35:33 +0900
 G    gateway_0@192.168.200.12     running     bb7a8bfc      bb7a8bfc      2013-11-29 17:07:52 +0900

The attached node's ring now seems to be synced with the others. Below is the output on remote_console:

(manager_0@192.168.200.11)14> leo_redundant_manager_api:get_members().
{ok,[{member,'intel26@192.168.200.26',"node_758266fb",
             "192.168.200.26",13075,ipv4,1385712456577253,running,168,[],
             []},
     {member,'intel25@192.168.200.25',"node_0ad1c0e3",
             "192.168.200.25",13075,ipv4,[],running,168,[],[]},
     {member,'intel23@192.168.200.23',"node_6dffdb07",
             "192.168.200.23",13075,ipv4,1385712447078001,running,168,[],
             []},
     {member,'intel22@192.168.200.22',"node_59da3d3c",
             "192.168.200.22",13075,ipv4,1385712446638246,running,168,[],
             []},
     {member,'intel21@192.168.200.21',"node_e7ac58fd",
             "192.168.200.21",13075,ipv4,1385712446172865,running,168,[],
             []}]}
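
The member tuples above can be summarized directly on remote_console. A minimal sketch, assuming the tuple layout shown (element 2 is the node name, element 8 is its state):

(manager_0@192.168.200.11)15> {ok, Members} = leo_redundant_manager_api:get_members(),
                              [{element(2, M), element(8, M)} || M <- Members].
[{'intel26@192.168.200.26',running},
 {'intel25@192.168.200.25',running},
 {'intel23@192.168.200.23',running},
 {'intel22@192.168.200.22',running},
 {'intel21@192.168.200.21',running}]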
yosukehara commented 10 years ago

The code I fixed this morning is not complete, so I'll check and correct it.

yosukehara commented 10 years ago

I've fixed this issue; see "leofs/issues/108#issuecomment-29503463".

Next, I'll thoroughly check the state of objects after rebalance with "leofs_test".

[State of the system]

status
[System config]
                system version : 0.16.5
                total replicas : 2
           # of successes of R : 1
           # of successes of W : 1
           # of successes of D : 1
 # of DC-awareness replicas    : 0
 # of Rack-awareness replicas  : 0
                     ring size : 2^128
              ring hash (cur)  : 9aeecffa
              ring hash (prev) : 73d9f4b2

[Node(s) state]
-------------------------------------------------------------------------------------------------
 type node                    state       ring (cur)    ring (prev)   when
-------------------------------------------------------------------------------------------------
 S    storage_0@127.0.0.1     running     9aeecffa      73d9f4b2      2013-11-29 09:32:34 +0000
 S    storage_1@127.0.0.1     running     9aeecffa      73d9f4b2      2013-11-29 09:33:04 +0000
 S    storage_2@127.0.0.1     running     9aeecffa      73d9f4b2      2013-11-29 09:33:02 +0000
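
To spot-check replica placement after a rebalance, the manager console's "whereis" command reports which storage node holds each replica of a given object (a sketch; "test/object_0" is a hypothetical key):

whereis test/object_0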

Thanks for your report.

yosukehara commented 10 years ago

Fixed this issue with leo_redundant_manager v1.2.4.
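
For downstream builds picking up the fix, a minimal sketch of a rebar-style dependency entry (the exact entry and tag name in leofs' own rebar.config may differ):

{deps, [
    %% pin leo_redundant_manager to the release containing the fix
    {leo_redundant_manager, ".*",
     {git, "https://github.com/leo-project/leo_redundant_manager.git", {tag, "1.2.4"}}}
]}.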