leo-project / leofs

The LeoFS Storage System
https://leo-project.net/leofs/
Apache License 2.0

Upgrade fail from 0.14.6 to latest version #160

Closed: Matsue closed this issue 10 years ago

Matsue commented 10 years ago

I tried to upgrade LeoFS from 0.14.6 to the latest version. After the upgrade, I cannot execute the suspend and resume commands for storage nodes; they respond with "[ERROR] Node not exist".

Settings overview

  [System config]
                system version : 0.14.6
                total replicas : 2
           # of successes of R : 1
           # of successes of W : 1
           # of successes of D : 1
 # of DC-awareness replicas    : 0
 # of Rack-awareness replicas  : 0
                     ring size : 2^128
              ring hash (cur)  : 0e6faed4
              ring hash (prev) : 0e6faed4

[Node(s) state]
-------------------------------------------------------------------------------------------------------
 type node                          state       ring (cur)    ring (prev)   when
-------------------------------------------------------------------------------------------------------
 S    storage_0@192.168.101.123     running     0e6faed4      0e6faed4      2014-03-28 19:38:49 +0900
 S    storage_0@192.168.101.124     running     0e6faed4      0e6faed4      2014-03-28 19:38:49 +0900
 S    storage_0@192.168.101.125     running     0e6faed4      0e6faed4      2014-03-28 19:38:48 +0900
 S    storage_0@192.168.101.126     running     0e6faed4      0e6faed4      2014-03-28 19:38:48 +0900
 G    gateway_0@192.168.101.122     running     0e6faed4      0e6faed4      2014-03-28 19:38:49 +0900

Operation logs

I followed this document: http://www.leofs.org/docs/admin_guide.html#upgrade-leofs-v0-14-9-v0-16-0-v0-16-5-to-v0-16-8-or-v1-0-0-pre3

On manager node

# /usr/local/leofs/current/leo_manager_1/bin/leo_manager stop
ok
# /usr/local/leofs/current/leo_manager_0/bin/leo_manager stop
ok
# ln -sTf /usr/local/leofs/1.0.0-pre4 /usr/local/leofs/current
# ls -l /usr/local/leofs/current
lrwxrwxrwx 1 root root 27  3月 28 21:52 2014 /usr/local/leofs/current -> /usr/local/leofs/1.0.0-pre4
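As an aside on the `-T` flag used here (GNU coreutils semantics assumed): without it, `ln -sf` would dereference the existing `current` symlink and create the new link *inside* the old version directory instead of repointing `current`. A minimal sketch in a throwaway directory (paths are illustrative, not the real install tree):

```shell
# Illustrative sketch: why -T matters when repointing a version symlink.
tmp=$(mktemp -d)
cd "$tmp"
mkdir 0.14.6 1.0.0-pre4
ln -s 0.14.6 current              # current -> 0.14.6

# Without -T, 'current' is dereferenced as a directory, so the new link
# is created *inside* it (as 0.14.6/1.0.0-pre4) and 'current' is unchanged.
ln -sf 1.0.0-pre4 current
wrong_target=$(readlink current)  # still "0.14.6"
rm 0.14.6/1.0.0-pre4              # remove the stray link

# With -T, 'current' itself is treated as the link to replace.
ln -sTf 1.0.0-pre4 current
fixed_target=$(readlink current)  # now "1.0.0-pre4"
```

On BSD systems the equivalent flag is `-h`; the effect of `-sTf` is the same one-shot repoint of the `current` link shown in the `ls -l` output above.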

# cp -aT /usr/local/leofs/{0.14.6,1.0.0-pre4}/leo_manager_0/work
# find /usr/local/leofs/1.0.0-pre4/leo_manager_0/work/ -maxdepth 2
/usr/local/leofs/1.0.0-pre4/leo_manager_0/work/
/usr/local/leofs/1.0.0-pre4/leo_manager_0/work/mnesia
/usr/local/leofs/1.0.0-pre4/leo_manager_0/work/mnesia/127.0.0.1
/usr/local/leofs/1.0.0-pre4/leo_manager_0/work/queue
/usr/local/leofs/1.0.0-pre4/leo_manager_0/work/queue/membership

# cp -aT /usr/local/leofs/{0.14.6,1.0.0-pre4}/leo_manager_1/work
# find /usr/local/leofs/1.0.0-pre4/leo_manager_1/work/ -maxdepth 2
/usr/local/leofs/1.0.0-pre4/leo_manager_1/work/
/usr/local/leofs/1.0.0-pre4/leo_manager_1/work/mnesia
/usr/local/leofs/1.0.0-pre4/leo_manager_1/work/mnesia/127.0.0.1
/usr/local/leofs/1.0.0-pre4/leo_manager_1/work/queue
/usr/local/leofs/1.0.0-pre4/leo_manager_1/work/queue/membership
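Similarly for `cp -aT` (again GNU coreutils assumed): the `-T` flag makes `cp` copy the *contents* of the source `work` directory onto the destination `work`, whereas plain `cp -a` into an existing destination would nest a second `work` directory inside it. A throwaway-directory sketch of the difference (illustrative paths only):

```shell
# Illustrative sketch: cp -aT copies the *contents* of SRC onto DST,
# while plain cp -a into an existing DST nests SRC inside it.
tmp=$(mktemp -d)
cd "$tmp"
mkdir -p old/work/mnesia old/work/queue new/work

# Without -T: old/work is copied *into* new/work -> new/work/work/...
cp -a old/work new/work
nested=$(test -d new/work/work && echo yes || echo no)   # "yes"
rm -r new/work/work

# With -T: the contents of old/work land directly in new/work,
# matching the mnesia/ and queue/ layout shown by 'find' above.
cp -aT old/work new/work
```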

# /usr/local/leofs/current/leo_manager_0/bin/leo_manager start
# /usr/local/leofs/current/leo_manager_1/bin/leo_manager start

# telnet localhost 10010

status
[System config]
                System version : 1.0.0
                    Cluster Id : leofs_1
                         DC Id : dc_1
                Total replicas : 1
           # of successes of R : 1
           # of successes of W : 1
           # of successes of D : 1
 # of DC-awareness replicas    : 0
 # of Rack-awareness replicas  : 0
                     ring size : 2^128
             Current ring hash : c4ba139d
                Prev ring hash : c4ba139d

[Node(s) state]
-------+--------------------------------+--------------+----------------+----------------+----------------------------
 type  |              node              |    state     |  current ring  |   prev ring    |          updated at
-------+--------------------------------+--------------+----------------+----------------+----------------------------
  S    | storage_0@192.168.101.123      | running      | 0e6faed4       | 0e6faed4       | 2014-03-28 19:38:49 +0900
  S    | storage_0@192.168.101.124      | running      | 0e6faed4       | 0e6faed4       | 2014-03-28 19:38:49 +0900
  S    | storage_0@192.168.101.125      | running      | 0e6faed4       | 0e6faed4       | 2014-03-28 19:38:48 +0900
  S    | storage_0@192.168.101.126      | running      | 0e6faed4       | 0e6faed4       | 2014-03-28 19:38:48 +0900
  G    | gateway_0@192.168.101.122      | running      | 0e6faed4       | 0e6faed4       | 2014-03-28 19:38:49 +0900

suspend storage_0@192.168.101.123
[ERROR] Node not exist

On storage node

# /usr/local/leofs/current/leo_storage/bin/leo_storage stop
ok
# telnet manager_node 10010

status
[System config]
                System version : 1.0.0
                    Cluster Id : leofs_1
                         DC Id : dc_1
                Total replicas : 1
           # of successes of R : 1
           # of successes of W : 1
           # of successes of D : 1
 # of DC-awareness replicas    : 0
 # of Rack-awareness replicas  : 0
                     ring size : 2^128
             Current ring hash : c4ba139d
                Prev ring hash : c4ba139d

[Node(s) state]
-------+--------------------------------+--------------+----------------+----------------+----------------------------
 type  |              node              |    state     |  current ring  |   prev ring    |          updated at
-------+--------------------------------+--------------+----------------+----------------+----------------------------
  S    | storage_0@192.168.101.123      | stop         |                |                | 2014-03-28 22:07:24 +0900
  S    | storage_0@192.168.101.124      | running      | 0e6faed4       | 0e6faed4       | 2014-03-28 19:38:49 +0900
  S    | storage_0@192.168.101.125      | running      | 0e6faed4       | 0e6faed4       | 2014-03-28 19:38:48 +0900
  S    | storage_0@192.168.101.126      | running      | 0e6faed4       | 0e6faed4       | 2014-03-28 19:38:48 +0900
  G    | gateway_0@192.168.101.122      | running      | 0e6faed4       | 0e6faed4       | 2014-03-28 19:38:49 +0900

quit

# ln -sTf /usr/local/leofs/1.0.0-pre4 /usr/local/leofs/current
# ls -l /usr/local/leofs/current
lrwxrwxrwx 1 root root 27  3月 28 22:55 2014 /usr/local/leofs/current -> /usr/local/leofs/1.0.0-pre4

# cp -aT /usr/local/leofs/{0.14.6,1.0.0-pre4}/leo_storage/work
# find /usr/local/leofs/1.0.0-pre4/leo_storage/work/ -maxdepth 2
/usr/local/leofs/1.0.0-pre4/leo_storage/work/
/usr/local/leofs/1.0.0-pre4/leo_storage/work/mnesia
/usr/local/leofs/1.0.0-pre4/leo_storage/work/queue
/usr/local/leofs/1.0.0-pre4/leo_storage/work/queue/membership
/usr/local/leofs/1.0.0-pre4/leo_storage/work/queue/5
/usr/local/leofs/1.0.0-pre4/leo_storage/work/queue/4
/usr/local/leofs/1.0.0-pre4/leo_storage/work/queue/1
/usr/local/leofs/1.0.0-pre4/leo_storage/work/queue/2
/usr/local/leofs/1.0.0-pre4/leo_storage/work/queue/3

# /usr/local/leofs/current/leo_storage/bin/leo_storage start
# telnet manager_node 10010

status
[System config]
                System version : 1.0.0
                    Cluster Id : leofs_1
                         DC Id : dc_1
                Total replicas : 1
           # of successes of R : 1
           # of successes of W : 1
           # of successes of D : 1
 # of DC-awareness replicas    : 0
 # of Rack-awareness replicas  : 0
                     ring size : 2^128
             Current ring hash : c4ba139d
                Prev ring hash : c4ba139d

[Node(s) state]
-------+--------------------------------+--------------+----------------+----------------+----------------------------
 type  |              node              |    state     |  current ring  |   prev ring    |          updated at
-------+--------------------------------+--------------+----------------+----------------+----------------------------
  S    | storage_0@192.168.101.123      | restarted    | 000000-1       | 000000-1       | 2014-03-28 22:08:59 +0900
  S    | storage_0@192.168.101.124      | running      | 0e6faed4       | 0e6faed4       | 2014-03-28 19:38:49 +0900
  S    | storage_0@192.168.101.125      | running      | 0e6faed4       | 0e6faed4       | 2014-03-28 19:38:48 +0900
  S    | storage_0@192.168.101.126      | running      | 0e6faed4       | 0e6faed4       | 2014-03-28 19:38:48 +0900
  G    | gateway_0@192.168.101.122      | running      | 0e6faed4       | 0e6faed4       | 2014-03-28 19:38:49 +0900

resume storage_0@192.168.101.123
[ERROR] Node not exist

Error logs

At manager_0

...snip...
[E]     manager_0@192.168.101.121       2014-03-28 22:05:53.667124 +0900        139611953       leo_redundant_manager_api:get_redundancies_by_addr_id_1/4       516     "Could not retrieve redundancies"
[E]     manager_0@192.168.101.121       2014-03-28 22:06:04.89115 +0900 139611964       leo_redundant_manager_api:get_redundancies_by_addr_id_1/4       516     "Could not retrieve redundancies"

At manager_1

[E]     manager_1@192.168.101.121       2014-03-28 22:12:40.425586 +0900        139612360       leo_manager_cluster_monitor:register_fun_1/2    438     cause:{aborted,{no_exists,leo_gateway_nodes}}
[E]     manager_1@192.168.101.121       2014-03-28 22:12:42.394524 +0900        139612362       leo_ring_tbl_transformer:migrate_ring/2 109     {aborted,{no_exists,{leo_members_cur,disc_copies}}}
[E]     manager_1@192.168.101.121       2014-03-28 22:12:54.39485 +0900 139612374       leo_ring_tbl_transformer:migrate_ring/2 109     {aborted,{no_exists,{leo_members_cur,disc_copies}}}

At gateway_0

[W]     gateway_0@192.168.101.122       2014-03-28 22:00:14.605102 +0900        139611614       leo_gateway_api:register_in_monitor/3   146     manager:'manager_0@192.168.101.121', cause:timeout
[W]     gateway_0@192.168.101.122       2014-03-28 22:00:19.606119 +0900        139611619       leo_gateway_api:register_in_monitor/3   146     manager:'manager_1@192.168.101.121', cause:timeout
[W]     gateway_0@192.168.101.122       2014-03-28 22:00:34.610075 +0900        139611634       leo_gateway_api:register_in_monitor/3   146     manager:'manager_0@192.168.101.121', cause:timeout
[W]     gateway_0@192.168.101.122       2014-03-28 22:00:39.611114 +0900        139611639       leo_gateway_api:register_in_monitor/3   146     manager:'manager_1@192.168.101.121', cause:timeout
[W]     gateway_0@192.168.101.122       2014-03-28 22:00:54.616080 +0900        139611654       leo_gateway_api:register_in_monitor/3   146     manager:'manager_0@192.168.101.121', cause:timeout
[W]     gateway_0@192.168.101.122       2014-03-28 22:00:59.617103 +0900        139611659       leo_gateway_api:register_in_monitor/3   146     manager:'manager_1@192.168.101.121', cause:timeout
[W]     gateway_0@192.168.101.122       2014-03-28 22:01:14.621074 +0900        139611674       leo_gateway_api:register_in_monitor/3   146     manager:'manager_0@192.168.101.121', cause:timeout
[W]     gateway_0@192.168.101.122       2014-03-28 22:01:19.622110 +0900        139611679       leo_gateway_api:register_in_monitor/3   146     manager:'manager_1@192.168.101.121', cause:timeout
[E]     gateway_0@192.168.101.122       2014-03-28 22:08:09.863765 +0900        139612089       leo_membership:compare_with_remote_chksum/3     393     {'storage_0@192.168.101.123',nodedown}
[E]     gateway_0@192.168.101.122       2014-03-28 22:08:29.871564 +0900        139612109       leo_membership:compare_with_remote_chksum/3     393     {'storage_0@192.168.101.123',nodedown}
[E]     gateway_0@192.168.101.122       2014-03-28 22:08:59.898321 +0900        139612139       leo_membership:notify_error_to_manager/3        418     {'manager_0@192.168.101.121',{error,"Could not get member"}}

...snip...
yosukehara commented 10 years ago

Thank you for your report. We'll check this issue next Monday.

Matsue commented 10 years ago

I forgot to change the replication-number setting in the configuration file last time. I tried the same test with the fixed configuration file, but it shows the same responses.

Status before upgrade

status
[System config]
                system version : 0.14.6
                total replicas : 2
           # of successes of R : 1
           # of successes of W : 1
           # of successes of D : 1
 # of DC-awareness replicas    : 0
 # of Rack-awareness replicas  : 0
                     ring size : 2^128
              ring hash (cur)  : 831f612a
              ring hash (prev) : 831f612a

[Node(s) state]
-------------------------------------------------------------------------------------------------------
 type node                          state       ring (cur)    ring (prev)   when
-------------------------------------------------------------------------------------------------------
 S    storage_0@192.168.101.123     running     831f612a      831f612a      2014-03-31 14:46:10 +0900
 S    storage_0@192.168.101.124     running     831f612a      831f612a      2014-03-31 14:46:10 +0900
 S    storage_0@192.168.101.125     running     831f612a      831f612a      2014-03-31 14:46:10 +0900
 S    storage_0@192.168.101.126     running     831f612a      831f612a      2014-03-31 14:46:10 +0900
 G    gateway_0@192.168.101.122     running     831f612a      831f612a      2014-03-31 14:46:10 +0900

Status after upgrade managers

status
[System config]
                System version : 1.0.0
                    Cluster Id : leofs_1
                         DC Id : dc_1
                Total replicas : 2
           # of successes of R : 1
           # of successes of W : 1
           # of successes of D : 1
 # of DC-awareness replicas    : 0
 # of Rack-awareness replicas  : 0
                     ring size : 2^128
             Current ring hash : 1503900f
                Prev ring hash : 1503900f

[Node(s) state]
-------+--------------------------------+--------------+----------------+----------------+----------------------------
 type  |              node              |    state     |  current ring  |   prev ring    |          updated at
-------+--------------------------------+--------------+----------------+----------------+----------------------------
  S    | storage_0@192.168.101.123      | running      | 831f612a       | 831f612a       | 2014-03-31 14:46:10 +0900
  S    | storage_0@192.168.101.124      | running      | 831f612a       | 831f612a       | 2014-03-31 14:46:10 +0900
  S    | storage_0@192.168.101.125      | running      | 831f612a       | 831f612a       | 2014-03-31 14:46:10 +0900
  S    | storage_0@192.168.101.126      | running      | 831f612a       | 831f612a       | 2014-03-31 14:46:10 +0900
  G    | gateway_0@192.168.101.122      | running      | 831f612a       | 831f612a       | 2014-03-31 14:46:10 +0900

status storage_0@192.168.101.123
[config]
            version : 0.14.4
        # of vnodes : 168
      group level-1 :
      group level-2 :
      obj-container : [[{path,"/leofs"},{num_of_containers,8}]]
            log dir : /usr/local/leofs/current/leo_storage/log

[status-1: ring]
  ring state (cur)  : 831f612a
  ring state (prev) : 831f612a

[status-2: erlang-vm]
         vm version : 5.9.3.1
    total mem usage : 26146288
   system mem usage : 11515568
    procs mem usage : 14616520
      ets mem usage : 775352
              procs : 231/1048576
        kernel_poll : true
   thread_pool_size : 32

[status-3: # of msgs]
   replication msgs : 0
    vnode-sync msgs : 0
     rebalance msgs : 0

suspend storage_0@192.168.101.123
[ERROR] Node not exist

Status after upgrade storage_0

status
[System config]
                System version : 1.0.0
                    Cluster Id : leofs_1
                         DC Id : dc_1
                Total replicas : 2
           # of successes of R : 1
           # of successes of W : 1
           # of successes of D : 1
 # of DC-awareness replicas    : 0
 # of Rack-awareness replicas  : 0
                     ring size : 2^128
             Current ring hash : 1503900f
                Prev ring hash : 1503900f

[Node(s) state]
-------+--------------------------------+--------------+----------------+----------------+----------------------------
 type  |              node              |    state     |  current ring  |   prev ring    |          updated at
-------+--------------------------------+--------------+----------------+----------------+----------------------------
  S    | storage_0@192.168.101.123      | restarted    | 000000-1       | 000000-1       | 2014-03-31 14:57:43 +0900
  S    | storage_0@192.168.101.124      | running      | 831f612a       | 831f612a       | 2014-03-31 14:46:10 +0900
  S    | storage_0@192.168.101.125      | running      | 831f612a       | 831f612a       | 2014-03-31 14:46:10 +0900
  S    | storage_0@192.168.101.126      | running      | 831f612a       | 831f612a       | 2014-03-31 14:46:10 +0900
  G    | gateway_0@192.168.101.122      | running      | 831f612a       | 831f612a       | 2014-03-31 14:46:10 +0900

status storage_0@192.168.101.123
[config]
            version : 1.0.0-pre3
        # of vnodes : 168
      group level-1 :
      group level-2 :
      obj-container : [[{path,"/leofs"},{num_of_containers,8}]]
            log dir : /usr/local/leofs/current/leo_storage/log/erlang

[status-1: ring]
  ring state (cur)  : 000000-1
  ring state (prev) : 000000-1

[status-2: erlang-vm]
         vm version : 5.9.3.1
    total mem usage : 55757880
   system mem usage : 45413488
    procs mem usage : 10365256
      ets mem usage : 4875992
              procs : 294/1048576
        kernel_poll : true
   thread_pool_size : 32

[status-3: # of msgs]
   replication msgs : 0
    vnode-sync msgs : 0
     rebalance msgs : 0

resume storage_0@192.168.101.123
[ERROR] Node not exist
yosukehara commented 10 years ago

Sharing my operation log: