leo-project / leofs

The LeoFS Storage System
https://leo-project.net/leofs/
Apache License 2.0

[leo_manager][leo_storage] Inconsistent RING can happen when rebalancing the cluster #1078

Open mocchira opened 6 years ago

mocchira commented 6 years ago

Through the investigation of https://github.com/leo-project/leofs/issues/1076, I've found a way to reproduce this problem.

In short, joining a node whose node name was previously used by another node that existed in the cluster can cause this RING inconsistency. For example, the procedure below reproduces the problem.

Prerequisites

 [System Confiuration]
-----------------------------------+----------
 Item                              | Value
-----------------------------------+----------
 Basic/Consistency level
-----------------------------------+----------
                    system version | 1.3.8
                        cluster Id | leofs_1
                             DC Id | dc_1
                    Total replicas | 2
          number of successes of R | 1
          number of successes of W | 1
          number of successes of D | 1
 number of rack-awareness replicas | 0
                         ring size | 2^128
-----------------------------------+----------
 Multi DC replication settings
-----------------------------------+----------
 [mdcr] max number of joinable DCs | 2
 [mdcr] total replicas per a DC    | 1
 [mdcr] number of successes of R   | 1
 [mdcr] number of successes of W   | 1
 [mdcr] number of successes of D   | 1
-----------------------------------+----------
 Manager RING hash
-----------------------------------+----------
                 current ring-hash | fa2ce41b
                previous ring-hash | fa2ce41b
-----------------------------------+----------

 [State of Node(s)]
-------+--------------------------+--------------+----------------+----------------+----------------------------
 type  |           node           |    state     |  current ring  |   prev ring    |          updated at
-------+--------------------------+--------------+----------------+----------------+----------------------------
  S    | storage_1@127.0.0.1      | running      | fa2ce41b       | fa2ce41b       | 2018-07-19 14:49:38 +0900
  S    | storage_2@127.0.0.1      | running      | fa2ce41b       | fa2ce41b       | 2018-07-19 14:49:38 +0900
  S    | storage_3@127.0.0.1      | running      | fa2ce41b       | fa2ce41b       | 2018-07-19 14:49:38 +0900
  S    | storage_4@127.0.0.1      | running      | fa2ce41b       | fa2ce41b       | 2018-07-19 15:06:36 +0900
  G    | gateway_0@127.0.0.1      | running      | fa2ce41b       | fa2ce41b       | 2018-07-19 14:49:39 +0900
-------+--------------------------+--------------+----------------+----------------+----------------------------

Steps to reproduce

$ ./leofs-adm detach storage_1@127.0.0.1
$ ./package/leo_storage_5/bin/leo_storage start ## <- the new node taking over storage_1
$ ./package/leo_storage_1/bin/leo_storage stop
$ ./leofs-adm rebalance ## this will succeed as expected and we will have storage_[2-5]@127.0.0.1 here
$ watch ./leofs-adm mq-stats storage_2@127.0.0.1 ## wait for the rebalance to finish
$ ./leofs-adm detach storage_2@127.0.0.1
$ ./package/leo_storage_8/bin/leo_storage start ## <- the new node taking over storage_2; its name is set to **storage_1@127.0.0.1**, which was previously used by a node that existed in the cluster
$ ./leofs-adm rebalance ## <- this causes the inconsistent RING, and some storage nodes start to dump error logs from here

The result

 [System Confiuration]
-----------------------------------+----------
 Item                              | Value
-----------------------------------+----------
 Basic/Consistency level
-----------------------------------+----------
                    system version | 1.3.8
                        cluster Id | leofs_1
                             DC Id | dc_1
                    Total replicas | 2
          number of successes of R | 1
          number of successes of W | 1
          number of successes of D | 1
 number of rack-awareness replicas | 0
                         ring size | 2^128
-----------------------------------+----------
 Multi DC replication settings
-----------------------------------+----------
 [mdcr] max number of joinable DCs | 2
 [mdcr] total replicas per a DC    | 1
 [mdcr] number of successes of R   | 1
 [mdcr] number of successes of W   | 1
 [mdcr] number of successes of D   | 1
-----------------------------------+----------
 Manager RING hash
-----------------------------------+----------
                 current ring-hash | 95248ae9
                previous ring-hash | c78a7bf5
-----------------------------------+----------

 [State of Node(s)]
-------+--------------------------+--------------+----------------+----------------+----------------------------
 type  |           node           |    state     |  current ring  |   prev ring    |          updated at
-------+--------------------------+--------------+----------------+----------------+----------------------------
  S    | storage_1@127.0.0.1      | running      | bfdb5bf3       | bfdb5bf3       | 2018-07-19 15:29:54 +0900
  S    | storage_3@127.0.0.1      | running      | bfdb5bf3       | bfdb5bf3       | 2018-07-19 14:49:38 +0900
  S    | storage_4@127.0.0.1      | running      | bfdb5bf3       | bfdb5bf3       | 2018-07-19 15:06:36 +0900
  S    | storage_5@127.0.0.1      | running      | bfdb5bf3       | bfdb5bf3       | 2018-07-19 15:20:22 +0900
  G    | gateway_0@127.0.0.1      | running      | bfdb5bf3       | bfdb5bf3       | 2018-07-19 14:49:39 +0900
-------+--------------------------+--------------+----------------+----------------+----------------------------

As you can see, the Manager RING hash differs from the one on the storage nodes.

mocchira commented 6 years ago

I'd like to summarize what happens when hitting this issue.

What happens

That being said, all we have to do is

TODO

In order to

The RING hash displayed by leofs-adm status will be changed to the same value as the one on the storage nodes.

vstax commented 6 years ago

@mocchira Thanks! The RING is in order now; the cur/prev versions on the manager are now the same as the ones on the nodes.

node01 still refuses to start (with del_dir removed). Maybe the full logs will help. Info:

[I] node01@10.3.15.101  2018-07-20 14:53:23.945866 +0300    1532087603  null:null   0   Application leo_ordning_reda started on node 'node01@10.3.15.101'
[I] node01@10.3.15.101  2018-07-20 14:53:24.3478 +0300  1532087604  null:null   0   {module,"leo_mq_consumer"},{function,"init/1"},{line,171},{body,[{id,mq_persistent_node_consumer_3_1},{publisher_id,mq_persistent_node},{interval,10},{batch_of_msgs,5000}]}
[I] node01@10.3.15.101  2018-07-20 14:53:24.4911 +0300  1532087604  null:null   0   {module,"leo_mq_consumer"},{function,"init/1"},{line,171},{body,[{id,mq_persistent_node_consumer_2_1},{publisher_id,mq_persistent_node},{interval,10},{batch_of_msgs,5000}]}
[I] node01@10.3.15.101  2018-07-20 14:53:24.5391 +0300  1532087604  null:null   0   {module,"leo_mq_consumer"},{function,"init/1"},{line,171},{body,[{id,mq_persistent_node_consumer_1_1},{publisher_id,mq_persistent_node},{interval,10},{batch_of_msgs,5000}]}
[I] node01@10.3.15.101  2018-07-20 14:53:24.43473 +0300 1532087604  null:null   0   {module,"leo_mq_consumer"},{function,"init/1"},{line,171},{body,[{id,leo_per_object_queue_consumer_4_1},{publisher_id,leo_per_object_queue},{interval,500},{batch_of_msgs,1600}]}
[I] node01@10.3.15.101  2018-07-20 14:53:24.43889 +0300 1532087604  null:null   0   {module,"leo_mq_consumer"},{function,"init/1"},{line,171},{body,[{id,leo_per_object_queue_consumer_3_1},{publisher_id,leo_per_object_queue},{interval,500},{batch_of_msgs,1600}]}
[I] node01@10.3.15.101  2018-07-20 14:53:24.44350 +0300 1532087604  null:null   0   {module,"leo_mq_consumer"},{function,"init/1"},{line,171},{body,[{id,leo_per_object_queue_consumer_2_1},{publisher_id,leo_per_object_queue},{interval,500},{batch_of_msgs,1600}]}
[I] node01@10.3.15.101  2018-07-20 14:53:24.44771 +0300 1532087604  null:null   0   {module,"leo_mq_consumer"},{function,"init/1"},{line,171},{body,[{id,leo_per_object_queue_consumer_1_1},{publisher_id,leo_per_object_queue},{interval,500},{batch_of_msgs,1600}]}
[I] node01@10.3.15.101  2018-07-20 14:53:24.79750 +0300 1532087604  null:null   0   {module,"leo_mq_consumer"},{function,"init/1"},{line,171},{body,[{id,leo_sync_by_vnode_id_queue_consumer_4_1},{publisher_id,leo_sync_by_vnode_id_queue},{interval,500},{batch_of_msgs,1600}]}
[I] node01@10.3.15.101  2018-07-20 14:53:24.80153 +0300 1532087604  null:null   0   {module,"leo_mq_consumer"},{function,"init/1"},{line,171},{body,[{id,leo_sync_by_vnode_id_queue_consumer_3_1},{publisher_id,leo_sync_by_vnode_id_queue},{interval,500},{batch_of_msgs,1600}]}
[I] node01@10.3.15.101  2018-07-20 14:53:24.80625 +0300 1532087604  null:null   0   {module,"leo_mq_consumer"},{function,"init/1"},{line,171},{body,[{id,leo_sync_by_vnode_id_queue_consumer_2_1},{publisher_id,leo_sync_by_vnode_id_queue},{interval,500},{batch_of_msgs,1600}]}
[I] node01@10.3.15.101  2018-07-20 14:53:24.81014 +0300 1532087604  null:null   0   {module,"leo_mq_consumer"},{function,"init/1"},{line,171},{body,[{id,leo_sync_by_vnode_id_queue_consumer_1_1},{publisher_id,leo_sync_by_vnode_id_queue},{interval,500},{batch_of_msgs,1600}]}
[I] node01@10.3.15.101  2018-07-20 14:53:24.117022 +0300    1532087604  null:null   0   {module,"leo_mq_consumer"},{function,"init/1"},{line,171},{body,[{id,leo_rebalance_queue_consumer_4_1},{publisher_id,leo_rebalance_queue},{interval,500},{batch_of_msgs,1600}]}
[I] node01@10.3.15.101  2018-07-20 14:53:24.117530 +0300    1532087604  null:null   0   {module,"leo_mq_consumer"},{function,"init/1"},{line,171},{body,[{id,leo_rebalance_queue_consumer_3_1},{publisher_id,leo_rebalance_queue},{interval,500},{batch_of_msgs,1600}]}
[I] node01@10.3.15.101  2018-07-20 14:53:24.118142 +0300    1532087604  null:null   0   {module,"leo_mq_consumer"},{function,"init/1"},{line,171},{body,[{id,leo_rebalance_queue_consumer_2_1},{publisher_id,leo_rebalance_queue},{interval,500},{batch_of_msgs,1600}]}
[I] node01@10.3.15.101  2018-07-20 14:53:24.118660 +0300    1532087604  null:null   0   {module,"leo_mq_consumer"},{function,"init/1"},{line,171},{body,[{id,leo_rebalance_queue_consumer_1_1},{publisher_id,leo_rebalance_queue},{interval,500},{batch_of_msgs,1600}]}
[I] node01@10.3.15.101  2018-07-20 14:53:24.154766 +0300    1532087604  null:null   0   {module,"leo_mq_consumer"},{function,"init/1"},{line,171},{body,[{id,leo_async_deletion_queue_consumer_4_1},{publisher_id,leo_async_deletion_queue},{interval,500},{batch_of_msgs,1600}]}
[I] node01@10.3.15.101  2018-07-20 14:53:24.155529 +0300    1532087604  null:null   0   {module,"leo_mq_consumer"},{function,"init/1"},{line,171},{body,[{id,leo_async_deletion_queue_consumer_3_1},{publisher_id,leo_async_deletion_queue},{interval,500},{batch_of_msgs,1600}]}
[I] node01@10.3.15.101  2018-07-20 14:53:24.156228 +0300    1532087604  null:null   0   {module,"leo_mq_consumer"},{function,"init/1"},{line,171},{body,[{id,leo_async_deletion_queue_consumer_2_1},{publisher_id,leo_async_deletion_queue},{interval,500},{batch_of_msgs,1600}]}
[I] node01@10.3.15.101  2018-07-20 14:53:24.156930 +0300    1532087604  null:null   0   {module,"leo_mq_consumer"},{function,"init/1"},{line,171},{body,[{id,leo_async_deletion_queue_consumer_1_1},{publisher_id,leo_async_deletion_queue},{interval,500},{batch_of_msgs,1600}]}
[I] node01@10.3.15.101  2018-07-20 14:53:24.193906 +0300    1532087604  null:null   0   {module,"leo_mq_consumer"},{function,"init/1"},{line,171},{body,[{id,leo_recovery_node_queue_consumer_4_1},{publisher_id,leo_recovery_node_queue},{interval,500},{batch_of_msgs,1600}]}
[I] node01@10.3.15.101  2018-07-20 14:53:24.194728 +0300    1532087604  null:null   0   {module,"leo_mq_consumer"},{function,"init/1"},{line,171},{body,[{id,leo_recovery_node_queue_consumer_3_1},{publisher_id,leo_recovery_node_queue},{interval,500},{batch_of_msgs,1600}]}
[I] node01@10.3.15.101  2018-07-20 14:53:24.195499 +0300    1532087604  null:null   0   {module,"leo_mq_consumer"},{function,"init/1"},{line,171},{body,[{id,leo_recovery_node_queue_consumer_2_1},{publisher_id,leo_recovery_node_queue},{interval,500},{batch_of_msgs,1600}]}
[I] node01@10.3.15.101  2018-07-20 14:53:24.196282 +0300    1532087604  null:null   0   {module,"leo_mq_consumer"},{function,"init/1"},{line,171},{body,[{id,leo_recovery_node_queue_consumer_1_1},{publisher_id,leo_recovery_node_queue},{interval,500},{batch_of_msgs,1600}]}
[I] node01@10.3.15.101  2018-07-20 14:53:24.232935 +0300    1532087604  null:null   0   {module,"leo_mq_consumer"},{function,"init/1"},{line,171},{body,[{id,leo_sync_obj_with_dc_queue_consumer_4_1},{publisher_id,leo_sync_obj_with_dc_queue},{interval,500},{batch_of_msgs,1600}]}
[I] node01@10.3.15.101  2018-07-20 14:53:24.233833 +0300    1532087604  null:null   0   {module,"leo_mq_consumer"},{function,"init/1"},{line,171},{body,[{id,leo_sync_obj_with_dc_queue_consumer_3_1},{publisher_id,leo_sync_obj_with_dc_queue},{interval,500},{batch_of_msgs,1600}]}
[I] node01@10.3.15.101  2018-07-20 14:53:24.234682 +0300    1532087604  null:null   0   {module,"leo_mq_consumer"},{function,"init/1"},{line,171},{body,[{id,leo_sync_obj_with_dc_queue_consumer_2_1},{publisher_id,leo_sync_obj_with_dc_queue},{interval,500},{batch_of_msgs,1600}]}
[I] node01@10.3.15.101  2018-07-20 14:53:24.235525 +0300    1532087604  null:null   0   {module,"leo_mq_consumer"},{function,"init/1"},{line,171},{body,[{id,leo_sync_obj_with_dc_queue_consumer_1_1},{publisher_id,leo_sync_obj_with_dc_queue},{interval,500},{batch_of_msgs,1600}]}
[I] node01@10.3.15.101  2018-07-20 14:53:24.273582 +0300    1532087604  null:null   0   {module,"leo_mq_consumer"},{function,"init/1"},{line,171},{body,[{id,leo_comp_meta_with_dc_queue_consumer_4_1},{publisher_id,leo_comp_meta_with_dc_queue},{interval,500},{batch_of_msgs,1600}]}
[I] node01@10.3.15.101  2018-07-20 14:53:24.274794 +0300    1532087604  null:null   0   {module,"leo_mq_consumer"},{function,"init/1"},{line,171},{body,[{id,leo_comp_meta_with_dc_queue_consumer_3_1},{publisher_id,leo_comp_meta_with_dc_queue},{interval,500},{batch_of_msgs,1600}]}
[I] node01@10.3.15.101  2018-07-20 14:53:24.275932 +0300    1532087604  null:null   0   {module,"leo_mq_consumer"},{function,"init/1"},{line,171},{body,[{id,leo_comp_meta_with_dc_queue_consumer_2_1},{publisher_id,leo_comp_meta_with_dc_queue},{interval,500},{batch_of_msgs,1600}]}
[I] node01@10.3.15.101  2018-07-20 14:53:24.276831 +0300    1532087604  null:null   0   {module,"leo_mq_consumer"},{function,"init/1"},{line,171},{body,[{id,leo_comp_meta_with_dc_queue_consumer_1_1},{publisher_id,leo_comp_meta_with_dc_queue},{interval,500},{batch_of_msgs,1600}]}
[I] node01@10.3.15.101  2018-07-20 14:53:24.314815 +0300    1532087604  null:null   0   {module,"leo_mq_consumer"},{function,"init/1"},{line,171},{body,[{id,leo_req_delete_dir_queue_consumer_4_1},{publisher_id,leo_req_delete_dir_queue},{interval,500},{batch_of_msgs,1600}]}
[I] node01@10.3.15.101  2018-07-20 14:53:24.315622 +0300    1532087604  null:null   0   {module,"leo_mq_consumer"},{function,"init/1"},{line,171},{body,[{id,leo_req_delete_dir_queue_consumer_3_1},{publisher_id,leo_req_delete_dir_queue},{interval,500},{batch_of_msgs,1600}]}
[I] node01@10.3.15.101  2018-07-20 14:53:24.316345 +0300    1532087604  null:null   0   {module,"leo_mq_consumer"},{function,"init/1"},{line,171},{body,[{id,leo_req_delete_dir_queue_consumer_2_1},{publisher_id,leo_req_delete_dir_queue},{interval,500},{batch_of_msgs,1600}]}
[I] node01@10.3.15.101  2018-07-20 14:53:24.317040 +0300    1532087604  null:null   0   {module,"leo_mq_consumer"},{function,"init/1"},{line,171},{body,[{id,leo_req_delete_dir_queue_consumer_1_1},{publisher_id,leo_req_delete_dir_queue},{interval,500},{batch_of_msgs,1600}]}
[I] node01@10.3.15.101  2018-07-20 14:53:24.353031 +0300    1532087604  null:null   0   {module,"leo_mq_consumer"},{function,"init/1"},{line,171},{body,[{id,leo_delete_dir_queue_1_consumer_4_1},{publisher_id,leo_delete_dir_queue_1},{interval,500},{batch_of_msgs,1600}]}
[I] node01@10.3.15.101  2018-07-20 14:53:24.353718 +0300    1532087604  null:null   0   {module,"leo_mq_consumer"},{function,"init/1"},{line,171},{body,[{id,leo_delete_dir_queue_1_consumer_3_1},{publisher_id,leo_delete_dir_queue_1},{interval,500},{batch_of_msgs,1600}]}
[I] node01@10.3.15.101  2018-07-20 14:53:24.354382 +0300    1532087604  null:null   0   {module,"leo_mq_consumer"},{function,"init/1"},{line,171},{body,[{id,leo_delete_dir_queue_1_consumer_2_1},{publisher_id,leo_delete_dir_queue_1},{interval,500},{batch_of_msgs,1600}]}
[I] node01@10.3.15.101  2018-07-20 14:53:24.355080 +0300    1532087604  null:null   0   {module,"leo_mq_consumer"},{function,"init/1"},{line,171},{body,[{id,leo_delete_dir_queue_1_consumer_1_1},{publisher_id,leo_delete_dir_queue_1},{interval,500},{batch_of_msgs,1600}]}
[I] node01@10.3.15.101  2018-07-20 14:53:24.392008 +0300    1532087604  null:null   0   {module,"leo_mq_consumer"},{function,"init/1"},{line,171},{body,[{id,leo_delete_dir_queue_2_consumer_4_1},{publisher_id,leo_delete_dir_queue_2},{interval,500},{batch_of_msgs,1600}]}
[I] node01@10.3.15.101  2018-07-20 14:53:24.392689 +0300    1532087604  null:null   0   {module,"leo_mq_consumer"},{function,"init/1"},{line,171},{body,[{id,leo_delete_dir_queue_2_consumer_3_1},{publisher_id,leo_delete_dir_queue_2},{interval,500},{batch_of_msgs,1600}]}
[I] node01@10.3.15.101  2018-07-20 14:53:24.393406 +0300    1532087604  null:null   0   {module,"leo_mq_consumer"},{function,"init/1"},{line,171},{body,[{id,leo_delete_dir_queue_2_consumer_2_1},{publisher_id,leo_delete_dir_queue_2},{interval,500},{batch_of_msgs,1600}]}
[I] node01@10.3.15.101  2018-07-20 14:53:24.394080 +0300    1532087604  null:null   0   {module,"leo_mq_consumer"},{function,"init/1"},{line,171},{body,[{id,leo_delete_dir_queue_2_consumer_1_1},{publisher_id,leo_delete_dir_queue_2},{interval,500},{batch_of_msgs,1600}]}
[I] node01@10.3.15.101  2018-07-20 14:53:24.440731 +0300    1532087604  null:null   0   {module,"leo_mq_consumer"},{function,"init/1"},{line,171},{body,[{id,leo_delete_dir_queue_3_consumer_4_1},{publisher_id,leo_delete_dir_queue_3},{interval,500},{batch_of_msgs,1600}]}
[I] node01@10.3.15.101  2018-07-20 14:53:24.441412 +0300    1532087604  null:null   0   {module,"leo_mq_consumer"},{function,"init/1"},{line,171},{body,[{id,leo_delete_dir_queue_3_consumer_3_1},{publisher_id,leo_delete_dir_queue_3},{interval,500},{batch_of_msgs,1600}]}
[I] node01@10.3.15.101  2018-07-20 14:53:24.442032 +0300    1532087604  null:null   0   {module,"leo_mq_consumer"},{function,"init/1"},{line,171},{body,[{id,leo_delete_dir_queue_3_consumer_2_1},{publisher_id,leo_delete_dir_queue_3},{interval,500},{batch_of_msgs,1600}]}
[I] node01@10.3.15.101  2018-07-20 14:53:24.442693 +0300    1532087604  null:null   0   {module,"leo_mq_consumer"},{function,"init/1"},{line,171},{body,[{id,leo_delete_dir_queue_3_consumer_1_1},{publisher_id,leo_delete_dir_queue_3},{interval,500},{batch_of_msgs,1600}]}
[I] node01@10.3.15.101  2018-07-20 14:53:24.477428 +0300    1532087604  null:null   0   {module,"leo_mq_consumer"},{function,"init/1"},{line,171},{body,[{id,leo_delete_dir_queue_4_consumer_4_1},{publisher_id,leo_delete_dir_queue_4},{interval,500},{batch_of_msgs,1600}]}
[I] node01@10.3.15.101  2018-07-20 14:53:24.478075 +0300    1532087604  null:null   0   {module,"leo_mq_consumer"},{function,"init/1"},{line,171},{body,[{id,leo_delete_dir_queue_4_consumer_3_1},{publisher_id,leo_delete_dir_queue_4},{interval,500},{batch_of_msgs,1600}]}
[I] node01@10.3.15.101  2018-07-20 14:53:24.478706 +0300    1532087604  null:null   0   {module,"leo_mq_consumer"},{function,"init/1"},{line,171},{body,[{id,leo_delete_dir_queue_4_consumer_2_1},{publisher_id,leo_delete_dir_queue_4},{interval,500},{batch_of_msgs,1600}]}
[I] node01@10.3.15.101  2018-07-20 14:53:24.479369 +0300    1532087604  null:null   0   {module,"leo_mq_consumer"},{function,"init/1"},{line,171},{body,[{id,leo_delete_dir_queue_4_consumer_1_1},{publisher_id,leo_delete_dir_queue_4},{interval,500},{batch_of_msgs,1600}]}
[I] node01@10.3.15.101  2018-07-20 14:53:24.515991 +0300    1532087604  null:null   0   {module,"leo_mq_consumer"},{function,"init/1"},{line,171},{body,[{id,leo_delete_dir_queue_5_consumer_4_1},{publisher_id,leo_delete_dir_queue_5},{interval,500},{batch_of_msgs,1600}]}
[I] node01@10.3.15.101  2018-07-20 14:53:24.516676 +0300    1532087604  null:null   0   {module,"leo_mq_consumer"},{function,"init/1"},{line,171},{body,[{id,leo_delete_dir_queue_5_consumer_3_1},{publisher_id,leo_delete_dir_queue_5},{interval,500},{batch_of_msgs,1600}]}
[I] node01@10.3.15.101  2018-07-20 14:53:24.517335 +0300    1532087604  null:null   0   {module,"leo_mq_consumer"},{function,"init/1"},{line,171},{body,[{id,leo_delete_dir_queue_5_consumer_2_1},{publisher_id,leo_delete_dir_queue_5},{interval,500},{batch_of_msgs,1600}]}
[I] node01@10.3.15.101  2018-07-20 14:53:24.517976 +0300    1532087604  null:null   0   {module,"leo_mq_consumer"},{function,"init/1"},{line,171},{body,[{id,leo_delete_dir_queue_5_consumer_1_1},{publisher_id,leo_delete_dir_queue_5},{interval,500},{batch_of_msgs,1600}]}
[I] node01@10.3.15.101  2018-07-20 14:53:24.554507 +0300    1532087604  null:null   0   {module,"leo_mq_consumer"},{function,"init/1"},{line,171},{body,[{id,leo_delete_dir_queue_6_consumer_4_1},{publisher_id,leo_delete_dir_queue_6},{interval,500},{batch_of_msgs,1600}]}
[I] node01@10.3.15.101  2018-07-20 14:53:24.555228 +0300    1532087604  null:null   0   {module,"leo_mq_consumer"},{function,"init/1"},{line,171},{body,[{id,leo_delete_dir_queue_6_consumer_3_1},{publisher_id,leo_delete_dir_queue_6},{interval,500},{batch_of_msgs,1600}]}
[I] node01@10.3.15.101  2018-07-20 14:53:24.555903 +0300    1532087604  null:null   0   {module,"leo_mq_consumer"},{function,"init/1"},{line,171},{body,[{id,leo_delete_dir_queue_6_consumer_2_1},{publisher_id,leo_delete_dir_queue_6},{interval,500},{batch_of_msgs,1600}]}
[I] node01@10.3.15.101  2018-07-20 14:53:24.556622 +0300    1532087604  null:null   0   {module,"leo_mq_consumer"},{function,"init/1"},{line,171},{body,[{id,leo_delete_dir_queue_6_consumer_1_1},{publisher_id,leo_delete_dir_queue_6},{interval,500},{batch_of_msgs,1600}]}
[I] node01@10.3.15.101  2018-07-20 14:53:24.593033 +0300    1532087604  null:null   0   {module,"leo_mq_consumer"},{function,"init/1"},{line,171},{body,[{id,leo_delete_dir_queue_7_consumer_4_1},{publisher_id,leo_delete_dir_queue_7},{interval,500},{batch_of_msgs,1600}]}
[I] node01@10.3.15.101  2018-07-20 14:53:24.593744 +0300    1532087604  null:null   0   {module,"leo_mq_consumer"},{function,"init/1"},{line,171},{body,[{id,leo_delete_dir_queue_7_consumer_3_1},{publisher_id,leo_delete_dir_queue_7},{interval,500},{batch_of_msgs,1600}]}
[I] node01@10.3.15.101  2018-07-20 14:53:24.594403 +0300    1532087604  null:null   0   {module,"leo_mq_consumer"},{function,"init/1"},{line,171},{body,[{id,leo_delete_dir_queue_7_consumer_2_1},{publisher_id,leo_delete_dir_queue_7},{interval,500},{batch_of_msgs,1600}]}
[I] node01@10.3.15.101  2018-07-20 14:53:24.595041 +0300    1532087604  null:null   0   {module,"leo_mq_consumer"},{function,"init/1"},{line,171},{body,[{id,leo_delete_dir_queue_7_consumer_1_1},{publisher_id,leo_delete_dir_queue_7},{interval,500},{batch_of_msgs,1600}]}
[I] node01@10.3.15.101  2018-07-20 14:53:24.630688 +0300    1532087604  null:null   0   {module,"leo_mq_consumer"},{function,"init/1"},{line,171},{body,[{id,leo_delete_dir_queue_8_consumer_4_1},{publisher_id,leo_delete_dir_queue_8},{interval,500},{batch_of_msgs,1600}]}
[I] node01@10.3.15.101  2018-07-20 14:53:24.631353 +0300    1532087604  null:null   0   {module,"leo_mq_consumer"},{function,"init/1"},{line,171},{body,[{id,leo_delete_dir_queue_8_consumer_3_1},{publisher_id,leo_delete_dir_queue_8},{interval,500},{batch_of_msgs,1600}]}
[I] node01@10.3.15.101  2018-07-20 14:53:24.632014 +0300    1532087604  null:null   0   {module,"leo_mq_consumer"},{function,"init/1"},{line,171},{body,[{id,leo_delete_dir_queue_8_consumer_2_1},{publisher_id,leo_delete_dir_queue_8},{interval,500},{batch_of_msgs,1600}]}
[I] node01@10.3.15.101  2018-07-20 14:53:24.632664 +0300    1532087604  null:null   0   {module,"leo_mq_consumer"},{function,"init/1"},{line,171},{body,[{id,leo_delete_dir_queue_8_consumer_1_1},{publisher_id,leo_delete_dir_queue_8},{interval,500},{batch_of_msgs,1600}]}
[I] node01@10.3.15.101  2018-07-20 14:53:24.693174 +0300    1532087604  leo_storage_handler_del_directory:terminate/2   232 [{"terminate with reason",{noproc,{gen_server,call,[undefined,{get,<<"develop/test/27.mp4\n4\n">>},30000]}}}]
[I] node01@10.3.15.101  2018-07-20 14:53:24.796124 +0300    1532087604  null:null   0   Application leo_rpc started on node 'node01@10.3.15.101'
[I] node01@10.3.15.101  2018-07-20 14:53:24.817256 +0300    1532087604  leo_storage_handler_del_directory:terminate/2   232 [{"terminate with reason",{noproc,{gen_server,call,[undefined,{get,<<"develop/test/26.mp4\n1\n">>},30000]}}}]
[I] node01@10.3.15.101  2018-07-20 14:53:24.868082 +0300    1532087604  leo_storage_handler_del_directory:terminate/2   232 [{"terminate with reason",{noproc,{gen_server,call,[undefined,{get,<<"develop/test/26.mp4\n3\n">>},30000]}}}]
[I] node01@10.3.15.101  2018-07-20 14:53:24.888621 +0300    1532087604  leo_storage_handler_del_directory:terminate/2   232 [{"terminate with reason",{noproc,{gen_server,call,[undefined,{get,<<"develop/test/27.mp4\n2\n">>},30000]}}}]
[I] node01@10.3.15.101  2018-07-20 14:53:24.915722 +0300    1532087604  leo_storage_handler_del_directory:terminate/2   232 [{"terminate with reason",{noproc,{gen_server,call,[undefined,{get,<<"develop/test/27.mp4\n1\n">>},30000]}}}]
[I] node01@10.3.15.101  2018-07-20 14:53:24.935328 +0300    1532087604  null:null   0   Application mnesia started on node 'node01@10.3.15.101'
[I] node01@10.3.15.101  2018-07-20 14:53:25.119988 +0300    1532087605  leo_storage_handler_del_directory:terminate/2   232 [{"terminate with reason",{noproc,{gen_server,call,[undefined,{get,<<"develop/test/26.mp4\n4\n">>},30000]}}}]
[I] node01@10.3.15.101  2018-07-20 14:53:25.191023 +0300    1532087605  null:null   0   Application snmp started on node 'node01@10.3.15.101'
[I] node01@10.3.15.101  2018-07-20 14:53:26.388456 +0300    1532087606  null:null   0   Application leo_statistics started on node 'node01@10.3.15.101'

error:

[E] node01@10.3.15.101  2018-07-20 14:53:24.693908 +0300    1532087604  gen_server:call 0   gen_server leo_storage_handler_del_directory terminated with reason: no such process or port in call to gen_server:call(undefined, {get,<<"develop/test/27.mp4\n4\n">>}, 30000) in gen_server:call/3 line 212
[E] node01@10.3.15.101  2018-07-20 14:53:24.703060 +0300    1532087604  gen_server:call 0   CRASH REPORT Process leo_storage_handler_del_directory with 0 neighbours exited with reason: no such process or port in call to gen_server:call(undefined, {get,<<"develop/test/27.mp4\n4\n">>}, 30000) in gen_server:terminate/7 line 812
[E] node01@10.3.15.101  2018-07-20 14:53:24.703841 +0300    1532087604  gen_server:call 0   Supervisor leo_storage_sup had child leo_storage_handler_del_directory started with leo_storage_handler_del_directory:start_link() at <0.5420.0> exit with reason no such process or port in call to gen_server:call(undefined, {get,<<"develop/test/27.mp4\n4\n">>}, 30000) in context child_terminated
[E] node01@10.3.15.101  2018-07-20 14:53:24.817950 +0300    1532087604  gen_server:call 0   gen_server leo_storage_handler_del_directory terminated with reason: no such process or port in call to gen_server:call(undefined, {get,<<"develop/test/26.mp4\n1\n">>}, 30000) in gen_server:call/3 line 212
[E] node01@10.3.15.101  2018-07-20 14:53:24.818550 +0300    1532087604  gen_server:call 0   CRASH REPORT Process leo_storage_handler_del_directory with 0 neighbours exited with reason: no such process or port in call to gen_server:call(undefined, {get,<<"develop/test/26.mp4\n1\n">>}, 30000) in gen_server:terminate/7 line 812
[E] node01@10.3.15.101  2018-07-20 14:53:24.819080 +0300    1532087604  gen_server:call 0   Supervisor leo_storage_sup had child leo_storage_handler_del_directory started with leo_storage_handler_del_directory:start_link() at <0.5430.0> exit with reason no such process or port in call to gen_server:call(undefined, {get,<<"develop/test/26.mp4\n1\n">>}, 30000) in context child_terminated
[E] node01@10.3.15.101  2018-07-20 14:53:24.868776 +0300    1532087604  gen_server:call 0   gen_server leo_storage_handler_del_directory terminated with reason: no such process or port in call to gen_server:call(undefined, {get,<<"develop/test/26.mp4\n3\n">>}, 30000) in gen_server:call/3 line 212
[E] node01@10.3.15.101  2018-07-20 14:53:24.869267 +0300    1532087604  gen_server:call 0   CRASH REPORT Process leo_storage_handler_del_directory with 0 neighbours exited with reason: no such process or port in call to gen_server:call(undefined, {get,<<"develop/test/26.mp4\n3\n">>}, 30000) in gen_server:terminate/7 line 812
[E] node01@10.3.15.101  2018-07-20 14:53:24.869732 +0300    1532087604  gen_server:call 0   Supervisor leo_storage_sup had child leo_storage_handler_del_directory started with leo_storage_handler_del_directory:start_link() at <0.5587.0> exit with reason no such process or port in call to gen_server:call(undefined, {get,<<"develop/test/26.mp4\n3\n">>}, 30000) in context child_terminated
[E] node01@10.3.15.101  2018-07-20 14:53:24.889295 +0300    1532087604  gen_server:call 0   gen_server leo_storage_handler_del_directory terminated with reason: no such process or port in call to gen_server:call(undefined, {get,<<"develop/test/27.mp4\n2\n">>}, 30000) in gen_server:call/3 line 212
[E] node01@10.3.15.101  2018-07-20 14:53:24.889722 +0300    1532087604  gen_server:call 0   CRASH REPORT Process leo_storage_handler_del_directory with 0 neighbours exited with reason: no such process or port in call to gen_server:call(undefined, {get,<<"develop/test/27.mp4\n2\n">>}, 30000) in gen_server:terminate/7 line 812
[E] node01@10.3.15.101  2018-07-20 14:53:24.890096 +0300    1532087604  gen_server:call 0   Supervisor leo_storage_sup had child leo_storage_handler_del_directory started with leo_storage_handler_del_directory:start_link() at <0.5603.0> exit with reason no such process or port in call to gen_server:call(undefined, {get,<<"develop/test/27.mp4\n2\n">>}, 30000) in context child_terminated
[E] node01@10.3.15.101  2018-07-20 14:53:24.916608 +0300    1532087604  gen_server:call 0   gen_server leo_storage_handler_del_directory terminated with reason: no such process or port in call to gen_server:call(undefined, {get,<<"develop/test/27.mp4\n1\n">>}, 30000) in gen_server:call/3 line 212
[E] node01@10.3.15.101  2018-07-20 14:53:24.917065 +0300    1532087604  gen_server:call 0   CRASH REPORT Process leo_storage_handler_del_directory with 0 neighbours exited with reason: no such process or port in call to gen_server:call(undefined, {get,<<"develop/test/27.mp4\n1\n">>}, 30000) in gen_server:terminate/7 line 812
[E] node01@10.3.15.101  2018-07-20 14:53:24.917640 +0300    1532087604  gen_server:call 0   Supervisor leo_storage_sup had child leo_storage_handler_del_directory started with leo_storage_handler_del_directory:start_link() at <0.5610.0> exit with reason no such process or port in call to gen_server:call(undefined, {get,<<"develop/test/27.mp4\n1\n">>}, 30000) in context child_terminated
[E] node01@10.3.15.101  2018-07-20 14:53:25.120532 +0300    1532087605  gen_server:call 0   gen_server leo_storage_handler_del_directory terminated with reason: no such process or port in call to gen_server:call(undefined, {get,<<"develop/test/26.mp4\n4\n">>}, 30000) in gen_server:call/3 line 212
[E] node01@10.3.15.101  2018-07-20 14:53:25.120840 +0300    1532087605  gen_server:call 0   CRASH REPORT Process leo_storage_handler_del_directory with 0 neighbours exited with reason: no such process or port in call to gen_server:call(undefined, {get,<<"develop/test/26.mp4\n4\n">>}, 30000) in gen_server:terminate/7 line 812
[E] node01@10.3.15.101  2018-07-20 14:53:25.121222 +0300    1532087605  gen_server:call 0   Supervisor leo_storage_sup had child leo_storage_handler_del_directory started with leo_storage_handler_del_directory:start_link() at <0.5617.0> exit with reason no such process or port in call to gen_server:call(undefined, {get,<<"develop/test/26.mp4\n4\n">>}, 30000) in context child_terminated
[E] node01@10.3.15.101  2018-07-20 14:53:25.121490 +0300    1532087605  null:null   0   Supervisor leo_storage_sup had child leo_storage_handler_del_directory started with leo_storage_handler_del_directory:start_link() at <0.5617.0> exit with reason reached_max_restart_intensity in context shutdown
[E] node01@10.3.15.101  2018-07-20 14:53:25.154555 +0300    1532087605  null:null   0   Error in process <0.5671.0> on node 'node01@10.3.15.101' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}
[E] node01@10.3.15.101  2018-07-20 14:53:25.181681 +0300    1532087605  null:null   0   Error in process <0.5679.0> on node 'node01@10.3.15.101' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}
[E] node01@10.3.15.101  2018-07-20 14:53:25.210710 +0300    1532087605  null:null   0   Error in process <0.5680.0> on node 'node01@10.3.15.101' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}
[E] node01@10.3.15.101  2018-07-20 14:53:25.211059 +0300    1532087605  null:null   0   Error in process <0.5681.0> on node 'node01@10.3.15.101' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}
[E] node01@10.3.15.101  2018-07-20 14:53:25.243857 +0300    1532087605  null:null   0   Error in process <0.5686.0> on node 'node01@10.3.15.101' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}
[E] node01@10.3.15.101  2018-07-20 14:53:27.153283 +0300    1532087607  null:null   0   Supervisor leo_storage_sup had child leo_object_storage_sup started with leo_object_storage_sup:start_link([{64,"/mnt/disk0"},{64,"/mnt/disk1"},{64,"/mnt/disk2"},{64,"/mnt/disk3"},{64,"/mnt/disk4"},{64,"..."}], leo_storage_msg_collector) at <0.106.0> exit with reason killed in context shutdown_error
[E] node01@10.3.15.101  2018-07-20 14:53:27.758629 +0300    1532087607  null:null   0   Error in process <0.5731.0> on node 'node01@10.3.15.101' with exit value:
{badarg,[{ets,insert,[leo_env_values,{{env,leo_redundant_manager,options},[{version,1},{cluster_id,leofs_1},{dc_id,dc_1},{n,3},{r,1},{w,1},{d,1},{bit_of_ring,128},{num_of_dc_replicas,1},{mdcr_r,1},{mdcr_w,1},{mdcr_d,1},{num_of_rack_replicas,0},{max_mdc_targets,2}]}],[]},{leo_misc,set_env,3,[{file,"src/leo_misc.erl"},{line,143}]},{leo_redundant_manager_api,set_options,1,[{file,"src/leo_redundant_manager_api.erl"},{line,203}]},{leo_storage_api,register_in_monitor_1,3,[{file,"src/leo_storage_api.erl"},{line,110}]},{leo_storage_api,register_in_monitor,2,[{file,"src/leo_storage_api.erl"},{line,74}]}]}
[E] node01@10.3.15.101  2018-07-20 14:53:29.754452 +0300    1532087609  leo_storage_app:after_proc_1/3  235 [{cause,"Not initialize leo_backend_db_sup"}]
[E] node01@10.3.15.101  2018-07-20 14:53:29.754967 +0300    1532087609  null:null   0   {module,"leo_storage_app"},{function,"after_proc_1/3"},{line,224},{body,"Could NOT start backend-db sup"}
[E] node01@10.3.15.101  2018-07-20 14:53:29.755459 +0300    1532087609  leo_storage_app:start   0   CRASH REPORT Process <0.44.0> with 0 neighbours exited with reason: bad return value ok from leo_storage_app:start(normal, []) in application_master:init/4 line 134

crash.log (last part; before that there are a few other ERROR REPORTs regarding different files, and they look exactly like the one quoted here):

2018-07-20 14:53:25 =ERROR REPORT====
** Generic server leo_storage_handler_del_directory terminating 
** Last message in was {enqueue,1,<<"develop/test/26.mp4\n4\n">>,false}
** When Server state == {state,8,['mgr01@10.3.15.21','mgr02@10.3.15.22'],[]}
** Reason for termination == 
** {{noproc,{gen_server,call,[undefined,{get,<<"develop/test/26.mp4\n4\n">>},30000]}},[{gen_server,call,3,[{file,"gen_server.erl"},{line,212}]},{leo_storage_handler_del_directory,handle_call,3,[{file,"src/leo_storage_handler_del_directory.erl"},{line,129}]},{gen_server,try_handle_call,4,[{file,"gen_server.erl"},{line,615}]},{gen_server,handle_msg,5,[{file,"gen_server.erl"},{line,647}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,247}]}]}
2018-07-20 14:53:25 =CRASH REPORT====
  crasher:
    initial call: leo_storage_handler_del_directory:init/1
    pid: <0.5617.0>
    registered_name: leo_storage_handler_del_directory
    exception exit: {{noproc,{gen_server,call,[undefined,{get,<<"develop/test/26.mp4\n4\n">>},30000]}},[{gen_server,terminate,7,[{file,"gen_server.erl"},{line,812}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,247}]}]}
    ancestors: [leo_storage_sup,<0.45.0>]
    messages: []
    links: [<0.62.0>]
    dictionary: []
    trap_exit: false
    status: running
    heap_size: 376
    stack_size: 27
    reductions: 858
  neighbours:
2018-07-20 14:53:25 =SUPERVISOR REPORT====
     Supervisor: {local,leo_storage_sup}
     Context:    child_terminated
     Reason:     {noproc,{gen_server,call,[undefined,{get,<<"develop/test/26.mp4\n4\n">>},30000]}}
     Offender:   [{pid,<0.5617.0>},{id,leo_storage_handler_del_directory},{mfargs,{leo_storage_handler_del_directory,start_link,[]}},{restart_type,permanent},{shutdown,2000},{child_type,worker}]

2018-07-20 14:53:25 =SUPERVISOR REPORT====
     Supervisor: {local,leo_storage_sup}
     Context:    shutdown
     Reason:     reached_max_restart_intensity
     Offender:   [{pid,<0.5617.0>},{id,leo_storage_handler_del_directory},{mfargs,{leo_storage_handler_del_directory,start_link,[]}},{restart_type,permanent},{shutdown,2000},{child_type,worker}]

2018-07-20 14:53:25 =ERROR REPORT====
Error in process <0.5671.0> on node 'node01@10.3.15.101' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}
2018-07-20 14:53:25 =ERROR REPORT====
Error in process <0.5679.0> on node 'node01@10.3.15.101' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}
2018-07-20 14:53:25 =ERROR REPORT====
Error in process <0.5680.0> on node 'node01@10.3.15.101' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}
2018-07-20 14:53:25 =ERROR REPORT====
Error in process <0.5681.0> on node 'node01@10.3.15.101' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}
2018-07-20 14:53:25 =ERROR REPORT====
Error in process <0.5686.0> on node 'node01@10.3.15.101' with exit value:
{badarg,[{gen_fsm,send_event,2,[{file,"gen_fsm.erl"},{line,215}]}]}
2018-07-20 14:53:27 =SUPERVISOR REPORT====
     Supervisor: {local,leo_storage_sup}
     Context:    shutdown_error
     Reason:     killed
     Offender:   [{pid,<0.106.0>},{id,leo_object_storage_sup},{mfargs,{leo_object_storage_sup,start_link,[[{64,"/mnt/disk0"},{64,"/mnt/disk1"},{64,"/mnt/disk2"},{64,"/mnt/disk3"},{64,"/mnt/disk4"},{64,"/mnt/disk5"}],leo_storage_msg_collector]}},{restart_type,permanent},{shutdown,2000},{child_type,supervisor}]

2018-07-20 14:53:27 =ERROR REPORT====
Error in process <0.5731.0> on node 'node01@10.3.15.101' with exit value:
{badarg,[{ets,insert,[leo_env_values,{{env,leo_redundant_manager,options},[{version,1},{cluster_id,leofs_1},{dc_id,dc_1},{n,3},{r,1},{w,1},{d,1},{bit_of_ring,128},{num_of_dc_replicas,1},{mdcr_r,1},{mdcr_w,1},{mdcr_d,1},{num_of_rack_replicas,0},{max_mdc_targets,2}]}],[]},{leo_misc,set_env,3,[{file,"src/leo_misc.erl"},{line,143}]},{leo_redundant_manager_api,set_options,1,[{file,"src/leo_redundant_manager_api.erl"},{line,203}]},{leo_storage_api,register_in_monitor_1,3,[{file,"src/leo_storage_api.erl"},{line,110}]},{leo_storage_api,register_in_monitor,2,[{file,"src/leo_storage_api.erl"},{line,74}]}]}
2018-07-20 14:53:29 =ERROR REPORT====
{module,"leo_storage_app"},{function,"after_proc_1/3"},{line,224},{body,"Could NOT start backend-db sup"}
2018-07-20 14:53:29 =CRASH REPORT====
  crasher:
    initial call: application_master:init/4
    pid: <0.44.0>
    registered_name: []
    exception exit: {{bad_return,{{leo_storage_app,start,[normal,[]]},ok}},[{application_master,init,4,[{file,"application_master.erl"},{line,134}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,247}]}]}
    ancestors: [<0.43.0>]
    messages: [{'EXIT',<0.45.0>,normal}]
    links: [<0.43.0>,<0.9.0>]
    dictionary: []
    trap_exit: true
    status: running
    heap_size: 610
    stack_size: 27
    reductions: 1649
  neighbours:

Should we remove all queues on all nodes or wait for some other solution?

mocchira commented 6 years ago

@vstax

Should we remove all queues on all nodes or wait for some other solution?

Yes, as I've answered at https://github.com/leo-project/leofs/issues/1076#issuecomment-406536504

Below is the relevant information quoted from that comment.

but I've just noticed there might be a racy problem: if leo_storage has not yet initialized the leo_backend_db for delete-(bucket|directory) and receives delete requests from other nodes, then the above errors can happen. So, as I commented on #1078 (comment), deleting all queue files in the cluster, issuing mq-suspend leo_delete_dir_queue_[1-8] on all storage nodes except node01, and then restarting node01 may work. Please give it a try.

So the procedure you would have to take is

OR

Wait for the racy problem I described at https://github.com/leo-project/leofs/issues/1076#issuecomment-406539455 to get fixed and cherry-pick the fix into your LeoFS build. (The fix will just move one line to another position, so it should be easy to cherry-pick the commit, I think.)

The latter option will be available three or four days from now (my next working day), so if you can't wait for it, the former would be your choice.
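
For the mq-suspend part of the first option, the command takes a storage node and an MQ id; the ids below match the publisher ids visible in the startup log above (shown for one node as an example - it would have to be repeated on every storage node except node01):

$ ./leofs-adm mq-suspend node05@10.3.15.105 leo_delete_dir_queue_1
$ ./leofs-adm mq-suspend node05@10.3.15.105 leo_delete_dir_queue_2
## ...and so on through leo_delete_dir_queue_8, then the same on the other storage nodes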

vstax commented 6 years ago

@mocchira OK, I'd like to go with the first plan, but I have a question: do you mean deleting the queue subdirectories while the node is running? Is that safe? I was thinking of shutting down all storage nodes for a short time, removing the directories while no nodes are running, and then starting all nodes (with node01 probably being the last).

(Cherry-picking/building and installing would be pretty hard for this cluster, unfortunately, as I'm not operating it and am just helping to solve this problem - for example, this cluster has never been upgraded since it was initially installed on 1.3.8, because an upgrade hasn't yet been tested by its operator.)

mocchira commented 6 years ago

@vstax

(Cherry-picking/building and installing would be pretty hard for this cluster, unfortunately, as I'm not operating it and am just helping to solve this problem - for example, this cluster has never been upgraded since it was initially installed on 1.3.8, because an upgrade hasn't yet been tested by its operator.)

Got it.

OK, I'd like to go with the first plan, but I have a question: do you mean deleting the queue subdirectories while the node is running? Is that safe?

No. It's not safe so

I was thinking of shutting down all storage nodes for a short time, removing the directories while no nodes are running, and then starting all nodes (with node01 probably being the last).

The procedure you described would be best if stopping all nodes for a short time is allowed. One thing I would like to suggest is starting node01 first: the racy problem can happen when the other nodes have delete messages in their queues and send them to node01 during its startup phase, so starting node01 first avoids it.
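
To spell the whole sequence out, it would look roughly like the following; the queue location and the start/stop script paths are assumptions about your deployment (the queue directory is whatever queue_dir in leo_storage.conf points to):

$ bin/leo_storage stop                     ## on every storage node
$ rm -rf <leo_storage-dir>/work/queue/*    ## on every storage node, only while it is stopped
$ bin/leo_storage start                    ## on node01 first
$ bin/leo_storage start                    ## then on the remaining nodes, one by one
$ ./leofs-adm status                       ## confirm every node comes back as 'running'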

vstax commented 6 years ago

@mocchira Thanks, the procedure of removing all queues worked; node01 is up and running. I will now be executing recover-node for all nodes (starting with node09), one by one, to make sure that no data is lost.
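
Concretely, one node at a time, keeping an eye on the queues before moving on to the next one:

$ ./leofs-adm recover-node node09@10.3.15.109
$ ./leofs-adm mq-stats node09@10.3.15.109     ## watch the queues drain before starting the next node
$ ./leofs-adm recover-node node08@10.3.15.108
## ...and so on for the remaining nodes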

However, the original problem with errors in logs of node01 has returned. There are messages in logs like

[E] node01@10.3.15.101  2018-07-20 18:19:09.22301 +0300 1532099949  leo_storage_handler_object:get/3    158 [{from,gateway},{method,get},{key,<<"production/promotion/247/1.jpg">>},{req_id,9680055},{cause,unavailable}]

Whereis output:

leofs-adm whereis production/promotion/247/1.jpg
-------+-------------------------+--------------------------------------+------------+--------------+----------------+----------------+----------------+----------------------------
 del?  |          node           |             ring address             |    size    |   checksum   |  has children  |  total chunks  |     clock      |             when            
-------+-------------------------+--------------------------------------+------------+--------------+----------------+----------------+----------------+----------------------------
       | node01@10.3.15.101      | ce61c5f15e4a56bd280588f5c6e6ef15     |        96K |   635bf4a355 | false          |              0 | 56b47a1211c1a  | 2018-05-03 09:51:13 +0300
       | node06@10.3.15.106      | ce61c5f15e4a56bd280588f5c6e6ef15     |        96K |   635bf4a355 | false          |              0 | 56b47a1211c1a  | 2018-05-03 09:51:13 +0300
       | node09@10.3.15.109      | ce61c5f15e4a56bd280588f5c6e6ef15     |        96K |   635bf4a355 | false          |              0 | 56b47a1211c1a  | 2018-05-03 09:51:13 +0300

Another strange one is

[W] node01@10.3.15.101  2018-07-20 18:21:12.773669 +0300    1532100072  leo_storage_handler_object:read_and_repair_3/3  1340    [{key,<<"production-thumbnail/parts/new/1935059/1_116x88.png">>},{cause,timeout}]
leofs-adm whereis production-thumbnail/parts/new/1935059/1_116x88.png
-------+-------------------------+--------------------------------------+------------+--------------+----------------+----------------+----------------+----------------------------
 del?  |          node           |             ring address             |    size    |   checksum   |  has children  |  total chunks  |     clock      |             when            
-------+-------------------------+--------------------------------------+------------+--------------+----------------+----------------+----------------+----------------------------
       | node07@10.3.15.107      |                                      |            |              |                |                |                | 
       | node01@10.3.15.101      |                                      |            |              |                |                |                | 
       | node06@10.3.15.106      |                                      |            |              |                |                |                | 

Anyhow, these errors appear only on node01, and there are quite a lot of them. No errors on any of the other nodes (other than some related to timeouts during recover-node operations and cause,not_found ones when objects that don't exist are requested).

mocchira commented 6 years ago

Thanks, the procedure of removing all queues worked; node01 is up and running. I will now be executing recover-node for all nodes (starting with node09), one by one, to make sure that no data is lost.

Great to hear that.

However, the original problem with errors in logs of node01 has returned. There are messages in logs like

Another strange one is

Anyhow, these errors appear only on node01, and there are quite a lot of them. No errors on any of the other nodes (other than some related to timeouts during recover-node operations and cause,not_found ones when objects that don't exist are requested).

I see. This might be some other problem happening only on node01. I will dig in on my next working day.

mocchira commented 6 years ago

@vstax Given that

That being said, {cause,unavailable} is most likely caused by the last one, so the RING managed by leo_redundant_manager on node01 might have some problem. Can you share the dump-ring files from node01? If you find any difference between node01's RING and the others', then leofs-adm recover-ring node01@10.3.15.101 may solve the problem.
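
For example (dump-ring writes RING dump files such as ring_cur.dump on the target node; where exactly they are written depends on the node's configuration, so copy them somewhere side by side before diffing):

$ ./leofs-adm dump-ring node01@10.3.15.101
$ ./leofs-adm dump-ring node05@10.3.15.105         ## any other storage node, for comparison
$ diff ring_cur.dump.node01 ring_cur.dump.node05   ## after renaming/copying the dumps next to each other
$ ./leofs-adm recover-ring node01@10.3.15.101      ## only if the diff shows a real difference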

vstax commented 6 years ago

@mocchira Yes, the watchdogs are in the default configuration (just rex), there are no compactions, and process_limit was never reached (procs | 3586/1048576). All configuration and stats (everything default, really, except for the AVS directories and num_of_containers) are identical between nodes. The only difference between node01 and the others is the 64 AVS containers per directory and the troubles that followed the takeover.

I do not see any difference between the RINGs on the storage nodes and the gateways; there are some differences between the RING on the storage nodes and the one on the manager, but only in the timestamps:

$ diff -U0 ring_cur.dump.node05 ring_cur.dump.manager 
--- ring_cur.dump.node05    2018-07-22 11:21:48.162321481 +0300
+++ ring_cur.dump.manager   2018-07-22 11:20:36.316372479 +0300
@@ -2 +2 @@
-{752169489965206552452318705802254833,'node01@10.3.15.101',1532099543627955}.
+{752169489965206552452318705802254833,'node01@10.3.15.101',1531580613070271}.
@@ -6 +6 @@
-{1863911322083805729876378402340932715,'node01@10.3.15.101',1532099543627955}.
+{1863911322083805729876378402340932715,'node01@10.3.15.101',1531580613070271}.
@@ -12,2 +12,2 @@
-{5487687885213189340319180120063317160,'node01@10.3.15.101',1532099543627955}.
-{6099844769545649456419550801817607840,'node01@10.3.15.101',1532099543627955}.
+{5487687885213189340319180120063317160,'node01@10.3.15.101',1531580613070271}.
+{6099844769545649456419550801817607840,'node01@10.3.15.101',1531580613070271}.
@@ -19,3 +19,3 @@
-{10502214423935970566910683396591078631,'node01@10.3.15.101',1532099543627955}.
-{11726269121385565646524378420040387246,'node01@10.3.15.101',1532099543627955}.
-{11781322914419147355105744823461180549,'node01@10.3.15.101',1532099543627955}.
+{10502214423935970566910683396591078631,'node01@10.3.15.101',1531580613070271}.
+{11726269121385565646524378420040387246,'node01@10.3.15.101',1531580613070271}.
+{11781322914419147355105744823461180549,'node01@10.3.15.101',1531580613070271}.
@@ -23 +23 @@
-{14305833216997888477823880572566442104,'node01@10.3.15.101',1532099543627955}.
+{14305833216997888477823880572566442104,'node01@10.3.15.101',1531580613070271}.

[skipped]

@@ -1702 +1702 @@
- 1532099543627955}.
+ 1531580613070271}.
@@ -1716 +1716 @@
- 1532099543627955}.
+ 1531580613070271}.
@@ -1720 +1720 @@
- 1532099543627955}.
+ 1531580613070271}.

The other thing I've noticed is that the *_worker.log files are 1.5 times larger now - you can compare them to the older ones posted at https://github.com/leo-project/leofs/issues/1076#issuecomment-405280094 - even though the RING (and .dump) files themselves are essentially the same (at least on the storage nodes).

Anyhow, dumps from the manager: manager.tar.gz. And from node01 (identical across all nodes and gateways): node01.tar.gz.

Just in case, status output for manager and the first node:

 [System Confiuration]
-----------------------------------+----------
 Item                              | Value    
-----------------------------------+----------
 Basic/Consistency level
-----------------------------------+----------
                    system version | 1.3.8
                        cluster Id | leofs_1
                             DC Id | dc_1
                    Total replicas | 3
          number of successes of R | 1
          number of successes of W | 1
          number of successes of D | 1
 number of rack-awareness replicas | 0
                         ring size | 2^128
-----------------------------------+----------
 Multi DC replication settings
-----------------------------------+----------
 [mdcr] max number of joinable DCs | 2
 [mdcr] total replicas per a DC    | 1
 [mdcr] number of successes of R   | 1
 [mdcr] number of successes of W   | 1
 [mdcr] number of successes of D   | 1
-----------------------------------+----------
 Manager RING hash
-----------------------------------+----------
                 current ring-hash | a01acd3c
                previous ring-hash | a01acd3c
-----------------------------------+----------

 [State of Node(s)]
-------+-------------------------+--------------+----------------+----------------+----------------------------
 type  |          node           |    state     |  current ring  |   prev ring    |          updated at         
-------+-------------------------+--------------+----------------+----------------+----------------------------
  S    | node01@10.3.15.101      | running      | a01acd3c       | a01acd3c       | 2018-07-20 18:12:23 +0300
  S    | node05@10.3.15.105      | running      | a01acd3c       | a01acd3c       | 2018-07-20 18:13:46 +0300
  S    | node06@10.3.15.106      | running      | a01acd3c       | a01acd3c       | 2018-07-20 18:13:52 +0300
  S    | node07@10.3.15.107      | running      | a01acd3c       | a01acd3c       | 2018-07-20 18:14:19 +0300
  S    | node08@10.3.15.108      | running      | a01acd3c       | a01acd3c       | 2018-07-20 18:14:24 +0300
  S    | node09@10.3.15.109      | running      | a01acd3c       | a01acd3c       | 2018-07-20 18:14:45 +0300
  G    | gw01@10.3.15.11         | running      | a01acd3c       | a01acd3c       | 2018-07-20 18:15:20 +0300
  G    | gw02@10.3.15.12         | running      | a01acd3c       | a01acd3c       | 2018-07-20 18:15:25 +0300
  G    | gw05@10.3.15.15         | running      | a01acd3c       | a01acd3c       | 2018-07-20 18:15:30 +0300
-------+-------------------------+--------------+----------------+----------------+----------------------------
--------------------------------------+--------------------------------------
                Item                  |                 Value                
--------------------------------------+--------------------------------------
 Config-1: basic
--------------------------------------+--------------------------------------
                              version | 1.3.8
                     number of vnodes | 168
                    object containers | - path:[/mnt/disk0], # of containers:64
                                      | - path:[/mnt/disk1], # of containers:64
                                      | - path:[/mnt/disk2], # of containers:64
                                      | - path:[/mnt/disk3], # of containers:64
                                      | - path:[/mnt/disk4], # of containers:64
                                      | - path:[/mnt/disk5], # of containers:64
                        log directory | ./log/erlang
                            log level | info
--------------------------------------+--------------------------------------
 Config-2: watchdog
--------------------------------------+--------------------------------------
 [rex(rpc-proc)]                      |
                    check interval(s) | 10
               threshold mem capacity | 33554432
--------------------------------------+--------------------------------------
 [cpu]                                |
                     enabled/disabled | disabled
                    check interval(s) | 10
               threshold cpu load avg | 5.0
                threshold cpu util(%) | 100
--------------------------------------+--------------------------------------
 [disk]                               |
                     enabled/disalbed | disabled
                    check interval(s) | 10
                threshold disk use(%) | 85
               threshold disk util(%) | 90
                    threshold rkb(kb) | 98304
                    threshold wkb(kb) | 98304
--------------------------------------+--------------------------------------
 Config-3: message-queue
--------------------------------------+--------------------------------------
                   number of procs/mq | 4
        number of batch-procs of msgs | max:3000, regular:1600
   interval between batch-procs (ms)  | max:3000, regular:500
--------------------------------------+--------------------------------------
 Config-4: autonomic operation
--------------------------------------+--------------------------------------
 [auto-compaction]                    |
                     enabled/disabled | disabled
        warning active size ratio (%) | 70
      threshold active size ratio (%) | 60
             number of parallel procs | 1
                        exec interval | 3600
--------------------------------------+--------------------------------------
 Config-5: data-compaction
--------------------------------------+--------------------------------------
  limit of number of compaction procs | 2
        number of batch-procs of objs | max:1500, regular:1000
   interval between batch-procs (ms)  | max:3000, regular:500
--------------------------------------+--------------------------------------
 Status-1: RING hash
--------------------------------------+--------------------------------------
                    current ring hash | a01acd3c
                   previous ring hash | a01acd3c
--------------------------------------+--------------------------------------
 Status-2: Erlang VM
--------------------------------------+--------------------------------------
                           vm version | 8.3.5.2
                      total mem usage | 97940176
                     system mem usage | 46387024
                      procs mem usage | 51768792
                        ets mem usage | 6188632
                                procs | 3586/1048576
                          kernel_poll | true
                     thread_pool_size | 32
--------------------------------------+--------------------------------------
 Status-3: Number of messages in MQ
--------------------------------------+--------------------------------------
                 replication messages | 10196596
                  vnode-sync messages | 0
                   rebalance messages | 0
--------------------------------------+--------------------------------------
mocchira commented 6 years ago

@vstax Thanks. It seems there are no problems around the RING. Anyway, I've just remembered that there is a drawback to increasing num_of_containers, although it has several advantages such as decreasing the AVS file size, minimizing compaction time, etc.

Increasing num_of_containers increases the number of Erlang processes dealing with AVS operations (you can think of it as the degree of concurrency); OTOH, leo_redundant_manager/leo_redundant_manager_worker, which handles RING operations like get_redundancies_by_key, remains a single Erlang process. As a result, as num_of_containers increases, the possibility of timeouts/errors on RING operations can also increase.

Can you run leofs_doctor on node01 to check whether the above assumption is correct? (If it is correct, then leo_redundant_manager's or leo_redundant_manager_worker's message queue will have grown large.)
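For reference, those mailbox sizes can also be checked with plain OTP calls from an Erlang shell attached to the storage node; a minimal sketch that only assumes the registered process names mentioned above:

%% Minimal sketch: report the message-queue length of the RING-related
%% processes named above (run from an Erlang shell attached to the node).
[case whereis(Name) of
     undefined -> {Name, not_registered};
     Pid       -> {Name, erlang:process_info(Pid, message_queue_len)}
 end || Name <- [leo_redundant_manager, leo_redundant_manager_worker]].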

mocchira commented 6 years ago

@vstax If the increased num_of_containers is the culprit, then you may decrease it to a lower value to solve this problem.

vstax commented 6 years ago

@mocchira Thanks for looking.

Actually, now that I checked, there are 100% CPU spikes in an Eleveldb thread for a second or two every 5-7 seconds or so:

top - 04:23:18 up 9 days, 12:09,  1 user,  load average: 0,56, 0,56, 0,53
Threads: 134 total,   1 running, 133 sleeping,   0 stopped,   0 zombie
%Cpu(s):  9,3 us,  0,9 sy,  0,0 ni, 89,8 id,  0,0 wa,  0,0 hi,  0,0 si,  0,0 st
KiB Mem : 32664668 total,   275880 free,  1816068 used, 30572720 buff/cache
KiB Swap:  3092476 total,  3092476 free,        0 used. 30294588 avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND                                                                  
 6352 leofs     20   0   13,4g   1,4g  25972 R 99,9  4,5 842:48.24 Eleveldb                                                                 
 6295 leofs     20   0   13,4g   1,4g  25972 S  5,9  4,5   1504:18 1_scheduler                                                              
 6262 leofs     20   0   13,4g   1,4g  25972 S  0,7  4,5   2:56.85 async_1                                                                  
 6114 leofs     20   0   13,4g   1,4g  25972 S  0,0  4,5   0:00.08 beam.smp                                                                 
 6260 leofs     20   0   13,4g   1,4g  25972 S  0,0  4,5   0:00.00 sys_sig_dispatc                                                          
 6261 leofs     20   0   13,4g   1,4g  25972 S  0,0  4,5   0:00.00 sys_msg_dispatc                                                          
 6263 leofs     20   0   13,4g   1,4g  25972 S  0,0  4,5   2:52.02 async_2                                                                  

However, these spikes are present on all nodes in the cluster. Still, it's somewhat strange.

Anyhow, here is the leofs_doctor output for the problematic node node01: https://pastebin.com/0WxpQF4p

And here is the output for node05, which is under recovery at the moment (as in recover-node node05). The CPU spikes in Eleveldb happen there as well, and they might actually be even worse: https://pastebin.com/0u3a7bnP No problems in the logs of node05, as usual.

And here is the output for node09, which is running on a weaker configuration under a VM (once again, no problems in the logs): https://pastebin.com/e9BhE0zx

Queues are empty on all of these nodes. The rest of the nodes (06, 07 and 08) have non-zero queues as they're pushing data to node05, but node01 and node09 currently don't participate in any such operations.

From what I understand, the load at that moment was quite low (I don't have exact numbers; a quick tcpdump on the gateway shows around 20 reads per second and almost no writes at all). I think the health check (/_leofs_adm/ping) was misconfigured and is being requested 2-3 times per second; I'm not sure whether that can create a problem or not.

Another thing I noticed is that "telegraf" is running on each node, gathering stats over SNMP locally rather often (every second). I'm not sure whether it could be the source of the problem, but it's configured that way on every node as well, so most likely it is not...

If the increased num_of_containers is the culprit, then you may decrease it to a lower value to solve this problem.

Well, it's an option, but I have doubts about two things. First of all, I managed another cluster in the past, and it ran perfectly fine with 4 * 64 containers; there were no such problems even during intense HEAD+PUT operations during data migration, despite somewhat weaker hardware (a 4-core Xeon E3). Here, node01 has 6 * 64 containers, a bit more, but the load is really low and the hardware is better.

Second, there are these Eleveldb CPU spikes; something just doesn't feel right about them. Maybe they are the original source and, combined with something else (e.g. leo_redundant_manager, as you mention), create problems for node01? Is it possible to identify the source of these spikes?

EDIT: One more thing: I've launched recover-node node06 and am observing high load on node01 (and the other nodes). It's completely different from the CPU spikes above; most of the CPU time (out of 200-300% total CPU usage) is spent in scheduler threads, and multiple Eleveldb threads are under load. In the CPU-spike problem above, it's always just one Eleveldb thread hogging a single core:

top - 10:55:04 up 9 days, 18:41,  1 user,  load average: 4,04, 4,23, 4,32
Threads: 134 total,   3 running, 131 sleeping,   0 stopped,   0 zombie
%Cpu(s): 12,0 us,  1,9 sy,  0,0 ni, 76,7 id,  9,3 wa,  0,0 hi,  0,1 si,  0,0 st
KiB Mem : 32664668 total,   226488 free,  1813384 used, 30624796 buff/cache
KiB Swap:  3092476 total,  3092476 free,        0 used. 30287960 avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND                                                                  
 6298 leofs     20   0   15,2g   1,4g  20724 R 32,9  4,4 792:43.22 4_scheduler                                                              
 6300 leofs     20   0   15,2g   1,4g  20724 R 27,2  4,4  20:06.99 6_scheduler                                                              
 6297 leofs     20   0   15,2g   1,4g  20724 S 24,9  4,4   1325:50 3_scheduler                                                              
 6299 leofs     20   0   15,2g   1,4g  20724 R 24,6  4,4 233:43.40 5_scheduler                                                              
 6332 leofs     20   0   15,2g   1,4g  20724 S 20,3  4,4 325:57.96 Eleveldb                                                                 
 6389 leofs     20   0   15,2g   1,4g  20724 S 16,3  4,4   2:24.47 Eleveldb                                                                 
 6296 leofs     20   0   15,2g   1,4g  20724 S  7,6  4,4   1488:26 2_scheduler                                                              
 6307 leofs     20   0   15,2g   1,4g  20724 S  3,0  4,4 206:46.35 aux                                                                      
 6295 leofs     20   0   15,2g   1,4g  20724 S  1,3  4,4   1649:59 1_scheduler                                                              
 6339 leofs     20   0   15,2g   1,4g  20724 S  1,0  4,4 628:31.24 Eleveldb                                                                 
 6396 leofs     20   0   15,2g   1,4g  20724 S  1,0  4,4  85:49.36 Eleveldb                                                                 
mocchira commented 6 years ago

@vstax Thanks for the quick reply with the leofs_doctor results from several nodes. It seems there are no problems at all, so my assumption was wrong.

Second, there are these Eleveldb CPU spikes; something just doesn't feel right about them. Maybe they are the original source and, combined with something else (e.g. leo_redundant_manager, as you mention), create problems for node01? Is it possible to identify the source of these spikes?

Yes, this can be the original source of the problem, so I will dig into it from now on.

EDIT: One more thing: I've launched recover-node node06 and am observing high load on node01 (and the other nodes). It's completely different from the CPU spikes above; most of the CPU time (out of 200-300% total CPU usage) is spent in scheduler threads, and multiple Eleveldb threads are under load. In the CPU-spike problem above, it's always just one Eleveldb thread hogging a single core:

The queue-filling phase that happens right after recover-node is started tends to hog only one OS thread (iterating over eleveldb to find the objects to be sent to the target node is executed one by one); OTOH, the queue-consuming phase that starts right after the filling phase finishes tends to hog as many scheduler threads as possible. This behavior might explain what is happening on your cluster now.
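For what it's worth, per-scheduler utilization can also be sampled from inside the VM instead of from top, using the standard scheduler_wall_time statistics; a minimal sketch to run from an Erlang shell attached to the node:

%% Minimal sketch: sample per-scheduler utilization over ~10 seconds.
%% A value close to 1.0 means that scheduler thread was busy almost the whole time.
erlang:system_flag(scheduler_wall_time, true).
Ts0 = lists:sort(erlang:statistics(scheduler_wall_time)).
timer:sleep(10000).
Ts1 = lists:sort(erlang:statistics(scheduler_wall_time)).
[{Id, (A1 - A0) / (T1 - T0)} || {{Id, A0, T0}, {Id, A1, T1}} <- lists:zip(Ts0, Ts1)].

If the consuming phase is indeed saturating the schedulers, most entries should be close to 1.0 while it runs.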

vstax commented 6 years ago

@mocchira

The queue-filling phase that happens right after recover-node is started tends to hog only one OS thread

I see. None of the nodes were filling queues in any of the above examples, though. node01 and node09 weren't doing anything with their queues, and node05 was only accepting data pushed by the queues of other nodes. I don't know whether processing the recovery queue on, e.g., node07, which is pushing data to node05, can cause any kind of load on node01 (which is long done with its own recovery queue), though. I wish I could check whether the CPU spikes happen in the absence of any recovery at all, but that's not likely to happen soon (there are 14-24 million messages in the recovery queues right now, as the next "recover-node" was launched some time ago).

In the worst case, if you are unable to find a solution that improves the current situation, do you think that recreating node01 (maybe with a new name) would fix all the problems? (That could be done 3-4 days from now, I think, after recover-node has finished for the other 5 nodes.)

mocchira commented 6 years ago

@vstax

I don't know whether processing the recovery queue on, e.g., node07, which is pushing data to node05, can cause any kind of load on node01 (which is long done with its own recovery queue), though.

Processing the recovery queue on node07, which is pushing data to node05 (driven by leofs-adm recover-node node05), shouldn't cause any kind of load on node01 (only node07 and node05 can be affected).

I wish I could check whether the CPU spikes happen in the absence of any recovery at all, but that's not likely to happen soon (there are 14-24 million messages in the recovery queues right now, as the next "recover-node" was launched some time ago).

I see. Please check it once every recover-node operation has finished.

In the worst case, if you are unable to find a solution that improves the current situation, do you think that recreating node01 (maybe with a new name) would fix all the problems? (That could be done 3-4 days from now, I think, after recover-node has finished for the other 5 nodes.)

Yes, I think so (actually, I was going to ask you to do that if I couldn't find any solution within a certain period). Just in case, if you do this, I'd like you to set obj.num_of_containers back to the original value (16 * 6) to be safe. (At this moment, I'm still not 100% sure that obj.num_of_containers has affected this problem, but I believe there is at least a possibility that it can.)

Anyway, I'm still trying to reproduce this problem on my dev box, and if I succeed in reproducing it and finding the root cause and its solution, I'll get back to you here.

vstax commented 6 years ago

@mocchira I've tried to check with strace what these high-CPU Eleveldb threads are doing. It's not perfect, I know, and it's hard to make anything out of this output, but maybe you will have some ideas about what might be going on (or rather, whether there is anything strange in there or not)?

Here is a trace from node06, which is not participating in any recovery; its queues are all at zero. The load on the cluster is around 50-60 GETs/sec. But just look at the strace statistics (gathered over 20 seconds), and that's just for a single Eleveldb thread out of the few (2-3 or so) that consume lots of CPU (only one of them consumes a whole core at any given moment, but it can switch to another thread every 20-40 seconds or so):

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 91.98    0.824162         328      2514        82 futex
  7.68    0.068824           2     45085           pread64
  0.29    0.002634           6       439           write
  0.02    0.000201           4        50           open
  0.01    0.000112           2        50           fadvise64
  0.01    0.000108           2        50           close
------ ----------- ----------- --------- --------- ----------------
100.00    0.896041                 48188        82 total

That's over 2,000 reads per second (generally all from the .sst files of the AVS). I have no idea whether that's just how leveldb works or whether this number is too high.

The strace statistics for node01 (not participating in recovery, and its queues are at zero as well) are very similar:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 76.48    0.228918         213      1076        79 futex
 23.22    0.069486           2     31439           pread64
  0.19    0.000561           2       358           write
  0.07    0.000214           4        49           open
  0.03    0.000083           2        49           close
  0.02    0.000046           1        49           fadvise64
------ ----------- ----------- --------- --------- ----------------
100.00    0.299308                 33020        79 total

(The number of reads over such a 20-second run can vary in the range of 10,000-50,000, depending on the luck of catching the Eleveldb thread that's actually doing the work. I haven't seen more than 3 such threads per node, though.)

Full strace outputs for node01 and node07, also lsof output for node01: strace.tar.gz

By the way, queue .sst files are opened/closed quite a lot.

vstax commented 6 years ago

@mocchira Some good news: one of the problems is gone! (Also, feel free to ignore the strace output from my previous comment, as I've made a better one here.)

Since recovery had finished for 5 nodes, I tried some experiments that might reveal the nature of the problem. I suspended and stopped node01 (it had zero queues at that moment) and wiped all the data from its AVS. I launched it without data (same settings, 6 * 64 AVS files), resumed it, and the errors started to appear right away:

[E]     node01@10.3.15.101      2018-07-26 23:36:50.863832 +0300        1532637410leo_storage_handler_object:get/4      190     [{from,gateway},{method,get},{key,<<"production/promotion/261/1.jpg">>},{req_id,31364649},{etag,18791873666153085349492137587707139197},{cause,unavailable}]
$ leofs-adm whereis production/promotion/261/1.jpg
-------+-------------------------+--------------------------------------+------------+--------------+----------------+----------------+----------------+----------------------------
 del?  |          node           |             ring address             |    size    |   checksum   |  has children  |  total chunks  |     clock      |             when
-------+-------------------------+--------------------------------------+------------+--------------+----------------+----------------+----------------+----------------------------
       | node01@10.3.15.101      |                                      |            |              |                |                |                |
       | node05@10.3.15.105      | dc5342106c16c5415fe8571b8fe6465a     |        62K |   e232ee7898 | false          |              0 | 56f88c2b18a22  | 2018-06-26 13:14:16 +0300
       | node08@10.3.15.108      | dc5342106c16c5415fe8571b8fe6465a     |        62K |   e232ee7898 | false          |              0 | 56f88c2b18a22  | 2018-06-26 13:14:16 +0300

$ leofs-adm du node01@10.3.15.101
 active number of objects: 123
  total number of objects: 123
   active size of objects: 10844493
    total size of objects: 10844493
     ratio of active size: 100.0%
    last compaction start: ____-__-__ __:__:__
      last compaction end: ____-__-__ __:__:__

I wanted to check whether the errors are related to some problem with the metadata on node01... it looks like they aren't, as they happen even when the objects aren't present there at all. Interestingly, read-repair also didn't work for the objects that produced these errors! It works just fine for the objects that are requested without errors, though (obviously, all 123 objects on this node are from read-repair).

Then I wiped everything once again and started the node with 6 * 16 containers. No errors! Your guess about 6 * 64 containers being problematic was quite right. I removed the data and started again with 6 * 32 containers. No errors so far. So it looks like some kind of problem appears only when the total number of containers is greater than 192 or maybe 256 (what I used in the other cluster), and 384 containers was enough here for it to appear. Anyhow, unless the errors return, this also lowers the priority of this problem (still, I'm quite interested in knowing what the original reason for this one was!)

Now, regarding the second problem, the spiky CPU usage in the Eleveldb thread, which turned out to be unrelated to the errors on node01:

Right after the node was launched without data, these 100% CPU spikes appeared again, despite the node holding almost no objects. I suspended node01 from the manager and they still happened, which seemed quite strange. Here are the strace and lsof outputs for the Eleveldb thread on node01 with wiped data, holding only about 200 objects and in the "suspend" state: strace-lsof-node01.tar.gz

So it turns out that it's reading .sst files from queue/1 like crazy even though there is nothing in the queues:

# leofs-adm mq-stats node01@10.3.15.101
              id                |    state    | number of msgs | batch of msgs  |    interval    |                 description
--------------------------------+-------------+----------------|----------------|----------------|---------------------------------------------
 leo_async_deletion_queue       |   idling    | 0              | 1600           | 500            | async deletion of objs
 leo_comp_meta_with_dc_queue    |   idling    | 0              | 1600           | 500            | compare metadata w/remote-node
 leo_delete_dir_queue_1         |   idling    | 0              | 1600           | 500            | deletion bucket #1
 leo_delete_dir_queue_2         |   idling    | 0              | 1600           | 500            | deletion bucket #2
 leo_delete_dir_queue_3         |   idling    | 0              | 1600           | 500            | deletion bucket #3
 leo_delete_dir_queue_4         |   idling    | 0              | 1600           | 500            | deletion bucket #4
 leo_delete_dir_queue_5         |   idling    | 0              | 1600           | 500            | deletion bucket #5
 leo_delete_dir_queue_6         |   idling    | 0              | 1600           | 500            | deletion bucket #6
 leo_delete_dir_queue_7         |   idling    | 0              | 1600           | 500            | deletion bucket #7
 leo_delete_dir_queue_8         |   idling    | 0              | 1600           | 500            | deletion bucket #8
 leo_per_object_queue           |   idling    | 0              | 1600           | 500            | recover inconsistent objs
 leo_rebalance_queue            |   idling    | 0              | 1600           | 500            | rebalance objs
 leo_recovery_node_queue        |   idling    | 0              | 1600           | 500            | recovery objs of node
 leo_req_delete_dir_queue       |   idling    | 0              | 1600           | 500            | request removing directories
 leo_sync_by_vnode_id_queue     |   idling    | 0              | 1600           | 500            | sync objs by vnode-id
 leo_sync_obj_with_dc_queue     |   idling    | 0              | 1600           | 500            | sync objs w/remote-node
[root@leofs-node01 app]# du -sm /usr/local/leofs/1.3.8/leo_storage/work/queue/1/*
0       /usr/local/leofs/1.3.8/leo_storage/work/queue/1/message0
203     /usr/local/leofs/1.3.8/leo_storage/work/queue/1/message0_63699318745
0       /usr/local/leofs/1.3.8/leo_storage/work/queue/1/message1
248     /usr/local/leofs/1.3.8/leo_storage/work/queue/1/message1_63699318745
0       /usr/local/leofs/1.3.8/leo_storage/work/queue/1/message2
235     /usr/local/leofs/1.3.8/leo_storage/work/queue/1/message2_63699318745
0       /usr/local/leofs/1.3.8/leo_storage/work/queue/1/message3
25      /usr/local/leofs/1.3.8/leo_storage/work/queue/1/message3_63699318745
[root@leofs-node01 ~]# LANG=C ll /usr/local/leofs/1.3.8/leo_storage/work/queue/1/message2_63699318745/sst_4
total 61404
-rw-r--r-- 1 leofs leofs  1528036 Jul 24 08:25 000617.sst
-rw-r--r-- 1 leofs leofs  7945841 Jul 25 04:13 000752.sst
-rw-r--r-- 1 leofs leofs   541527 Jul 25 07:48 000764.sst
-rw-r--r-- 1 leofs leofs  3944252 Jul 25 07:49 000767.sst
-rw-r--r-- 1 leofs leofs 16648218 Jul 25 10:30 000770.sst
-rw-r--r-- 1 leofs leofs  5820737 Jul 25 10:30 000771.sst
-rw-r--r-- 1 leofs leofs 16199070 Jul 25 12:02 000794.sst
-rw-r--r-- 1 leofs leofs 10234646 Jul 25 12:02 000795.sst

All the disk usage is in .sst files; from what I understand, they should contain no real data, only some removed entries that haven't been compacted yet. But every few seconds or so an Eleveldb thread tries to read them and does something with what it reads (using lots of CPU in the process). Note that these queues were wiped just a few days ago when I was trying to fix the launch of node01.
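In case it helps, the live key count of one of those queue backends could be checked directly with the eleveldb NIF bundled with LeoFS. This is only a rough sketch under a couple of assumptions: that eleveldb:open/2 and eleveldb:fold_keys/4 are available on the code path, and that it is run against a copy of the queue directory (or with the node stopped), since leveldb's LOCK file prevents a second opener; the copy path below is hypothetical.

%% Rough sketch (assumed eleveldb API): count the live keys of one queue
%% backend. Run it against a copy of the queue directory, e.g. a copy of
%% message2_63699318745 shown above; the path here is hypothetical.
Dir = "/tmp/queue-copy/message2_63699318745".
{ok, Db} = eleveldb:open(Dir, [{create_if_missing, false}]).
LiveKeys = eleveldb:fold_keys(Db, fun(_Key, Acc) -> Acc + 1 end, 0, []).
eleveldb:close(Db).
LiveKeys.

If the count is near zero while the .sst files stay large, that would match the reading above that only removed, not-yet-compacted entries are left.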

I think this problem might also be the reason why recover-node (the second stage) was quite slow and consumed insane amounts of CPU in locks instead of being disk- or network-bound (again, compared to how I've seen recover-node work on the other cluster, and given how this cluster is configured).

mocchira commented 6 years ago

@vstax

Some good news: one of the problems is gone! (Also, feel free to ignore the strace output from my previous comment, as I've made a better one here.)

Great to hear that! And understood regarding the previous comment.

Then I wiped everything once again and started the node with 6 * 16 containers. No errors! Your guess about 6 * 64 containers being problematic was quite right. I removed the data and started again with 6 * 32 containers. No errors so far. So it looks like some kind of problem appears only when the total number of containers is greater than 192 or maybe 256 (what I used in the other cluster), and 384 containers was enough here for it to appear. Anyhow, unless the errors return, this also lowers the priority of this problem (still, I'm quite interested in knowing what the original reason for this one was!)

Now I suspect that some lock contention has occurred due to scrambling for a shared resource between the underlying leveldb instances. This rarely happens when num_of_containers is relatively small; however, it starts to happen once num_of_containers goes above a certain threshold (192 or 256, as you mentioned). I will look into what the actual resource is next week.

Right after the node was launched without data, these 100% CPU spikes appeared again, despite the node holding almost no objects. I suspended node01 from the manager and they still happened, which seemed quite strange. Here are the strace and lsof outputs for the Eleveldb thread on node01 with wiped data, holding only about 200 objects and in the "suspend" state: strace-lsof-node01.tar.gz

https://github.com/leo-project/leofs/issues/1060 might be the culprit (there are many queues, and each queue has its own polling Erlang process), so this might be fixed once your cluster is upgraded to >= 1.4.2.
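If you want to check whether those per-queue polling processes are what keeps the VM busy while the queues are idle, here is a rough sketch that samples reductions over a few seconds from an attached Erlang shell (plain OTP calls only, nothing LeoFS-specific):

%% Rough sketch: find the processes that burn the most reductions over ~5 seconds.
%% If #1060 applies, the per-queue polling processes should show up near the top
%% even while every queue is empty.
Sample = fun() ->
             lists:foldl(fun(P, Acc) ->
                             case erlang:process_info(P, reductions) of
                                 {reductions, R} -> [{P, R} | Acc];
                                 undefined       -> Acc
                             end
                         end, [], erlang:processes())
         end.
S0 = maps:from_list(Sample()).
timer:sleep(5000).
Diffs = [{P, R - maps:get(P, S0, 0)} || {P, R} <- Sample()].
Top10 = lists:sublist(lists:reverse(lists:keysort(2, Diffs)), 10).
[{erlang:process_info(P, registered_name), D} || {P, D} <- Top10].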

mocchira commented 6 years ago

Now it turns out that there have been three problems behind the original issue filed as #1076:

  1. Manager's RING got broken
  2. A large num_of_containers setting can cause leo_storage to dump many errors
  3. CPU spike by eleveldb (and leveldb) threads

I'm going to deal with the first problem here, and with the second problem in https://github.com/leo-project/leofs/issues/1086. I think the last one will probably be solved by #1060, as I mentioned in the comment above, so I'm not filing an issue for it at the moment; however, if you still face it after upgrading to >= 1.4.2, I'd like to file an issue for it then (or please file one yourself).

From now on, let's discuss each topic on the appropriate issue.