openvstorage / framework

The Framework is a set of components and tools that provides the user with an interface (GUI / API) to set up, extend and manage an Open vStorage platform.

backend abm: Unlink of "/mnt/ssd1/arakoon/sas-back-abm/db/touch-20625" failed with ENOENT #2315

Open yongshengma opened 5 years ago

yongshengma commented 5 years ago

Hello,

ovs-arakoon-sas-back-abm.service keeps trying to activate but never starts up. It has been like this for a few weeks. Just wondering what's going on:

-- Logs begin at Tue 2019-07-09 13:59:34 CST, end at Tue 2019-07-09 13:59:58 CST. --
Jul 09 13:59:39 node3 systemd[1]: ovs-arakoon-sas-back-abm.service holdoff time over, scheduling restart.
Jul 09 13:59:39 node3 systemd[1]: Starting Arakoon service for cluster sas-back-abm...
Jul 09 13:59:39 node3 systemd[1]: Started Arakoon service for cluster sas-back-abm.
Jul 09 13:59:46 node3 systemd[1]: ovs-arakoon-sas-back-abm.service: main process exited, code=exited, status=41/n/a
Jul 09 13:59:46 node3 systemd[1]: Unit ovs-arakoon-sas-back-abm.service entered failed state.
Jul 09 13:59:46 node3 systemd[1]: ovs-arakoon-sas-back-abm.service failed.
Jul 09 13:59:51 node3 systemd[1]: ovs-arakoon-sas-back-abm.service holdoff time over, scheduling restart.
Jul 09 13:59:51 node3 systemd[1]: Starting Arakoon service for cluster sas-back-abm...
Jul 09 13:59:52 node3 systemd[1]: Started Arakoon service for cluster sas-back-abm.
Jul 09 13:59:52 node3 arakoon[20625]: 2019-07-09 13:59:52 041391 +0800 - node3 - 20625/0000 - arakoon - 0 - info - --- NODE STARTED ---
Jul 09 13:59:52 node3 arakoon[20625]: 2019-07-09 13:59:52 042718 +0800 - node3 - 20625/0000 - arakoon - 1 - info - git_revision: tags/1.9.22-0-gd4f2572-dirty
Jul 09 13:59:52 node3 arakoon[20625]: 2019-07-09 13:59:52 042735 +0800 - node3 - 20625/0000 - arakoon - 2 - info - compile_time: 08/12/2017 01:32:14 UTC
Jul 09 13:59:52 node3 arakoon[20625]: 2019-07-09 13:59:52 042744 +0800 - node3 - 20625/0000 - arakoon - 3 - info - version: 1.9.22
Jul 09 13:59:52 node3 arakoon[20625]: 2019-07-09 13:59:52 042754 +0800 - node3 - 20625/0000 - arakoon - 4 - info - NOFILE: 8192
Jul 09 13:59:52 node3 arakoon[20625]: 2019-07-09 13:59:52 042843 +0800 - node3 - 20625/0000 - arakoon - 5 - info - cluster_cfg={ cfgs = [{ node_name = "XHGfO1fHVDeRKHxa"; ips = ["192.168.0.34"]; client_port = 26408; messaging_port = 26409; home = "/mnt/ssd1/arakoon/sas-back-abm/db"; tlog_dir = "/mnt/ssd1/arakoon/sas-back-abm/tlogs"; log_sinks = [Console]; crash_log_sinks = [File(console:)]; tlx_dir = "/mnt/ssd1/arakoon/sas-back-abm/tlogs"; head_dir = "/mnt/ssd1/arakoon/sas-back-abm/tlogs"; log_level = "info"; log_config = None; batched_transaction_config = None; lease_period = 10; master = "Elected"; is_laggy = false; is_learner = false; is_witness = false; targets = []; compressor = Snappy; fsync = true; is_test = false; reporting = 300; _fsync_tlog_dir = true; node_tls = None; collapse_slowdown = None; head_copy_throttling = 0.; optimize_db_slowdown = 0. }; { node_name = "3MHjJSpuV1GRFJPJ"; ips = ["192.168.0.33"]; client_port = 26408; messaging_port = 26409; home = "/mnt/ssd1/arakoon/sas-back-abm/db"; tlog_dir = "/mnt/ssd1/arakoon/sas-back-abm/tlogs"; log_sinks = [Console]; crash_log_sinks = [File(console:)]; tlx_dir = "/mnt/ssd1/arakoon/sas-back-abm/tlogs"; head_dir = "/mnt/ssd1/arakoon/sas-back-abm/tlogs"; log_level = "info"; log_config = None; batched_transaction_config = None; lease_period = 10; master = "Elected"; is_laggy = false; is_learner = false; is_witness = false; targets = []; compressor = Snappy; fsync = true; is_test = false; reporting = 300; _fsync_tlog_dir = true; node_tls = None; collapse_slowdown = None; head_copy_throttling = 0.; optimize_db_slowdown = 0. }; { node_name = "IqXn44EwnGDSnQrd"; ips = ["192.168.0.35"]; client_port = 26408; messaging_port = 26409; home = "/mnt/ssd1/arakoon/sas-back-abm/db"; tlog_dir = "/mnt/ssd1/arakoon/sas-back-abm/tlogs"; log_sinks = [Console]; crash_log_sinks = [File(console:)]; tlx_dir = "/mnt/ssd1/arakoon/sas-back-abm/tlogs"; head_dir = "/mnt/ssd1/arakoon/sas-back-abm/tlogs"; log_level = "info"; log_config = None; batched_transaction_config = None; lease_period = 10; master = "Elected"; is_laggy
Jul 09 13:59:52 node3 arakoon[20625]: = false; is_learner = false; is_witness = false; targets = []; compressor = Snappy; fsync = true; is_test = false; reporting = 300; _fsync_tlog_dir = true; node_tls = None; collapse_slowdown = None; head_copy_throttling = 0.; optimize_db_slowdown = 0. }]; log_cfgs = []; batched_transaction_cfgs = []; _master = Elected; _lease_period = 10; cluster_id = "sas-back-abm"; plugins = ["albamgr_plugin"]; nursery_cfg = None; tlog_max_entries = 5000; tlog_max_size = 33554432; max_value_size = 8388608; max_buffer_size = 33554432; client_buffer_capacity = 32; lcnum = 16384; ncnum = 8192; tls = None }
Jul 09 13:59:52 node3 arakoon[20625]: 2019-07-09 13:59:52 042859 +0800 - node3 - 20625/0000 - arakoon - 6 - info - Batched_store.max_entries = 100
Jul 09 13:59:52 node3 arakoon[20625]: 2019-07-09 13:59:52 042866 +0800 - node3 - 20625/0000 - arakoon - 7 - info - Batched_store.max_size = 100000
Jul 09 13:59:52 node3 arakoon[20625]: 2019-07-09 13:59:52 042877 +0800 - node3 - 20625/0000 - arakoon - 8 - info - autofix:true
Jul 09 13:59:52 node3 arakoon[20625]: 2019-07-09 13:59:52 042885 +0800 - node3 - 20625/0000 - arakoon - 9 - info - loading plugin albamgr_plugin
Jul 09 13:59:52 node3 arakoon[20625]: 2019-07-09 13:59:52 042893 +0800 - node3 - 20625/0000 - arakoon - 10 - info - qualified as: /mnt/ssd1/arakoon/sas-back-abm/db/albamgr_plugin.cmxs
Jul 09 13:59:52 node3 arakoon[20625]: 2019-07-09 13:59:52 058320 +0800 - node3 - 20625/0000 - arakoon - 11 - info - albamgr_plugin (1,3,25) git_revision:heads/master-0-ge43faca-dirty
Jul 09 13:59:52 node3 arakoon[20625]: 2019-07-09 13:59:52 058358 +0800 - node3 - 20625/0000 - arakoon - 12 - info - Cluster not part of nursery.
Jul 09 13:59:52 node3 arakoon[20625]: 2019-07-09 13:59:52 058408 +0800 - node3 - 20625/0000 - arakoon - 13 - info - cfg = { node_name = "IqXn44EwnGDSnQrd"; ips = ["192.168.0.35"]; client_port = 26408; messaging_port = 26409; home = "/mnt/ssd1/arakoon/sas-back-abm/db"; tlog_dir = "/mnt/ssd1/arakoon/sas-back-abm/tlogs"; log_sinks = [Console]; crash_log_sinks = [File(console:)]; tlx_dir = "/mnt/ssd1/arakoon/sas-back-abm/tlogs"; head_dir = "/mnt/ssd1/arakoon/sas-back-abm/tlogs"; log_level = "info"; log_config = None; batched_transaction_config = None; lease_period = 10; master = "Elected"; is_laggy = false; is_learner = false; is_witness = false; targets = []; compressor = Snappy; fsync = true; is_test = false; reporting = 300; _fsync_tlog_dir = true; node_tls = None; collapse_slowdown = None; head_copy_throttling = 0.; optimize_db_slowdown = 0. }
Jul 09 13:59:52 node3 arakoon[20625]: 2019-07-09 13:59:52 058486 +0800 - node3 - 20625/0000 - arakoon - 14 - info - other: XHGfO1fHVDeRKHxa
Jul 09 13:59:52 node3 arakoon[20625]: 2019-07-09 13:59:52 058505 +0800 - node3 - 20625/0000 - arakoon - 15 - info - other: 3MHjJSpuV1GRFJPJ
Jul 09 13:59:52 node3 arakoon[20625]: 2019-07-09 13:59:52 058517 +0800 - node3 - 20625/0000 - arakoon - 16 - info - quorum_function gives 2 for 3
Jul 09 13:59:52 node3 arakoon[20625]: 2019-07-09 13:59:52 058527 +0800 - node3 - 20625/0000 - arakoon - 17 - info - DAEMONIZATION=false
Jul 09 13:59:52 node3 arakoon[20625]: 2019-07-09 13:59:52 058543 +0800 - node3 - 20625/0000 - arakoon - 18 - info - open_tlog_collection_and_store ~autofix:true
Jul 09 13:59:52 node3 arakoon[20625]: 2019-07-09 13:59:52 058566 +0800 - node3 - 20625/0000 - arakoon - 19 - info - Unlinking "/mnt/ssd1/arakoon/sas-back-abm/db/touch-20625"
Jul 09 13:59:52 node3 arakoon[20625]: 2019-07-09 13:59:52 058620 +0800 - node3 - 20625/0000 - arakoon - 20 - info - Unlink of "/mnt/ssd1/arakoon/sas-back-abm/db/touch-20625" failed with ENOENT
Jul 09 13:59:52 node3 arakoon[20625]: 2019-07-09 13:59:52 059048 +0800 - node3 - 20625/0000 - arakoon - 21 - info - Unlinking "/mnt/ssd1/arakoon/sas-back-abm/db/touch-20625"
Jul 09 13:59:52 node3 arakoon[20625]: 2019-07-09 13:59:52 059132 +0800 - node3 - 20625/0000 - arakoon - 22 - info - Unlinking "/mnt/ssd1/arakoon/sas-back-abm/tlogs/touch-20625"
Jul 09 13:59:52 node3 arakoon[20625]: 2019-07-09 13:59:52 059171 +0800 - node3 - 20625/0000 - arakoon - 23 - info - Unlink of "/mnt/ssd1/arakoon/sas-back-abm/tlogs/touch-20625" failed with ENOENT
Jul 09 13:59:52 node3 arakoon[20625]: 2019-07-09 13:59:52 059915 +0800 - node3 - 20625/0000 - arakoon - 24 - info - Unlinking "/mnt/ssd1/arakoon/sas-back-abm/tlogs/touch-20625"
Jul 09 13:59:52 node3 arakoon[20625]: 2019-07-09 13:59:52 059983 +0800 - node3 - 20625/0000 - arakoon - 25 - info - Unlinking "/mnt/ssd1/arakoon/sas-back-abm/tlogs/touch-20625"
Jul 09 13:59:52 node3 arakoon[20625]: 2019-07-09 13:59:52 060023 +0800 - node3 - 20625/0000 - arakoon - 26 - info - Unlink of "/mnt/ssd1/arakoon/sas-back-abm/tlogs/touch-20625" failed with ENOENT
Jul 09 13:59:52 node3 arakoon[20625]: 2019-07-09 13:59:52 060686 +0800 - node3 - 20625/0000 - arakoon - 27 - info - Unlinking "/mnt/ssd1/arakoon/sas-back-abm/tlogs/touch-20625"
Jul 09 13:59:52 node3 arakoon[20625]: 2019-07-09 13:59:52 060748 +0800 - node3 - 20625/0000 - arakoon - 28 - info - Unlinking "/mnt/ssd1/arakoon/sas-back-abm/tlogs/touch-20625"
Jul 09 13:59:52 node3 arakoon[20625]: 2019-07-09 13:59:52 060772 +0800 - node3 - 20625/0000 - arakoon - 29 - info - Unlink of "/mnt/ssd1/arakoon/sas-back-abm/tlogs/touch-20625" failed with ENOENT
Jul 09 13:59:52 node3 arakoon[20625]: 2019-07-09 13:59:52 061426 +0800 - node3 - 20625/0000 - arakoon - 30 - info - Unlinking "/mnt/ssd1/arakoon/sas-back-abm/tlogs/touch-20625"
Jul 09 13:59:52 node3 arakoon[20625]: 2019-07-09 13:59:52 061525 +0800 - node3 - 20625/0000 - arakoon - 31 - info - copy_file /mnt/ssd1/arakoon/sas-back-abm/tlogs/head.db /mnt/ssd1/arakoon/sas-back-abm/db/IqXn44EwnGDSnQrd.db (overwrite=false,throttling=0.000000) buffer_size:1048576
Jul 09 13:59:52 node3 arakoon[20625]: 2019-07-09 13:59:52 061568 +0800 - node3 - 20625/0000 - arakoon - 32 - info - Not copying /mnt/ssd1/arakoon/sas-back-abm/tlogs/head.db to /mnt/ssd1/arakoon/sas-back-abm/db/IqXn44EwnGDSnQrd.db because target already exists
Jul 09 13:59:52 node3 arakoon[20625]: 2019-07-09 13:59:52 061594 +0800 - node3 - 20625/0000 - arakoon - 33 - info - _init ~check_marker:true ~check_sabotage:true
Jul 09 13:59:52 node3 arakoon[20625]: 2019-07-09 13:59:52 064886 +0800 - node3 - 20625/0000 - arakoon - 34 - info - tlog_number:37613
Jul 09 13:59:52 node3 arakoon[20625]: 2019-07-09 13:59:52 648783 +0800 - node3 - 20625/0000 - arakoon - 35 - warning - AUTOFIX: improperly closed tlog: Tlog_map.TLCNotProperlyClosed(_)
Jul 09 13:59:52 node3 arakoon[20625]: 2019-07-09 13:59:52 648828 +0800 - node3 - 20625/0000 - arakoon - 36 - info - Unlinking "/mnt/ssd1/arakoon/sas-back-abm/db/IqXn44EwnGDSnQrd.db"
Jul 09 13:59:52 node3 arakoon[20625]: 188047351:0
Jul 09 13:59:52 node3 arakoon[20625]: 188047352:146
Jul 09 13:59:52 node3 arakoon[20625]: 188047353:289
Jul 09 13:59:52 node3 arakoon[20625]: 188047354:338

Best regards, Yongsheng

toolslive commented 5 years ago

If this is a 3 (or more) node cluster and it's just this node that's acting up, and it has been for a while now, it means the other nodes in the cluster can continue without this one and this one is severely lagging anyway. The simplest thing to do is to wipe the local contents (empty the database dir and the tlog dir) and only then restart it. It will repopulate itself from the others and will start participating in elections only after that.

If there's a large number of tlogs, you might want to run a collapse on the other nodes first (the framework probably does this once a day anyway).
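
On this setup the wipe-and-restart would look roughly like the following (service name and directories are taken from the journal output above; treat it as a sketch and double-check the paths before deleting anything):

$ systemctl stop ovs-arakoon-sas-back-abm
# empty the database dir, keeping the albamgr plugin that lives there
$ find /mnt/ssd1/arakoon/sas-back-abm/db -mindepth 1 ! -name 'albamgr_plugin.cmxs' -delete
# empty the tlog dir
$ find /mnt/ssd1/arakoon/sas-back-abm/tlogs -mindepth 1 -delete
$ systemctl start ovs-arakoon-sas-back-abm
# the node will now catch up from the other two before it rejoins elections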

yongshengma commented 5 years ago

This is a 3-node cluster. I checked the other nodes; the tlogs on this node are indeed lagging behind a lot. Here's a look:

[root@node3 tlogs]# ll -th 
total 583M
-rw-r--r-- 1 ovs ovs  81M Jul 10 11:16 37613.tlog
-rw-r--r-- 1 ovs ovs 3.5M Jun 21 17:44 37612.tlx
-rw-r--r-- 1 ovs ovs 3.6M Jun 21 16:16 37611.tlx
-rw-r--r-- 1 ovs ovs 3.7M Jun 21 14:47 37610.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 21 13:16 37609.tlx
-rw-r--r-- 1 ovs ovs 3.3M Jun 21 11:48 37608.tlx
-rw-r--r-- 1 ovs ovs 1.9M Jun 21 10:27 37607.tlx
-rw-r--r-- 1 ovs ovs 2.8M Jun 21 09:48 37606.tlx
-rw-r--r-- 1 ovs ovs 2.5M Jun 21 08:13 37605.tlx
-rw-r--r-- 1 ovs ovs 2.5M Jun 21 06:31 37604.tlx
-rw-r--r-- 1 ovs ovs 2.7M Jun 21 04:46 37603.tlx
-rw-r--r-- 1 ovs ovs 3.2M Jun 21 03:12 37602.tlx
-rw-r--r-- 1 ovs ovs 3.2M Jun 21 01:44 37601.tlx
-rw-r--r-- 1 ovs ovs 3.3M Jun 21 00:16 37600.tlx
-rw-r--r-- 1 ovs ovs 3.3M Jun 20 22:48 37599.tlx
-rw-r--r-- 1 ovs ovs 3.3M Jun 20 21:20 37598.tlx
-rw-r--r-- 1 ovs ovs 3.3M Jun 20 19:52 37597.tlx
-rw-r--r-- 1 ovs ovs 3.2M Jun 20 18:24 37596.tlx
-rw-r--r-- 1 ovs ovs 3.3M Jun 20 16:58 37595.tlx
-rw-r--r-- 1 ovs ovs 3.3M Jun 20 15:30 37594.tlx
-rw-r--r-- 1 ovs ovs 3.3M Jun 20 14:02 37593.tlx
-rw-r--r-- 1 ovs ovs 3.3M Jun 20 12:35 37592.tlx
-rw-r--r-- 1 ovs ovs 3.3M Jun 20 11:07 37591.tlx
-rw-r--r-- 1 ovs ovs 3.3M Jun 20 09:39 37590.tlx
-rw-r--r-- 1 ovs ovs 3.3M Jun 20 08:12 37589.tlx
-rw-r--r-- 1 ovs ovs 3.3M Jun 20 06:44 37588.tlx
-rw-r--r-- 1 ovs ovs 3.3M Jun 20 05:16 37587.tlx
-rw-r--r-- 1 ovs ovs 3.3M Jun 20 03:48 37586.tlx
-rw-r--r-- 1 ovs ovs 3.3M Jun 20 02:20 37585.tlx
-rw-r--r-- 1 ovs ovs 3.3M Jun 20 00:52 37584.tlx
-rw-r--r-- 1 ovs ovs 3.3M Jun 19 23:24 37583.tlx
-rw-r--r-- 1 ovs ovs 3.3M Jun 19 21:55 37582.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 19 20:27 37581.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 19 19:00 37580.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 19 17:33 37579.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 19 16:05 37578.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 19 14:37 37577.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 19 13:09 37576.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 19 11:41 37575.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 19 10:14 37574.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 19 08:46 37573.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 19 07:18 37572.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 19 05:50 37571.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 19 04:22 37570.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 19 02:54 37569.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 19 01:27 37568.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 18 23:59 37567.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 18 22:31 37566.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 18 21:03 37565.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 18 19:36 37564.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 18 18:08 37563.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 18 16:41 37562.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 18 15:13 37561.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 18 13:45 37560.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 18 12:17 37559.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 18 10:50 37558.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 18 09:22 37557.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 18 07:54 37556.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 18 06:27 37555.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 18 04:59 37554.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 18 03:31 37553.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 18 02:03 37552.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 18 00:35 37551.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 17 23:07 37550.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 17 21:40 37549.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 17 20:11 37548.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 17 18:44 37547.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 17 17:17 37546.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 17 15:50 37545.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 17 14:22 37544.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 17 12:54 37543.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 17 11:26 37542.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 17 09:59 37541.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 17 08:31 37540.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 17 07:03 37539.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 17 05:35 37538.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 17 04:07 37537.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 17 02:39 37536.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 17 01:12 37535.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 16 23:44 37534.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 16 22:16 37533.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 16 20:48 37532.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 16 19:21 37531.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 16 17:53 37530.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 16 16:25 37529.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 16 14:57 37528.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 16 13:30 37527.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 16 12:02 37526.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 16 10:34 37525.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 16 09:06 37524.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 16 07:38 37523.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 16 06:10 37522.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 16 04:43 37521.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 16 03:15 37520.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 16 01:47 37519.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 16 00:19 37518.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 15 22:51 37517.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 15 21:24 37516.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 15 19:56 37515.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 15 18:28 37514.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 15 17:00 37513.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 15 15:33 37512.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 15 14:04 37511.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 15 12:37 37510.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 15 11:09 37509.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 15 09:41 37508.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 15 08:13 37507.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 15 06:46 37506.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 15 05:18 37505.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 15 03:50 37504.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 15 02:22 37503.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 15 00:54 37502.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 14 23:27 37501.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 14 21:59 37500.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 14 20:31 37499.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 14 19:03 37498.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 14 17:36 37497.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 14 16:08 37496.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 14 14:40 37495.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 14 13:12 37494.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 14 11:44 37493.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 14 10:17 37492.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 14 08:49 37491.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 14 07:21 37490.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 14 05:53 37489.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 14 04:26 37488.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 14 02:58 37487.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 14 01:30 37486.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 14 00:02 37485.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 13 22:34 37484.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 13 21:07 37483.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 13 19:39 37482.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 13 18:11 37481.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 13 16:43 37480.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 13 15:15 37479.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 13 13:47 37478.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 13 12:19 37477.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 13 10:51 37476.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 13 09:24 37475.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 13 07:56 37474.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 13 06:28 37473.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 13 05:00 37472.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 13 03:33 37471.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 13 02:05 37470.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 13 00:37 37469.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 12 23:09 37468.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 12 21:41 37467.tlx
-rw-r--r-- 1 ovs ovs 2.1M Jun 12 20:14 37466.tlx
-rw-r--r-- 1 ovs ovs 6.2M Jun 12 20:14 head.db

BTW, what do you mean by a collapse on the other nodes?

toolslive commented 5 years ago

This is an abm, so there are lots of tlogs but only a small database. It's better to do this:

$ arakoon --copy-db-to-head sas-back-abm 192.168.0.35 26408 10
...
$ arakoon --copy-db-to-head sas-back-abm 192.168.0.33 26408 10 
...

On each of the other nodes, this copies the database to the head and keeps 10 tlogs, which makes catchup fast. Then wipe the lagging node and restart it.
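
As a quick sanity check afterwards (plain shell, nothing arakoon-specific), you can count what's left in the tlog directory on each of the other nodes; with the 10 above you'd expect roughly ten tlog files plus head.db:

$ ls /mnt/ssd1/arakoon/sas-back-abm/tlogs | wc -l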

yongshengma commented 5 years ago

OK. I emptied this abm's db (except 'albamgr_plugin.cmxs') and tlogs, but I didn't collapse on the other nodes. I can see this node is catching up from 001.tlog/tlx, and it looks like it will have to go all the way to 37613.tlx at least. It may take many days. Shall I stop the abm service and re-empty its db and tlogs directories?

toolslive commented 5 years ago

Did you run --copy-db-to-head .... 10 on the other nodes first? There should be fewer than 10 tlogs then.

yongshengma commented 5 years ago

No, I didn't run --copy-db-to-head on the other nodes.

yongshengma commented 5 years ago

I stopped the sas-back-abm service on node3.

On node1, I ran arakoon --copy-db-to-head sas-back-abm 192.168.0.33 26408 10 and it looks like it ended normally.

But on node2, I got the following error:

[root@node2 ~]# arakoon --copy-db-to-head sas-back-abm 192.168.0.34 26408 10
Uncaught exception:

  Arakoon_exc.Exception(10, "Operation cannot be performed on master node")

Raised at file "src/core/lwt.ml", line 805, characters 16-23
Called from file "src/unix/lwt_main.ml", line 34, characters 8-18
Called from file "src/main/arakoon.ml" (inlined), line 523, characters 25-203
Called from file "src/main/arakoon.ml", line 612, characters 7-23
Called from file "src/main/arakoon.ml", line 626, characters 9-16

All 3 of these nodes are master nodes.

toolslive commented 5 years ago

$ arakoon --drop-master sas-back-abm 192.168.0.34 26408
# requests the arakoon node to no longer be master (for now)
$ arakoon --copy-db-to-head sas-back-abm 192.168.0.34 26408 10
# re-run the operation after the node is no longer master.

Then, restart the wiped arakoon on the 3rd node.
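
For completeness, restarting the wiped node (node3 here) is just the usual systemd start, and following the journal is an easy way to watch the catchup; this assumes its db and tlogs directories were emptied as discussed above:

$ systemctl start ovs-arakoon-sas-back-abm
# follow the catchup in the journal
$ journalctl -fu ovs-arakoon-sas-back-abm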

yongshengma commented 5 years ago

OK, --drop-master works. I followed your suggestion; let's see how quickly it goes now. Really appreciate it!