Open yongshengma opened 5 years ago
If this is a 3 (or more) node cluster, and it's just this node that's acting up, and it has been for a while now, it means that the other nodes in the cluster can continue without this one and this one is severely lagging anyway. The simplest thing to do is to wipe the local contents (make the database dir and the tlog dir empty) and only then restart it. It will fill itself using the others and will start participating in elections only after that.
If there's a large number of tlogs, you might want to run a collapse on the other nodes first (this is probably done once a day by the framework).
This is a 3-node cluster. I checked the other nodes, and the tlogs on this node are lagging far behind. Here is the listing:
[root@node3 tlogs]# ll -th
total 583M
-rw-r--r-- 1 ovs ovs 81M Jul 10 11:16 37613.tlog
-rw-r--r-- 1 ovs ovs 3.5M Jun 21 17:44 37612.tlx
-rw-r--r-- 1 ovs ovs 3.6M Jun 21 16:16 37611.tlx
-rw-r--r-- 1 ovs ovs 3.7M Jun 21 14:47 37610.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 21 13:16 37609.tlx
-rw-r--r-- 1 ovs ovs 3.3M Jun 21 11:48 37608.tlx
-rw-r--r-- 1 ovs ovs 1.9M Jun 21 10:27 37607.tlx
-rw-r--r-- 1 ovs ovs 2.8M Jun 21 09:48 37606.tlx
-rw-r--r-- 1 ovs ovs 2.5M Jun 21 08:13 37605.tlx
-rw-r--r-- 1 ovs ovs 2.5M Jun 21 06:31 37604.tlx
-rw-r--r-- 1 ovs ovs 2.7M Jun 21 04:46 37603.tlx
-rw-r--r-- 1 ovs ovs 3.2M Jun 21 03:12 37602.tlx
-rw-r--r-- 1 ovs ovs 3.2M Jun 21 01:44 37601.tlx
-rw-r--r-- 1 ovs ovs 3.3M Jun 21 00:16 37600.tlx
-rw-r--r-- 1 ovs ovs 3.3M Jun 20 22:48 37599.tlx
-rw-r--r-- 1 ovs ovs 3.3M Jun 20 21:20 37598.tlx
-rw-r--r-- 1 ovs ovs 3.3M Jun 20 19:52 37597.tlx
-rw-r--r-- 1 ovs ovs 3.2M Jun 20 18:24 37596.tlx
-rw-r--r-- 1 ovs ovs 3.3M Jun 20 16:58 37595.tlx
-rw-r--r-- 1 ovs ovs 3.3M Jun 20 15:30 37594.tlx
-rw-r--r-- 1 ovs ovs 3.3M Jun 20 14:02 37593.tlx
-rw-r--r-- 1 ovs ovs 3.3M Jun 20 12:35 37592.tlx
-rw-r--r-- 1 ovs ovs 3.3M Jun 20 11:07 37591.tlx
-rw-r--r-- 1 ovs ovs 3.3M Jun 20 09:39 37590.tlx
-rw-r--r-- 1 ovs ovs 3.3M Jun 20 08:12 37589.tlx
-rw-r--r-- 1 ovs ovs 3.3M Jun 20 06:44 37588.tlx
-rw-r--r-- 1 ovs ovs 3.3M Jun 20 05:16 37587.tlx
-rw-r--r-- 1 ovs ovs 3.3M Jun 20 03:48 37586.tlx
-rw-r--r-- 1 ovs ovs 3.3M Jun 20 02:20 37585.tlx
-rw-r--r-- 1 ovs ovs 3.3M Jun 20 00:52 37584.tlx
-rw-r--r-- 1 ovs ovs 3.3M Jun 19 23:24 37583.tlx
-rw-r--r-- 1 ovs ovs 3.3M Jun 19 21:55 37582.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 19 20:27 37581.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 19 19:00 37580.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 19 17:33 37579.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 19 16:05 37578.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 19 14:37 37577.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 19 13:09 37576.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 19 11:41 37575.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 19 10:14 37574.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 19 08:46 37573.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 19 07:18 37572.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 19 05:50 37571.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 19 04:22 37570.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 19 02:54 37569.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 19 01:27 37568.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 18 23:59 37567.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 18 22:31 37566.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 18 21:03 37565.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 18 19:36 37564.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 18 18:08 37563.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 18 16:41 37562.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 18 15:13 37561.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 18 13:45 37560.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 18 12:17 37559.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 18 10:50 37558.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 18 09:22 37557.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 18 07:54 37556.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 18 06:27 37555.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 18 04:59 37554.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 18 03:31 37553.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 18 02:03 37552.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 18 00:35 37551.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 17 23:07 37550.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 17 21:40 37549.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 17 20:11 37548.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 17 18:44 37547.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 17 17:17 37546.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 17 15:50 37545.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 17 14:22 37544.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 17 12:54 37543.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 17 11:26 37542.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 17 09:59 37541.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 17 08:31 37540.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 17 07:03 37539.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 17 05:35 37538.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 17 04:07 37537.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 17 02:39 37536.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 17 01:12 37535.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 16 23:44 37534.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 16 22:16 37533.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 16 20:48 37532.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 16 19:21 37531.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 16 17:53 37530.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 16 16:25 37529.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 16 14:57 37528.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 16 13:30 37527.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 16 12:02 37526.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 16 10:34 37525.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 16 09:06 37524.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 16 07:38 37523.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 16 06:10 37522.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 16 04:43 37521.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 16 03:15 37520.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 16 01:47 37519.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 16 00:19 37518.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 15 22:51 37517.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 15 21:24 37516.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 15 19:56 37515.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 15 18:28 37514.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 15 17:00 37513.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 15 15:33 37512.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 15 14:04 37511.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 15 12:37 37510.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 15 11:09 37509.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 15 09:41 37508.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 15 08:13 37507.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 15 06:46 37506.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 15 05:18 37505.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 15 03:50 37504.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 15 02:22 37503.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 15 00:54 37502.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 14 23:27 37501.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 14 21:59 37500.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 14 20:31 37499.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 14 19:03 37498.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 14 17:36 37497.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 14 16:08 37496.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 14 14:40 37495.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 14 13:12 37494.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 14 11:44 37493.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 14 10:17 37492.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 14 08:49 37491.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 14 07:21 37490.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 14 05:53 37489.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 14 04:26 37488.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 14 02:58 37487.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 14 01:30 37486.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 14 00:02 37485.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 13 22:34 37484.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 13 21:07 37483.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 13 19:39 37482.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 13 18:11 37481.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 13 16:43 37480.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 13 15:15 37479.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 13 13:47 37478.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 13 12:19 37477.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 13 10:51 37476.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 13 09:24 37475.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 13 07:56 37474.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 13 06:28 37473.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 13 05:00 37472.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 13 03:33 37471.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 13 02:05 37470.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 13 00:37 37469.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 12 23:09 37468.tlx
-rw-r--r-- 1 ovs ovs 3.5M Jun 12 21:41 37467.tlx
-rw-r--r-- 1 ovs ovs 2.1M Jun 12 20:14 37466.tlx
-rw-r--r-- 1 ovs ovs 6.2M Jun 12 20:14 head.db
BTW, what do you mean by collapse on the other nodes?
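To put a number on the lag, the highest tlog sequence number in a directory can be compared between nodes. The helper below is only a sketch: `highest_tlog` is a made-up name, and it assumes tlog file numbers are comparable across the nodes of one cluster and that files are named `<number>.tlog` or `<number>.tlx` as in the listing above.

```shell
#!/bin/sh
# Hypothetical helper: print the highest tlog sequence number in a directory.
# Counts both the open .tlog file and the compressed .tlx files; ignores
# anything else (for example head.db).
highest_tlog() {
    dir=$1
    ls "$dir" 2>/dev/null \
        | sed -n \
            -e 's/^\([0-9][0-9]*\)\.tlog$/\1/p' \
            -e 's/^\([0-9][0-9]*\)\.tlx$/\1/p' \
        | sort -n \
        | tail -n 1
}
```

Running `highest_tlog <tlog dir>` on each node and comparing the results gives a rough idea of how many tlog files the lagging node still has to fetch.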
This is an abm, so there are lots of tlogs but only a small database. It's better to do this:
$ arakoon --copy-db-to-head sas-back-abm 192.168.0.35 26408 10
...
$ arakoon --copy-db-to-head sas-back-abm 192.168.0.33 26408 10
...
It will, for the other nodes, copy the database to the head and keep 10 tlogs. This will make catchup fast. Then wipe the lagging node, and restart it.
OK. I emptied this abm's db directory (except 'albamgr_plugin.cmxs') and its tlogs directory, but I didn't collapse on the other nodes. I can see this node is catching up from 001.tlog/tlx, and it looks like it will have to reach at least 37613.tlx. That may take many days. Shall I stop the abm service and empty its db and tlogs directories again?
Did you run --copy-db-to-head .... 10 on the other nodes first? There should be fewer than 10 tlogs then.
No, I didn't run --copy-db-to-head on the other nodes.
I stopped the sas-back-abm service on node3.
On node1, I ran arakoon --copy-db-to-head sas-back-abm 192.168.0.33 26408 10 and it appears to have finished normally.
But on node2, I got the following error:
[root@node2 ~]# arakoon --copy-db-to-head sas-back-abm 192.168.0.34 26408 10
Uncaught exception:
Arakoon_exc.Exception(10, "Operation cannot be performed on master node")
Raised at file "src/core/lwt.ml", line 805, characters 16-23
Called from file "src/unix/lwt_main.ml", line 34, characters 8-18
Called from file "src/main/arakoon.ml" (inlined), line 523, characters 25-203
Called from file "src/main/arakoon.ml", line 612, characters 7-23
Called from file "src/main/arakoon.ml", line 626, characters 9-16
All these 3 nodes are master nodes.
$ arakoon --drop-master sas-back-abm 192.168.0.34 26408
# requests the arakoon node to no longer be master (for now)
$ arakoon --copy-db-to-head sas-back-abm 192.168.0.34 26408 10
# re-run the operation after the node is no longer master.
Then, restart the wiped arakoon on the 3rd node.
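For anyone landing here later, the whole sequence from this thread can be sketched as a small script. Treat it as an outline under assumptions, not a tested tool: the function names, the `DRY_RUN` switch and the two directory arguments are made up for illustration (the thread never gives the db and tlog paths), and the fallback retries after any failure of `--copy-db-to-head`, not only the "master node" error. The cluster name, port, tlog count and service name are the ones quoted in this issue.

```shell
#!/bin/sh
# Sketch of the recovery sequence discussed in this thread.
# DRY_RUN=1 (the default) only prints the commands instead of running them.
CLUSTER=sas-back-abm
PORT=26408
KEEP=10                                   # number of tlogs to keep
SERVICE=ovs-arakoon-sas-back-abm.service
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Collapse one healthy node. If the command fails (for example because the
# node is currently master), ask it to drop master and retry once.
collapse_node() {
    ip=$1
    if ! run arakoon --copy-db-to-head "$CLUSTER" "$ip" "$PORT" "$KEEP"; then
        run arakoon --drop-master "$CLUSTER" "$ip" "$PORT"
        run arakoon --copy-db-to-head "$CLUSTER" "$ip" "$PORT" "$KEEP"
    fi
}

# On the lagging node: stop the service, empty the db dir (keeping the
# albamgr plugin) and the tlog dir, then start the service again.
# Both directory arguments are hypothetical; use the paths from your config.
wipe_and_restart() {
    db_dir=$1
    tlog_dir=$2
    run systemctl stop "$SERVICE"
    run find "$db_dir" -mindepth 1 -maxdepth 1 ! -name albamgr_plugin.cmxs -exec rm -rf {} +
    run find "$tlog_dir" -mindepth 1 -maxdepth 1 -exec rm -rf {} +
    run systemctl start "$SERVICE"
}
```

With the defaults, `collapse_node 192.168.0.33` and `collapse_node 192.168.0.34` just print what would be run for the two healthy nodes; `wipe_and_restart <db dir> <tlog dir>` does the same for the lagging node. Set `DRY_RUN=0` only after checking the printed commands.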
OK, --drop-master works. I followed your suggestion; let's see how quick it will be. Really appreciate it!
Hello,
ovs-arakoon-sas-back-abm.service keeps on activating but it cannot start up. It has been like this for a few weeks. Just wondering what's going on.
Best regards, Yongsheng