Closed JeffreyDevloo closed 7 years ago
@JeffreyDevloo what happened on this env? Why was there so much maintenance work to do? Any logs you can add so we can investigate what it was doing?
There was no need to do repair work. It's a bug in the detection of when auto-repair should happen (as evidenced by the fact that disabling auto-repair made the load go away).
@domsj what do you want to do with this bug? Can we fix it in Fargo? Is there a workaround (disable repair?)
I'm not exactly sure yet where the bug is, so I can't immediately fix it. (Some more code inspection might bring something up though.) I don't know why @JeffreyDevloo has seen this while we haven't seen it elsewhere. I suggest leaving it open for now, but removing the SRP label. (If we start seeing it again on other envs it probably makes sense to investigate further.)
Please raise the priority if this happens again.
It happened again on a @JeffreyDevloo env, not sure what he's doing wrong ;-)
The maintenance process is hogging the CPU. This time I only saw it doing so on one node, though.
root 2685 331 0.7 727112 122468 ? Rsl Nov22 3512:27 /usr/bin/alba maintenance --config arakoon://config/ovs/alba/backends/a724fb57-1d36-4462-9252-af08f7a11093/maintenance/config?ini=%2Fopt%2Fasd-manager%2Fconfig%2Farakoon_cacc.ini --log-sink console:
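To make the numbers in that process listing easier to read, here is a minimal sketch that splits the line into named fields. It assumes the line came from `ps aux` (column order USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND) and that TIME is cumulative CPU time in minutes:seconds; neither is stated in this thread. The names `PS_FIELDS` and `parse_ps_line` are made up for illustration.

```python
# Assumed column order of `ps aux` output (not confirmed in this thread).
PS_FIELDS = ["user", "pid", "cpu_pct", "mem_pct", "vsz_kib", "rss_kib",
             "tty", "stat", "start", "time", "command"]


def parse_ps_line(line):
    """Split one `ps aux`-style line into a dict of named fields."""
    # Split on whitespace at most 10 times so the command keeps its spaces.
    parts = line.split(None, len(PS_FIELDS) - 1)
    if len(parts) != len(PS_FIELDS):
        raise ValueError("unexpected number of columns: %d" % len(parts))
    rec = dict(zip(PS_FIELDS, parts))
    rec["pid"] = int(rec["pid"])
    rec["cpu_pct"] = float(rec["cpu_pct"])
    rec["mem_pct"] = float(rec["mem_pct"])
    rec["rss_kib"] = int(rec["rss_kib"])
    # Assumption: TIME is "minutes:seconds" of accumulated CPU time.
    minutes, seconds = rec["time"].split(":")
    rec["cpu_seconds"] = int(minutes) * 60 + int(seconds)
    return rec


line = ("root 2685 331 0.7 727112 122468 ? Rsl Nov22 3512:27 "
        "/usr/bin/alba maintenance --config arakoon://config/ovs/alba/backends/"
        "a724fb57-1d36-4462-9252-af08f7a11093/maintenance/config"
        "?ini=%2Fopt%2Fasd-manager%2Fconfig%2Farakoon_cacc.ini --log-sink console:")

rec = parse_ps_line(line)
print(rec["pid"], rec["cpu_pct"], rec["cpu_seconds"])
```

Under those assumptions the third column (331) is the CPU percentage, i.e. a bit more than three cores kept busy by the maintenance process.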
When I returned I found that the maintenance process was spiking in CPU usage. I had to take the following steps because my root partition was full of arakoon connection logs:
In the logs I found:
Nov 23 10:52:06 ovs-node-3 alba[2685]: 2016-11-23 10:52:06 334842 +0100 - ovs-node-3 - 2685/0 - alba/maintenance - 12469378 - info - Exn while repairing osd 49 (~namespace_id:2 ~object ~name:"00_000000d7_00" ~object_id:"\155\150\bs\154\004\152\239\159>>2\003[\219\2020`\189$5\186(\223\006)0L\178\021\179H"), will now try object rewrite: Nsm_model.Err.Nsm_exn(7, ""); backtrace:; Raised at file "queue.ml", line 68, characters 17-22; Called from file "src/tools/lwt_pool2.ml", line 98, characters 25-46
Nov 23 10:52:06 ovs-node-3 alba[2685]: 2016-11-23 10:52:06 514456 +0100 - ovs-node-3 - 2685/0 - alba/maintenance - 12469396 - info - Exn while repairing osd 3 (~namespace_id:2 ~object ~name:"00_00000155_00" ~object_id:"\023$4\007\208x/S\213\178H\202V\219\220k\196\206R\162\203\151\202\155\252;\193\230m\015\007\232"), will now try object rewrite: Nsm_model.Err.Nsm_exn(7, ""); backtrace:; Raised at file "queue.ml", line 68, characters 17-22; Called from file "src/tools/lwt_pool2.ml", line 98, characters 25-46
Nov 23 10:52:06 ovs-node-3 alba[2685]: 2016-11-23 10:52:06 587093 +0100 - ovs-node-3 - 2685/0 - alba/maintenance - 12469422 - info - Exn while repairing osd 52 (~namespace_id:2 ~object ~name:"00_000000c5_00" ~object_id:"8R\139\163\2044K\015\207\240=\253\199lFC\025#y\169\000\136\180K\149\186\148+\146S\210\152"), will now try object rewrite: Alba_client_errors.Error.Exn(8); backtrace:; Raised at file "queue.ml", line 68, characters 17-22; Called from file "src/tools/lwt_pool2.ml", line 98, characters 25-46
Nov 23 10:52:06 ovs-node-3 alba[2685]: 2016-11-23 10:52:06 593491 +0100 - ovs-node-3 - 2685/0 - alba/maintenance - 12469464 - info - Exn while repairing osd 52 (~namespace_id:2 ~object ~name:"00_000003c3_00" ~object_id:"7a\017\206\018i\153u\2292D\158M\020RL\170\233\237\012\2163\225'Y\184\0062\192\162\230\147"), will now try object rewrite: Alba_client_errors.Error.Exn(8); backtrace:; Raised at file "queue.ml", line 68, characters 17-22; Called from file "src/tools/lwt_pool2.ml", line 98, characters 25-46
Nov 23 10:52:06 ovs-node-3 alba[2685]: 2016-11-23 10:52:06 686855 +0100 - ovs-node-3 - 2685/0 - alba/maintenance - 12469500 - info - Exn while repairing osd 16 (~namespace_id:2 ~object ~name:"00_00000015_00" ~object_id:"amys\nb\148\142?]1\224\185Q\212\2218\191*Bs\143\011B\199\168\159\171\bzQ%"), will now try object rewrite: Nsm_model.Err.Nsm_exn(7, ""); backtrace:; Raised at file "queue.ml", line 68, characters 17-22; Called from file "src/tools/lwt_pool2.ml", line 98, characters 25-46
These, and many more "Exn while repairing osd XX" messages.
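Since there are many of these messages, a tally per OSD and per exception may help show whether the failures cluster on a few OSDs. The following is only a sketch: the regular expression is derived from the five lines quoted above and may not match other variants of the message, the sample lines are abbreviated copies of those lines (object ids and backtraces elided), and `tally_repair_exceptions` is a hypothetical helper, not part of alba.

```python
import re
from collections import Counter

# Pattern inferred from the log lines quoted in this ticket (an assumption).
# The exception is captured after the fixed "will now try object rewrite: "
# text, because the object_id itself can contain ")" characters.
REPAIR_EXN = re.compile(
    r"Exn while repairing osd (?P<osd>\d+) .*?"
    r"will now try object rewrite: (?P<exn>[\w.]+\([^)]*\))"
)


def tally_repair_exceptions(lines):
    """Count repair exceptions per OSD id and per exception text."""
    by_osd = Counter()
    by_exn = Counter()
    for line in lines:
        match = REPAIR_EXN.search(line)
        if match is None:
            continue  # not a repair-exception line
        by_osd[int(match.group("osd"))] += 1
        by_exn[match.group("exn")] += 1
    return by_osd, by_exn


# Abbreviated versions of the five lines above; object ids are elided.
SAMPLE = [
    'Exn while repairing osd 49 (~namespace_id:2 ~object ~name:"00_000000d7_00" ~object_id:"..."), will now try object rewrite: Nsm_model.Err.Nsm_exn(7, ""); backtrace:',
    'Exn while repairing osd 3 (~namespace_id:2 ~object ~name:"00_00000155_00" ~object_id:"..."), will now try object rewrite: Nsm_model.Err.Nsm_exn(7, ""); backtrace:',
    'Exn while repairing osd 52 (~namespace_id:2 ~object ~name:"00_000000c5_00" ~object_id:"..."), will now try object rewrite: Alba_client_errors.Error.Exn(8); backtrace:',
    'Exn while repairing osd 52 (~namespace_id:2 ~object ~name:"00_000003c3_00" ~object_id:"..."), will now try object rewrite: Alba_client_errors.Error.Exn(8); backtrace:',
    'Exn while repairing osd 16 (~namespace_id:2 ~object ~name:"00_00000015_00" ~object_id:"..."), will now try object rewrite: Nsm_model.Err.Nsm_exn(7, ""); backtrace:',
]

by_osd, by_exn = tally_repair_exceptions(SAMPLE)
print(by_osd.most_common())
print(by_exn.most_common())
```

To run it over a full log, the lines could be read with something like `gzip.open(path, "rt", errors="replace")` instead of using `SAMPLE`.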
At first the CPU usage spiked back to 350%, but after around 10 minutes the maintenance process was only using 10%.
It's still unclear why this happened. I added some more logging that should be available in the next version.
@domsj what needs to happen with this ticket? It is in status In Progress but no one is working on it/assigned to it?
This might not fix everything observed in this ticket, but closing it nonetheless. New observations need a new ticket then.
Problem description
CPU usage is spiking on all nodes within a cluster (see first picture below). The CPU spike is coming from alba maintenance (see second picture below).
Possible root of the problem
Unknown
Possible solution
Unknown
Temporary solution
Disabling auto-repair: alba update-maintenance-config --config etcd://127.0.0.1:2379/ovs/arakoon/vm-backend-abm/config --disable-auto-repair
Additional information
Complete log file (gzip)
alba-maintenance_vm-backend-wJ4OUV0jLiZe4P9H.log.gz
Setup
Hyperconverged setup
Package information