JeffreyDevloo commented 8 years ago

Problem description

CPU usage spiking on all nodes within a cluster. (see first picture below) The CPU spike is coming from alba maintenance (see second picture below) selection_003 selection_004

Possible root of the problem

Unknown

Possible solution

Unknown

Temporary solution

Disabling auto-repair: alba update-maintenance-config --config etcd://127.0.0.1:2379/ovs/arakoon/vm-backend-abm/config --disable-auto-repair

Additional information

Complete log file (gzip)

alba-maintenance_vm-backend-wJ4OUV0jLiZe4P9H.log.gz

Setup

Hyperconverged setup

Three nodes with each three disks for the back-end
Package information
ii openvstorage 2.7.1-fargo.2-1 amd64 openvStorage
ii openvstorage-backend 1.7.1-fargo.1-1 amd64 openvStorage Backend plugin
ii openvstorage-backend-core 1.7.1-fargo.1-1 amd64 openvStorage Backend plugin core
ii openvstorage-backend-webapps 1.7.1-fargo.1-1 amd64 openvStorage Backend plugin Web Applications
ii openvstorage-cinder-plugin 1.2.1-fargo.1-1 amd64 OpenvStorage Cinder plugin for OpenStack
ii openvstorage-core 2.7.1-fargo.2-1 amd64 openvStorage core
ii openvstorage-hc 1.7.1-fargo.1-1 amd64 openvStorage Backend plugin HyperConverged
ii openvstorage-sdm 1.6.1-fargo.1-1 amd64 Open vStorage Backend ASD Manager
ii openvstorage-test 2.7.1-fargo.2-1 amd64 openvStorage autotest suite
ii openvstorage-webapps 2.7.1-fargo.2-1 amd64 openvStorage Web Applications

wimpers commented 8 years ago

@JeffreyDevloo what happened on this env? Why was there so much maintenance work to do? Any logs you can add so we can investigate what it was doing?

domsj commented 8 years ago

There was no need to do repair work. It's a bug in the detection of when auto-repair should happen (as evidenced by the fact that disabling auto-repair made the load go away).

wimpers commented 7 years ago

@domsj what do you want to to with this bug? Can we fix it in Fargo? Is there a workaround (disable repair?)

domsj commented 7 years ago

I'm not exactly sure yet where the bug is, so I can't immediately fix it. (Some more code inspection might bring something up though.) I don't know why @JeffreyDevloo has seen this but why we haven't seen it elsewhere. I suggest leaving it open for now, but remove the SRP label. (If we should start seeing it again on other envs it probably makes sense to further investigate.)

wimpers commented 7 years ago

Please add a higher priority if this would happen again.

domsj commented 7 years ago

It happened again on a @JeffreyDevloo env, not sure what he's doing wrong ;-)

JeffreyDevloo commented 7 years ago

Problem description

The maintenance process is hoarding the CPU for his own. I only saw it hoarding cpu on one node though this time.

root      2685  331  0.7 727112 122468 ?       Rsl  Nov22 3512:27 /usr/bin/alba maintenance --config arakoon://config/ovs/alba/backends/a724fb57-1d36-4462-9252-af08f7a11093/maintenance/config?ini=%2Fopt%2Fasd-manager%2Fconfig%2Farakoon_cacc.ini --log-sink console:

Setup

3 node setup
disklayout is the following (identical for every node)
Backend

Steps that I executed

First I deleted the disk with roles (sdd) and an asd disk (sda) on node 1
Then I removed 15 of the 18 asds on node 2.
Afterwards I left the environment running for the night

When I returned I found that the maintenance was spiking in cpu usage. Had to take to following steps because my root partition was full with connection logs of arakoon:

Remove the syslog.1 file

In the logs I found:

Nov 23 10:52:06 ovs-node-3 alba[2685]: 2016-11-23 10:52:06 334842 +0100 - ovs-node-3 - 2685/0 - alba/maintenance - 12469378 - info - Exn while repairing osd 49 (~namespace_id:2 ~object ~name:"00_000000d7_00" ~object_id:"\155\150\bs\154\004\152\239\159>>2\003[\219\2020`\189$5\186(\223\006)0L\178\021\179H"), will now try object rewrite: Nsm_model.Err.Nsm_exn(7, ""); backtrace:; Raised at file "queue.ml", line 68, characters 17-22; Called from file "src/tools/lwt_pool2.ml", line 98, characters 25-46
Nov 23 10:52:06 ovs-node-3 alba[2685]: 2016-11-23 10:52:06 514456 +0100 - ovs-node-3 - 2685/0 - alba/maintenance - 12469396 - info - Exn while repairing osd 3 (~namespace_id:2 ~object ~name:"00_00000155_00" ~object_id:"\023$4\007\208x/S\213\178H\202V\219\220k\196\206R\162\203\151\202\155\252;\193\230m\015\007\232"), will now try object rewrite: Nsm_model.Err.Nsm_exn(7, ""); backtrace:; Raised at file "queue.ml", line 68, characters 17-22; Called from file "src/tools/lwt_pool2.ml", line 98, characters 25-46
Nov 23 10:52:06 ovs-node-3 alba[2685]: 2016-11-23 10:52:06 587093 +0100 - ovs-node-3 - 2685/0 - alba/maintenance - 12469422 - info - Exn while repairing osd 52 (~namespace_id:2 ~object ~name:"00_000000c5_00" ~object_id:"8R\139\163\2044K\015\207\240=\253\199lFC\025#y\169\000\136\180K\149\186\148+\146S\210\152"), will now try object rewrite: Alba_client_errors.Error.Exn(8); backtrace:; Raised at file "queue.ml", line 68, characters 17-22; Called from file "src/tools/lwt_pool2.ml", line 98, characters 25-46
Nov 23 10:52:06 ovs-node-3 alba[2685]: 2016-11-23 10:52:06 593491 +0100 - ovs-node-3 - 2685/0 - alba/maintenance - 12469464 - info - Exn while repairing osd 52 (~namespace_id:2 ~object ~name:"00_000003c3_00" ~object_id:"7a\017\206\018i\153u\2292D\158M\020RL\170\233\237\012\2163\225'Y\184\0062\192\162\230\147"), will now try object rewrite: Alba_client_errors.Error.Exn(8); backtrace:; Raised at file "queue.ml", line 68, characters 17-22; Called from file "src/tools/lwt_pool2.ml", line 98, characters 25-46
Nov 23 10:52:06 ovs-node-3 alba[2685]: 2016-11-23 10:52:06 686855 +0100 - ovs-node-3 - 2685/0 - alba/maintenance - 12469500 - info - Exn while repairing osd 16 (~namespace_id:2 ~object ~name:"00_00000015_00" ~object_id:"amys\nb\148\142?]1\224\185Q\212\2218\191*Bs\143\011B\199\168\159\171\bzQ%"), will now try object rewrite: Nsm_model.Err.Nsm_exn(7, ""); backtrace:; Raised at file "queue.ml", line 68, characters 17-22; Called from file "src/tools/lwt_pool2.ml", line 98, characters 25-46

This and many more Exn while repairing osd XX

Temporary solution

Disabled the auto repair and waited 2 mins
Killed the cpu hoarding process
Restarted the maintenance service (no high cpu usage this time)
Enabled auto repair again (and waited 2 mins)

At first the CPU usage spiket back to 350% but after around 10 minutes, the maintenance process was only using 10%.

domsj commented 7 years ago

It's still unclear why this happened. Added some more logging that should be available in the next version

wimpers commented 7 years ago

@domsj what needs to happen with this ticket? It is in status In Progress but no one is working on it/assigned to it?

toolslive commented 7 years ago

708 fixes a case where maintenance starts spinning while trying to repair a bucket.

This might not fix everything observed in this ticket, but closing it nonetheless. New observations need a new ticket then.

openvstorage / alba

Auto-repair high cpu load #312

Problem description

Possible root of the problem

Possible solution

Temporary solution

Additional information

Complete log file (gzip)

Setup

Package information

Problem description

Setup

Steps that I executed

Temporary solution

708 fixes a case where maintenance starts spinning while trying to repair a bucket.