sap-oc / crowbar-openstack

Openstack deployment for Crowbar

race condition when flock to chef pause file fails on a node #34

Closed: tpatzig closed this issue 7 years ago

tpatzig commented 7 years ago

We had an outage in production. Symptoms:

The nova API was no longer available. haproxy.cfg had been changed by the local chef-client run on the control node running haproxy: all nova services were removed from haproxy.cfg. nova.conf had been changed as well, with all listen IPs/ports reset to their defaults.

To fix it quickly, we restored the local chef backups of the config files.
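
For reference, a rough sketch of what such a restore can look like on the control node, assuming chef-client's default file_backup_path of /var/chef/backup (the exact commands used during the outage are not in this ticket):

    # chef-client keeps a copy of every file it overwrites under its backup path,
    # mirroring the original path and appending a timestamp suffix.
    ls -lt /var/chef/backup/etc/haproxy/
    # Restore the most recent backup (verify its contents first) and reload:
    cp "$(ls -t /var/chef/backup/etc/haproxy/haproxy.cfg.chef-* | head -n 1)" /etc/haproxy/haproxy.cfg
    systemctl reload haproxy
    # Same approach for nova.conf under /var/chef/backup/etc/nova/, followed by a
    # restart of the affected nova services.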

First analysis:

Nova proposal was applied to include new computes. One older compute had an issue and the proposal failed with:

"crowbar-failed": "Failed to apply the proposal: On d00-25-b5-a0-02-d7.os4.eu-de-1.cc.cloud.sap, 'flock /var/chef/cache/pause-file.lock.meta bash -es' (pid 2146) failed, exitcode 255\nSTDERR:\nssh: connect to host d00-25-b5-a0-02-d7.os4.eu-de-1.cc.cloud.sap port 22: No route to host

At the same time (or right after), the local/periodic chef-client run on the control nodes happened (because the flock was not taken on all nodes) and changed nova.conf and haproxy.cfg. That local chef-client run failed:

ERROR: RuntimeError: crowbar-pacemaker_sync_mark[wait-nova_database] (nova::database line 28) had an error: RuntimeError: Cluster founder didn't set nova_database to 1!

The node mentioned in the proposal error was no longer in the KVM compute list of the nova proposal, but it still had the nova role assigned in its crowbar role. We manually removed it and re-applied the nova proposal. That run was successful, and there were no further haproxy.cfg or nova.conf changes. The following local chef-client runs were also successful.

So my guess is that the missing flock on the chef pause file resulted in this race condition, where the local chef-client got wrong/default data (from the founder?).
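
To make the suspected race a bit more concrete, here is a minimal sketch of the serialization the pause file is supposed to provide. The lock path is taken from the error above; everything else (node names, exact invocation) is simplified and is not the actual crowbar implementation:

    # Sketch only, not the real crowbar code.
    # During a proposal apply, crowbar runs the per-node work over ssh while
    # holding the pause-file lock on that node:
    ssh root@control-node 'flock /var/chef/cache/pause-file.lock.meta bash -es' <<'EOF'
    chef-client
    EOF

    # The periodic chef-client run on the node is expected to serialize on the
    # same lock before it starts:
    flock /var/chef/cache/pause-file.lock.meta chef-client

    # If the ssh step fails ("No route to host"), the lock never gets taken on
    # that node, so a periodic run can start in the middle of the apply and pick
    # up incomplete or default attribute data.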

matelakat commented 7 years ago

This is blocked because @vuntz is missing a profile, so he cannot log in to the system and investigate.

vuntz commented 7 years ago

For the record, the "no route to host" error occurred at 2017-04-19T13:42:25.501410 (in case others need to look at production.log).
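
For others digging through the logs, the failed flock should be easy to find with something like the following; the production.log path is an assumption based on a standard /opt/dell install, adjust for your setup:

    grep -n 'No route to host' /opt/dell/crowbar_framework/log/production.log
    # Then look at the chef-client runs that started shortly after that timestamp.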

vuntz commented 7 years ago

So, I'm seeing some nodes where the nova entries for haproxy got dropped during the periodic chef-client runs at 2017-04-19T14:45:17+00:00 and 2017-04-19T15:13:53+00:00. What's interesting is that in each case it was the first periodic chef-client run after the apply failure.
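
If you want to check other control nodes for the same symptom, a plain grep/stat is enough to see whether the nova frontends are still present and when the files were last rewritten (nothing crowbar-specific assumed here):

    # List the nova-related sections of the generated haproxy config
    grep -n 'nova' /etc/haproxy/haproxy.cfg
    # Check when chef last rewrote the config files
    stat -c '%y %n' /etc/haproxy/haproxy.cfg /etc/nova/nova.conf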

matelakat commented 7 years ago

https://github.com/sap-oc/crowbar-openstack/blob/sci1-2017-03-14-00/chef/cookbooks/nova/recipes/controller_ha.rb#L16

vuntz commented 7 years ago

So here's what happened:

Not sure yet how to best fix it.

vuntz commented 7 years ago

I think https://github.com/crowbar/crowbar-core/pull/1225 is the right fix.

matelakat commented 7 years ago

Testing the change locally.

matelakat commented 7 years ago

Steps to reproduce:

  1. Have an HA cloud:
     root@crowbar:~ # knife node show d52-54-77-77-01-01.virtual.cloud.suse.de -a nova.ha.enabled
     nova.ha.enabled:  true
  2. Deny crowbar from connecting to node1:
     root@crowbar:~ # iptables -A OUTPUT -p tcp --dport ssh -d 192.168.124.82 -j REJECT
  3. Apply the nova barclamp through the UI and see that it fails.
  4. Log in to node1 and do a chef-client run.
  5. See that HA is now reported to be false:
     root@crowbar:~ # knife node show d52-54-77-77-01-01.virtual.cloud.suse.de -a nova.ha.enabled
     nova.ha.enabled:  false

In case the node got fenced, you can recover it with:

rm  /var/spool/corosync/block_automatic_start
systemctl start crowbar_join.service
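
Afterwards you can check that the node rejoined the cluster, for example with the usual pacemaker tooling (assuming crm is installed on the node):

    systemctl status crowbar_join.service   # should complete without errors
    crm status                              # the node should be listed as Online again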

tpatzig commented 7 years ago

Great! That's exactly how it can be reproduced. Thanks @matelakat !

matelakat commented 7 years ago

Applying the fix:

curl -L https://patch-diff.githubusercontent.com/raw/crowbar/crowbar-core/pull/1225.patch | patch -p1 -d /opt/dell
systemctl restart crowbar.service
root@crowbar:~ # knife node show d52-54-77-77-01-01.virtual.cloud.suse.de -a nova.ha.enabled
nova.ha.enabled:  true
# Apply the nova proposal (node1 is still blocked by the iptables rule above) and see that it fails
# Then remove the firewall rule so node1 is reachable again:
root@crowbar:~ # iptables -D OUTPUT -p tcp --dport ssh -d 192.168.124.82 -j REJECT
root@crowbar:~ # ssh node1 chef-client
root@crowbar:~ # knife node show d52-54-77-77-01-01.virtual.cloud.suse.de -a nova.ha.enabled
nova.ha.enabled:  true

vuntz commented 7 years ago

Backport at https://github.com/sap-oc/crowbar-core/pull/29