This is blocked because @vuntz is missing a profile, so he cannot log in to the system and investigate.
For the record, the "no route to host" error occurred at 2017-04-19T13:42:25.501410 (in case others need to look at production.log).
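If anyone else needs to dig through the log, something like this should pull up the surrounding entries (log path assumed for a standard Crowbar install under /opt/dell):
grep -n -B 2 -A 10 "2017-04-19T13:42:25" /opt/dell/crowbar_framework/log/production.log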
So, I'm seeing some nodes where the nova entries for haproxy got dropped during the periodic chef-client runs at 2017-04-19T14:45:17+00:00 and 2017-04-19T15:13:53+00:00. What's interesting is that these were the first periodic chef-client runs after the apply failure.
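A quick way to spot affected control nodes is to count the nova entries left in haproxy.cfg (node names below are placeholders):
for n in node1 node2; do echo "== $n =="; ssh "$n" grep -c nova /etc/haproxy/haproxy.cfg; done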
So here's what happened:
- apply_role_pre_chef_client wasn't called: apply_role failed before apply_role_pre_chef_client is called
- in apply_role, we copy the proposal to the chef role nova-config-default, and the data in there is not fully "correct" until apply_role_pre_chef_client is called

Therefore we ended up with the nova attributes coming from the nova-config-default role being incomplete (basically lacking everything from apply_role_pre_chef_client, including the HA bits).

Not sure yet how to best fix it.
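The broken intermediate state should be visible on the chef server itself; dumping the role right after a failed apply would show the incomplete attributes (missing the HA bits):
knife role show nova-config-default -F json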
I think https://github.com/crowbar/crowbar-core/pull/1225 is the right fix.
Testing the change locally.
Steps to reproduce:
root@crowbar:~ # knife node show d52-54-77-77-01-01.virtual.cloud.suse.de -a nova.ha.enabled
nova.ha.enabled: true
root@crowbar:~ # iptables -A OUTPUT -p tcp --dport ssh -d 192.168.124.82 -j REJECT
# Apply nova proposal, see that it fails
root@crowbar:~ # knife node show d52-54-77-77-01-01.virtual.cloud.suse.de -a nova.ha.enabled
nova.ha.enabled: false
In case the node got fenced, you can recover it with:
rm /var/spool/corosync/block_automatic_start
systemctl start crowbar_join.service
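Once crowbar_join has finished, the node should show up as online in the cluster again, which a one-shot crm_mon on any cluster node can confirm:
crm_mon -1 | grep -i online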
Great! That's exactly how it can be reproduced. Thanks @matelakat !
Applying the fix:
curl -L https://patch-diff.githubusercontent.com/raw/crowbar/crowbar-core/pull/1225.patch | patch -p1 -d /opt/dell
systemctl restart crowbar.service
root@crowbar:~ # knife node show d52-54-77-77-01-01.virtual.cloud.suse.de -a nova.ha.enabled
nova.ha.enabled: true
# Apply nova proposal, see that it fails
root@crowbar:~ # iptables -D OUTPUT -p tcp --dport ssh -d 192.168.124.82 -j REJECT
root@crowbar:~ # ssh node1 chef-client
root@crowbar:~ # knife node show d52-54-77-77-01-01.virtual.cloud.suse.de -a nova.ha.enabled
nova.ha.enabled: true
Backport at https://github.com/sap-oc/crowbar-core/pull/29
We had an outage in production. Symptoms:
The nova API was no longer available. haproxy.cfg was changed by a local chef-client run on the control node running haproxy: all nova services were removed from haproxy.cfg. nova.conf was changed as well: all listen IPs/ports were reset to defaults.
To fix it quickly, we restored the local chef backups of the config files.
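For reference, chef-client keeps timestamped copies of the files it overwrites, so the restore can be done roughly like this (assuming the default file_backup_path of /var/chef/backup; adjust if configured differently):
cp "$(ls -t /var/chef/backup/etc/haproxy/haproxy.cfg.chef-* | head -n 1)" /etc/haproxy/haproxy.cfg
systemctl reload haproxy.service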
First analysis:
The nova proposal was applied to include new computes. One older compute had an issue and the proposal failed with:
At the same time (or right after), the local/periodic chef-client run on the control nodes happened (because the flock didn't cover all nodes) and changed nova.conf and haproxy.cfg. The local chef-client run failed:
The node mentioned in the proposal error was no longer in the kvm compute list of the nova proposal, but it still had the nova role assigned on its crowbar node. We manually removed it and re-applied the nova proposal. That run was successful, with no haproxy.cfg or nova.conf changes. The following local chef-client runs were also successful.
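For anyone checking for the same situation: a stale assignment like this shows up in the node's run list (node name below is a placeholder):
knife node show old-compute.example.com -r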
So my guess is that the missing flock on the chef pause file resulted in this race condition, where the local chef-client picked up wrong/default data (from the founder?).
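For the record, the serialization that seems to have been missing amounts to wrapping the local run in flock on the pause file, roughly like this (the actual lock path used by crowbar may differ):
flock /var/chef/pause-file.lock chef-client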