This is blocked because @vuntz is missing a profile, so he cannot log in to the system and investigate.
For the record, the "no route to host" error occurred at 2017-04-19T13:42:25.501410 (in case others need to look at production.log).
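If anyone else needs to dig through the log, something like this should pull up the surrounding entries (log path assumed for a standard Crowbar install under /opt/dell):
grep -n -B 2 -A 10 "2017-04-19T13:42:25" /opt/dell/crowbar_framework/log/production.log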
So, I'm seeing some nodes where the nova entries for haproxy got dropped during the periodic chef-client runs at 2017-04-19T14:45:17+00:00 and 2017-04-19T15:13:53+00:00. What's interesting is that these were the first periodic chef-client runs after the apply failure.
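A quick way to spot affected control nodes is to count the nova entries left in haproxy.cfg (node names below are placeholders):
for n in node1 node2; do echo "== $n =="; ssh "$n" grep -c nova /etc/haproxy/haproxy.cfg; done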
So here's what happened:
- apply_role_pre_chef_client wasn't called: apply_role failed before apply_role_pre_chef_client is called
- in apply_role, we copy the proposal to the chef role nova-config-default, and the data in there is not fully "correct" until apply_role_pre_chef_client is called

Therefore we ended up with the nova attributes coming from the nova-config-default role being incomplete (basically lacking everything from apply_role_pre_chef_client, including the HA bits).

Not sure yet how to best fix it.
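The broken intermediate state should be visible on the chef server itself; dumping the role right after a failed apply would show the incomplete attributes (missing the HA bits):
knife role show nova-config-default -F json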
I think https://github.com/crowbar/crowbar-core/pull/1225 is the right fix.
Testing the change locally.
Steps to reproduce:
root@crowbar:~ # knife node show d52-54-77-77-01-01.virtual.cloud.suse.de -a nova.ha.enabled
nova.ha.enabled: true
root@crowbar:~ # iptables -A OUTPUT -p tcp --dport ssh -d 192.168.124.82 -j REJECT
# Apply nova proposal, see that it fails
root@crowbar:~ # knife node show d52-54-77-77-01-01.virtual.cloud.suse.de -a nova.ha.enabled
nova.ha.enabled: false
In case the node got fenced, you can recover it with:
rm /var/spool/corosync/block_automatic_start
systemctl start crowbar_join.service
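Once crowbar_join has finished, the node should show up as online in the cluster again, which a one-shot crm_mon on any cluster node can confirm:
crm_mon -1 | grep -i online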
Great! That's exactly how it can be reproduced. Thanks @matelakat !
Applying the fix:
curl -L https://patch-diff.githubusercontent.com/raw/crowbar/crowbar-core/pull/1225.patch | patch -p1 -d /opt/dell
systemctl restart crowbar.service
root@crowbar:~ # knife node show d52-54-77-77-01-01.virtual.cloud.suse.de -a nova.ha.enabled
nova.ha.enabled: true
# Apply nova proposal, see that it fails
root@crowbar:~ # iptables -D OUTPUT -p tcp --dport ssh -d 192.168.124.82 -j REJECT
root@crowbar:~ # ssh node1 chef-client
root@crowbar:~ # knife node show d52-54-77-77-01-01.virtual.cloud.suse.de -a nova.ha.enabled
nova.ha.enabled: true
Backport at https://github.com/sap-oc/crowbar-core/pull/29
We had an outage in production. Symptoms:
The nova API was no longer available. haproxy.cfg was changed by a local chef-client run on the control node running haproxy: all nova services were removed from haproxy.cfg. nova.conf was changed as well: all listen IPs/ports were reset to defaults.
To fix it quickly, we restored the local chef backups of the config files.
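For reference, chef-client keeps timestamped copies of the files it overwrites, so the restore can be done roughly like this (assuming the default file_backup_path of /var/chef/backup; adjust if configured differently):
cp "$(ls -t /var/chef/backup/etc/haproxy/haproxy.cfg.chef-* | head -n 1)" /etc/haproxy/haproxy.cfg
systemctl reload haproxy.service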
First analysis:
The nova proposal was applied to include new computes. One older compute had an issue and the proposal failed with:
At the same time (or right after), the local/periodic chef-client run on the control nodes happened (because the flock didn't cover all nodes) and changed nova.conf and haproxy.cfg. The local chef-client run failed:
The node mentioned in the proposal error was no longer in the kvm compute list of the nova proposal, but it still had the nova role assigned on its crowbar node. We manually removed it and re-applied the nova proposal. That run was successful, with no haproxy.cfg or nova.conf changes. The following local chef-client runs were also successful.
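For anyone checking for the same situation: a stale assignment like this shows up in the node's run list (node name below is a placeholder):
knife node show old-compute.example.com -r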
So my guess is that the missing flock on the chef pause file resulted in this race condition, where the local chef-client picked up wrong/default data (from the founder?).
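For the record, the serialization that seems to have been missing amounts to wrapping the local run in flock on the pause file, roughly like this (the actual lock path used by crowbar may differ):
flock /var/chef/pause-file.lock chef-client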