It seems that the missing LF actually is a problem for the new HAProxy version; it wasn't a problem for previous releases.
In our case this is the result of an incorrectly applied overlay file that ends with:
use_backend acme_client_back if { path_reg ^/.well-known/acme-challenge/.+ }
option httplog
option forwardfor
default_backend xyz_server
{%- endraw %}
{%- endif %}
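A plausible cause: the {%- markers strip the whitespace, including the newline, immediately before each tag, so the rendered file can end without a final LF. Dropping the left-hand minus would keep the newline after the last configuration line (a sketch; whether the very last LF survives also depends on the renderer's keep_trailing_newline setting):
default_backend xyz_server
{% endraw %}
{% endif %}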
The handling of the restarts has to be looked at upstream.
The problem itself then results from a broken overlay file?
Yes! We haven't tried it without the overlay file, though.
The main issue here is that Kolla continues to upgrade the master HAProxy even when the backups are not functional. This has already led to downtime in another non-production environment. It seems that HAProxy is very strict with regard to whitespace and newline characters.
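Since HAProxy rejects such files outright, one way to catch this earlier would be to validate the rendered configuration before any container is restarted, using haproxy -c. A hypothetical pre-check task (the container name and config path are assumptions, not taken from the kolla-ansible role):
- name: Validate rendered haproxy configuration before restarting
  ansible.builtin.command: docker exec haproxy haproxy -c -f /etc/haproxy/haproxy.cfg
  changed_when: false  # -c only parses the configuration and exits, so this is a pure check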
For a test I added a bad HAProxy configuration in environments/kolla/files/overlays/haproxy/services.d/haproxy.cfg and initiated the upgrade with osism apply -an upgrade loadbalancer.
The backup haproxy containers will be stopped & started again.
[...]
TASK [loadbalancer : Copying over custom haproxy services configuration] *******
Monday 03 June 2024 19:36:12 +0000 (0:00:02.184) 0:00:54.230 ***********
changed: [testbed-node-0.testbed.osism.xyz] => (item=/opt/configuration/environments/kolla/files/overlays/haproxy/services.d/haproxy.cfg)
changed: [testbed-node-1.testbed.osism.xyz] => (item=/opt/configuration/environments/kolla/files/overlays/haproxy/services.d/haproxy.cfg)
changed: [testbed-node-2.testbed.osism.xyz] => (item=/opt/configuration/environments/kolla/files/overlays/haproxy/services.d/haproxy.cfg)
[...]
RUNNING HANDLER [loadbalancer : Stop backup keepalived container] **************
Monday 03 June 2024 19:44:49 +0000 (0:00:01.456) 0:09:31.333 ***********
skipping: [testbed-node-1.testbed.osism.xyz]
changed: [testbed-node-0.testbed.osism.xyz]
changed: [testbed-node-2.testbed.osism.xyz]
RUNNING HANDLER [loadbalancer : Stop backup haproxy container] *****************
Monday 03 June 2024 19:44:56 +0000 (0:00:07.545) 0:09:38.878 ***********
skipping: [testbed-node-1.testbed.osism.xyz]
changed: [testbed-node-0.testbed.osism.xyz]
changed: [testbed-node-2.testbed.osism.xyz]
RUNNING HANDLER [loadbalancer : Stop backup proxysql container] ****************
Monday 03 June 2024 19:45:03 +0000 (0:00:07.089) 0:09:45.968 ***********
skipping: [testbed-node-1.testbed.osism.xyz]
ok: [testbed-node-0.testbed.osism.xyz]
ok: [testbed-node-2.testbed.osism.xyz]
RUNNING HANDLER [loadbalancer : Start backup haproxy container] ****************
Monday 03 June 2024 19:45:05 +0000 (0:00:01.921) 0:09:47.890 ***********
skipping: [testbed-node-1.testbed.osism.xyz]
changed: [testbed-node-0.testbed.osism.xyz]
changed: [testbed-node-2.testbed.osism.xyz]
RUNNING HANDLER [loadbalancer : Wait for backup haproxy to start] **************
Monday 03 June 2024 19:45:12 +0000 (0:00:06.811) 0:09:54.702 ***********
STILL ALIVE [task 'loadbalancer : Wait for backup haproxy to start' is running] ***
STILL ALIVE [task 'loadbalancer : Wait for backup haproxy to start' is running] ***
[...]
fatal: [testbed-node-0.testbed.osism.xyz]: FAILED! => {"changed": false, "elapsed": 300, "msg": "Timeout when waiting for 192.168.16.10:61313"}
fatal: [testbed-node-2.testbed.osism.xyz]: FAILED! => {"changed": false, "elapsed": 300, "msg": "Timeout when waiting for 192.168.16.12:61313"}
RUNNING HANDLER [loadbalancer : Start backup proxysql container] ***************
Monday 03 June 2024 19:50:16 +0000 (0:05:03.380) 0:14:58.082 ***********
skipping: [testbed-node-1.testbed.osism.xyz]
RUNNING HANDLER [loadbalancer : Start backup keepalived container] *************
Monday 03 June 2024 19:50:17 +0000 (0:00:01.153) 0:14:59.236 ***********
skipping: [testbed-node-1.testbed.osism.xyz]
RUNNING HANDLER [loadbalancer : Stop master haproxy container] *****************
Monday 03 June 2024 19:50:18 +0000 (0:00:01.226) 0:15:00.462 ***********
changed: [testbed-node-1.testbed.osism.xyz]
RUNNING HANDLER [loadbalancer : Stop master proxysql container] ****************
Monday 03 June 2024 19:50:25 +0000 (0:00:06.884) 0:15:07.347 ***********
ok: [testbed-node-1.testbed.osism.xyz]
RUNNING HANDLER [loadbalancer : Stop master keepalived container] **************
Monday 03 June 2024 19:50:27 +0000 (0:00:01.929) 0:15:09.277 ***********
changed: [testbed-node-1.testbed.osism.xyz]
RUNNING HANDLER [loadbalancer : Start master haproxy container] ****************
Monday 03 June 2024 19:50:34 +0000 (0:00:06.947) 0:15:16.224 ***********
changed: [testbed-node-1.testbed.osism.xyz]
RUNNING HANDLER [loadbalancer : Wait for master haproxy to start] **************
Monday 03 June 2024 19:50:40 +0000 (0:00:06.098) 0:15:22.323 ***********
STILL ALIVE [task 'loadbalancer : Wait for master haproxy to start' is running] ***
STILL ALIVE [task 'loadbalancer : Wait for master haproxy to start' is running] ***
STILL ALIVE [task 'loadbalancer : Wait for master haproxy to start' is running] ***
[...]
The expected behaviour is to fail hard after the 'Wait for backup haproxy to start' task fails. That way the backup containers are not working, but the master container is still up and running.
This can be fixed by marking both 'Wait for backup' handlers with any_errors_fatal: true.
Upstream fix: https://review.opendev.org/c/openstack/kolla-ansible/+/921071
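For illustration, the change amounts to setting any_errors_fatal on the wait handlers, roughly like this (a minimal sketch; the handler body and variable names are assumptions, not copied from the review):
- name: Wait for backup haproxy to start
  any_errors_fatal: true  # abort the whole play instead of proceeding to the master
  ansible.builtin.wait_for:
    host: "{{ api_interface_address }}"  # hypothetical variable name
    port: 61313  # the monitor port seen in the logs above
    timeout: 300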
With this change:
[...]
RUNNING HANDLER [loadbalancer : Wait for backup haproxy to start] **************
Monday 03 June 2024 20:04:29 +0000 (0:00:06.334) 0:01:54.036 ***********
STILL ALIVE [task 'loadbalancer : Wait for backup haproxy to start' is running] ***
STILL ALIVE [task 'loadbalancer : Wait for backup haproxy to start' is running] ***
STILL ALIVE [task 'loadbalancer : Wait for backup haproxy to start' is running] ***
STILL ALIVE [task 'loadbalancer : Wait for backup haproxy to start' is running] ***
STILL ALIVE [task 'loadbalancer : Wait for backup haproxy to start' is running] ***
STILL ALIVE [task 'loadbalancer : Wait for backup haproxy to start' is running] ***
fatal: [testbed-node-0.testbed.osism.xyz]: FAILED! => {"changed": false, "elapsed": 300, "msg": "Timeout when waiting for 192.168.16.10:61313"}
fatal: [testbed-node-1.testbed.osism.xyz]: FAILED! => {"changed": false, "elapsed": 300, "msg": "Timeout when waiting for 192.168.16.11:61313"}
NO MORE HOSTS LEFT *************************************************************
PLAY RECAP *********************************************************************
2024-06-03 20:09:33 | INFO | Play has been completed. There may now be a delay until all logs have been written.
2024-06-03 20:09:33 | INFO | Please wait and do not abort execution.
testbed-manager.testbed.osism.xyz : ok=2 changed=0 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
testbed-node-0.testbed.osism.xyz : ok=28 changed=5 unreachable=0 failed=1 skipped=9 rescued=0 ignored=0
testbed-node-1.testbed.osism.xyz : ok=28 changed=5 unreachable=0 failed=1 skipped=9 rescued=0 ignored=0
testbed-node-2.testbed.osism.xyz : ok=24 changed=2 unreachable=0 failed=0 skipped=13 rescued=0 ignored=0
The backup HAProxy services are not healthy -> the play fails. The master is still up and running.
Hi,
while running "osism-kolla upgrade loadbalancer", the upgrade fails as follows:
In my opinion the master HAProxy must not be restarted if the backup HAProxys are in an unknown state. The current behaviour leads to downtime.
HAProxy error log:
[...]
We are currently investigating whether the last LF is the problem or if the problem lies somewhere else.
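A quick way to check the trailing byte of the rendered file (the path is illustrative):
- name: Check whether haproxy.cfg ends with a newline
  ansible.builtin.shell: tail -c 1 /etc/kolla/haproxy/haproxy.cfg | od -An -c
  register: last_byte
  changed_when: false  # read-only inspection of the last byte
If last_byte.stdout shows \n, the trailing LF is present; anything else means it is missing.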