osism / issues

This repository is used for bug reports that are cross-project or not bound to a specific repository (or to an unknown repository).
https://www.osism.tech

5.0.0.c: osism-kolla upgrade loadbalancer fails and stops all haproxies #483

Closed: nerdicbynature closed this issue 5 months ago

nerdicbynature commented 1 year ago

Hi,

when running "osism-kolla upgrade loadbalancer", the upgrade fails as follows:

RUNNING HANDLER [loadbalancer : Wait for backup haproxy to start] ***********************************************************************************************************************************************************************************************************************************************************************************************

fatal: [control2-dev2-az0.pco.ps-intern.de]: FAILED! => {"changed": false, "elapsed": 300, "msg": "Timeout when waiting for 10.70.194.82:61313"}
fatal: [control1-dev2-az0.pco.ps-intern.de]: FAILED! => {"changed": false, "elapsed": 300, "msg": "Timeout when waiting for 10.70.194.81:61313"}

RUNNING HANDLER [loadbalancer : Start backup proxysql container] ************************************************************************************************************************************************************************************************************************************************************************************************
skipping: [control3-dev2-az0.pco.ps-intern.de]

RUNNING HANDLER [loadbalancer : Start backup keepalived container] **********************************************************************************************************************************************************************************************************************************************************************************************
skipping: [control3-dev2-az0.pco.ps-intern.de]

RUNNING HANDLER [loadbalancer : Stop master haproxy container] **************************************************************************************************************************************************************************************************************************************************************************************************
changed: [control3-dev2-az0.pco.ps-intern.de]

RUNNING HANDLER [loadbalancer : Stop master proxysql container] *************************************************************************************************************************************************************************************************************************************************************************************************
ok: [control3-dev2-az0.pco.ps-intern.de]

RUNNING HANDLER [loadbalancer : Stop master keepalived container] ***********************************************************************************************************************************************************************************************************************************************************************************************
changed: [control3-dev2-az0.pco.ps-intern.de]

RUNNING HANDLER [loadbalancer : Start master haproxy container] *************************************************************************************************************************************************************************************************************************************************************************************************
changed: [control3-dev2-az0.pco.ps-intern.de]

RUNNING HANDLER [loadbalancer : Wait for master haproxy to start] ***********************************************************************************************************************************************************************************************************************************************************************************************
fatal: [control3-dev2-az0.pco.ps-intern.de]: FAILED! => {"changed": false, "elapsed": 300, "msg": "Timeout when waiting for 10.70.194.83:61313"}

In my opinion the master haproxy must not be restarted if the backup haproxies are in an unknown state. The current behaviour leads to downtime.

Haproxy error log:

[NOTICE]   (40) : haproxy version is 2.4.18-0ubuntu1.2
[NOTICE]   (40) : path to executable is /usr/sbin/haproxy
[ALERT]    (40) : parsing [/etc/haproxy/services.d/haproxy.cfg:41]: Missing LF on last line, file might have been truncated at position 43.
[ALERT]    (40) : Error(s) found in configuration file : /etc/haproxy/services.d/haproxy.cfg
[WARNING]  (40) : parsing [/etc/haproxy/services.d/horizon.cfg:7] : a 'http-request' rule placed after a 'use_backend' rule will still be processed before.
[WARNING]  (40) : parsing [/etc/haproxy/services.d/horizon.cfg:29] : a 'http-request' rule placed after a 'use_backend' rule will still be processed before.
[ALERT]    (40) : Fatal errors found in configuration.

We are currently investigating whether the missing last LF is the problem or whether the problem is somewhere else.

nerdicbynature commented 1 year ago

It seems that the missing LF actually is a problem for the new haproxy version; it wasn't a problem for previous releases.

In our case this is the result of a wrongly applied overlay file that ends with:

  use_backend acme_client_back if { path_reg ^/.well-known/acme-challenge/.+ }
  option httplog
  option forwardfor
  default_backend xyz_server

{%- endraw %}
{%- endif %}
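
One way to catch a truncated overlay before any restart handler fires is to run haproxy's built-in configuration check (haproxy -c) against the rendered files. The following is only an illustrative sketch, not part of the kolla-ansible role: the loadbalancer group name, the haproxy container name and the way the service configs are passed to the check are assumptions about a typical kolla deployment; only the services.d path is taken from the error log above.

# Illustrative pre-check only, not part of kolla-ansible.
# Group name, container name and config layout are assumptions.
- name: Validate rendered haproxy configuration before restarting containers
  hosts: loadbalancer
  gather_facts: false
  tasks:
    - name: Run the haproxy config check inside the haproxy container
      command: >
        docker exec haproxy
        haproxy -c -f /etc/haproxy/haproxy.cfg -f /etc/haproxy/services.d/
      changed_when: false

Such a check would have reported the "Missing LF on last line" alert before any running container was touched.
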
berendt commented 1 year ago

The handling of the restarts has to be looked at upstream.

The problem itself then results from a broken overlay file?

nerdicbynature commented 1 year ago

Yes! We haven't tried without the overlay file, though.

The main issue here is that Kolla continues to upgrade the master haproxy even after the backups have become non-functional. This has already led to downtime in another non-productive environment. It seems that haproxy is very strict with regard to whitespace and newline characters.

berendt commented 5 months ago

As a test I added a bad haproxy configuration to environments/kolla/files/overlays/haproxy/services.d/haproxy.cfg and initiated the upgrade with osism apply -an upgrade loadbalancer.

The backup haproxy containers are stopped and started again.

[...]
TASK [loadbalancer : Copying over custom haproxy services configuration] *******
Monday 03 June 2024  19:36:12 +0000 (0:00:02.184)       0:00:54.230 ***********
changed: [testbed-node-0.testbed.osism.xyz] => (item=/opt/configuration/environments/kolla/files/overlays/haproxy/services.d/haproxy.cfg)
changed: [testbed-node-1.testbed.osism.xyz] => (item=/opt/configuration/environments/kolla/files/overlays/haproxy/services.d/haproxy.cfg)
changed: [testbed-node-2.testbed.osism.xyz] => (item=/opt/configuration/environments/kolla/files/overlays/haproxy/services.d/haproxy.cfg)
[...]
RUNNING HANDLER [loadbalancer : Stop backup keepalived container] **************
Monday 03 June 2024  19:44:49 +0000 (0:00:01.456)       0:09:31.333 ***********
skipping: [testbed-node-1.testbed.osism.xyz]
changed: [testbed-node-0.testbed.osism.xyz]
changed: [testbed-node-2.testbed.osism.xyz]

RUNNING HANDLER [loadbalancer : Stop backup haproxy container] *****************
Monday 03 June 2024  19:44:56 +0000 (0:00:07.545)       0:09:38.878 ***********
skipping: [testbed-node-1.testbed.osism.xyz]
changed: [testbed-node-0.testbed.osism.xyz]
changed: [testbed-node-2.testbed.osism.xyz]

RUNNING HANDLER [loadbalancer : Stop backup proxysql container] ****************
Monday 03 June 2024  19:45:03 +0000 (0:00:07.089)       0:09:45.968 ***********
skipping: [testbed-node-1.testbed.osism.xyz]
ok: [testbed-node-0.testbed.osism.xyz]
ok: [testbed-node-2.testbed.osism.xyz]

RUNNING HANDLER [loadbalancer : Start backup haproxy container] ****************
Monday 03 June 2024  19:45:05 +0000 (0:00:01.921)       0:09:47.890 ***********
skipping: [testbed-node-1.testbed.osism.xyz]
changed: [testbed-node-0.testbed.osism.xyz]
changed: [testbed-node-2.testbed.osism.xyz]

RUNNING HANDLER [loadbalancer : Wait for backup haproxy to start] **************
Monday 03 June 2024  19:45:12 +0000 (0:00:06.811)       0:09:54.702 ***********

STILL ALIVE [task 'loadbalancer : Wait for backup haproxy to start' is running] ***

STILL ALIVE [task 'loadbalancer : Wait for backup haproxy to start' is running] ***
[...]
fatal: [testbed-node-0.testbed.osism.xyz]: FAILED! => {"changed": false, "elapsed": 300, "msg": "Timeout when waiting for 192.168.16.10:61313"}
fatal: [testbed-node-2.testbed.osism.xyz]: FAILED! => {"changed": false, "elapsed": 300, "msg": "Timeout when waiting for 192.168.16.12:61313"}

RUNNING HANDLER [loadbalancer : Start backup proxysql container] ***************
Monday 03 June 2024  19:50:16 +0000 (0:05:03.380)       0:14:58.082 ***********
skipping: [testbed-node-1.testbed.osism.xyz]

RUNNING HANDLER [loadbalancer : Start backup keepalived container] *************
Monday 03 June 2024  19:50:17 +0000 (0:00:01.153)       0:14:59.236 ***********
skipping: [testbed-node-1.testbed.osism.xyz]

RUNNING HANDLER [loadbalancer : Stop master haproxy container] *****************
Monday 03 June 2024  19:50:18 +0000 (0:00:01.226)       0:15:00.462 ***********
changed: [testbed-node-1.testbed.osism.xyz]

RUNNING HANDLER [loadbalancer : Stop master proxysql container] ****************
Monday 03 June 2024  19:50:25 +0000 (0:00:06.884)       0:15:07.347 ***********
ok: [testbed-node-1.testbed.osism.xyz]

RUNNING HANDLER [loadbalancer : Stop master keepalived container] **************
Monday 03 June 2024  19:50:27 +0000 (0:00:01.929)       0:15:09.277 ***********
changed: [testbed-node-1.testbed.osism.xyz]

RUNNING HANDLER [loadbalancer : Start master haproxy container] ****************
Monday 03 June 2024  19:50:34 +0000 (0:00:06.947)       0:15:16.224 ***********
changed: [testbed-node-1.testbed.osism.xyz]

RUNNING HANDLER [loadbalancer : Wait for master haproxy to start] **************
Monday 03 June 2024  19:50:40 +0000 (0:00:06.098)       0:15:22.323 ***********

STILL ALIVE [task 'loadbalancer : Wait for master haproxy to start' is running] ***

STILL ALIVE [task 'loadbalancer : Wait for master haproxy to start' is running] ***

STILL ALIVE [task 'loadbalancer : Wait for master haproxy to start' is running] ***
[...]

The expected behaviour is to fail hard after the "Wait for backup haproxy to start" task has failed. That way the backup containers are not working, but the master container is still up and running.

berendt commented 5 months ago

This can be fixed by marking both "Wait for backup" handlers with any_errors_fatal: true.

Upstream fix: https://review.opendev.org/c/openstack/kolla-ansible/+/921071
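
For illustration, a handler marked this way would look roughly like the sketch below. The handler name and the monitor port 61313 are taken from the log output above; the wait_for parameters and variable names are assumptions, not a copy of the upstream code (see the linked review for the actual change).

# Sketch only; see the upstream review for the real handler.
- name: Wait for backup haproxy to start
  any_errors_fatal: true                  # a timeout on any backup node now aborts the whole play
  wait_for:
    host: "{{ api_interface_address }}"   # assumption: address variable used by the role
    port: 61313                           # monitor port seen in the timeouts above
    timeout: 300

With any_errors_fatal: true, the play ends as soon as one of the backup nodes times out, so the "Stop master haproxy container" handler never runs and the master keeps serving traffic.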

berendt commented 5 months ago

With this change:

[...]
RUNNING HANDLER [loadbalancer : Wait for backup haproxy to start] **************
Monday 03 June 2024  20:04:29 +0000 (0:00:06.334)       0:01:54.036 ***********

STILL ALIVE [task 'loadbalancer : Wait for backup haproxy to start' is running] ***

STILL ALIVE [task 'loadbalancer : Wait for backup haproxy to start' is running] ***

STILL ALIVE [task 'loadbalancer : Wait for backup haproxy to start' is running] ***

STILL ALIVE [task 'loadbalancer : Wait for backup haproxy to start' is running] ***

STILL ALIVE [task 'loadbalancer : Wait for backup haproxy to start' is running] ***

STILL ALIVE [task 'loadbalancer : Wait for backup haproxy to start' is running] ***
fatal: [testbed-node-0.testbed.osism.xyz]: FAILED! => {"changed": false, "elapsed": 300, "msg": "Timeout when waiting for 192.168.16.10:61313"}
fatal: [testbed-node-1.testbed.osism.xyz]: FAILED! => {"changed": false, "elapsed": 300, "msg": "Timeout when waiting for 192.168.16.11:61313"}

NO MORE HOSTS LEFT *************************************************************

PLAY RECAP *********************************************************************
2024-06-03 20:09:33 | INFO     | Play has been completed. There may now be a delay until all logs have been written.
2024-06-03 20:09:33 | INFO     | Please wait and do not abort execution.
testbed-manager.testbed.osism.xyz : ok=2    changed=0    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0
testbed-node-0.testbed.osism.xyz : ok=28   changed=5    unreachable=0    failed=1    skipped=9    rescued=0    ignored=0
testbed-node-1.testbed.osism.xyz : ok=28   changed=5    unreachable=0    failed=1    skipped=9    rescued=0    ignored=0
testbed-node-2.testbed.osism.xyz : ok=24   changed=2    unreachable=0    failed=0    skipped=13   rescued=0    ignored=0

Backup haproxy services are not healthy -> the play fails. The master is still up and running.