This is the related code: https://github.com/openstack/kolla-ansible/blob/master/ansible/module_utils/kolla_docker_worker.py#L360-L382
self.stop_container()
self.remove_container()
self.start_container()
The restart is made up of three phases: stop, remove, start. The runtime of the Ansible task itself does not allow any firm conclusions about how long the container is actually unavailable. We would first have to find out whether the problem is caused by many nodes being processed simultaneously and whether using throttling improves the runtime on the manager itself.
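To test that hypothesis, one option would be to limit concurrency for the restart step, e.g. with the Ansible throttle keyword. A minimal sketch (the task below is illustrative only, not the actual kolla-ansible handler):

# Illustrative only: limit how many hosts run the restart at the same time,
# to see whether concurrency on the manager affects the task runtime.
- name: Restart container (placeholder for the real restart handler)
  become: true
  ansible.builtin.command: docker restart cron
  throttle: 1   # process one node at a time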
Maybe related: https://bugs.launchpad.net/kolla-ansible/+bug/2048130
# cat kolla-cron-container.service
# kolla-cron-container.service
# autogenerated by Kolla-Ansible
[Unit]
Description=docker kolla-cron-container.service
After=docker.service
Requires=docker.service
StartLimitIntervalSec=120
StartLimitBurst=10
[Service]
ExecStart=/usr/bin/docker start -a cron
ExecStop=/usr/bin/docker stop cron -t 60
Restart=always
RestartSec=11
[Install]
WantedBy=multi-user.target
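For context on why these units end up failed: docker stop terminates the container with SIGTERM, so the process run by ExecStart=/usr/bin/docker start -a ... exits with status 143 (128 + 15). By default systemd does not count exit code 143 as a clean exit, which is why the units listed below report Result: exit-code. The upstream fix adds SuccessExitStatus=143 so that a stop during the upgrade is treated as success; the patched [Service] section would then look roughly like this (a sketch, not the exact template output):

[Service]
ExecStart=/usr/bin/docker start -a cron
ExecStop=/usr/bin/docker stop cron -t 60
Restart=always
RestartSec=11
SuccessExitStatus=143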
The list of affected services matches our observations:
kolla-cron-container.service > Active: failed (Result: exit-code) since Thu 2024-01-04 19:02:06 UTC; 30s ago
kolla-designate_producer-container.service > Active: failed (Result: exit-code) since Thu 2024-01-04 19:06:32 UTC; 30s ago
kolla-keystone_fernet-container.service > Active: failed (Result: exit-code) since Thu 2024-01-04 19:15:44 UTC; 30s ago
kolla-letsencrypt_lego-container.service > Active: failed (Result: exit-code) since Thu 2024-01-04 19:17:46 UTC; 30s ago
kolla-magnum_api-container.service > Active: failed (Result: exit-code) since Thu 2024-01-04 19:19:07 UTC; 30s ago
kolla-mariadb_clustercheck-container.service > Active: failed (Result: exit-code) since Thu 2024-01-04 19:21:18 UTC; 30s ago
kolla-neutron_l3_agent-container.service > Active: failed (Result: exit-code) since Thu 2024-01-04 19:24:08 UTC; 30s ago
kolla-openvswitch_db-container.service > Active: failed (Result: exit-code) since Thu 2024-01-04 19:34:26 UTC; 30s ago
kolla-openvswitch_vswitchd-container.service > Active: failed (Result: exit-code) since Thu 2024-01-04 19:35:06 UTC; 30s ago
kolla-proxysql-container.service > Active: failed (Result: exit-code) since Thu 2024-01-04 19:36:28 UTC; 30s ago
The patch (https://review.opendev.org/c/openstack/kolla-ansible/+/904805) has been backported to stable/2023.1 and stable/2023.2, but it is not part of OSISM 6.0.2.
Possible solution: write a play that adds SuccessExitStatus=143 to all relevant units and reloads systemd. This play has to run before the Kolla upgrade.
@maliblatt Can you add this play to environments/custom/playbook-fix-973.yml and test it on a single node with osism apply fix-973 -l NODE? It will add the bug fix from https://bugs.launchpad.net/kolla-ansible/+bug/2048130 to all required unit files and reload systemd afterwards. This should fix the issue. I will run a test in the evening.
Play will be included in 7.0.3.
---
- name: Fix for osism/issues#973
  hosts: "{{ hosts_fix_973|default('common') }}"
  vars:
    unit_files:
      - kolla-cron-container.service
      - kolla-designate_producer-container.service
      - kolla-keystone_fernet-container.service
      - kolla-letsencrypt_lego-container.service
      - kolla-magnum_api-container.service
      - kolla-mariadb_clustercheck-container.service
      - kolla-neutron_l3_agent-container.service
      - kolla-openvswitch_db-container.service
      - kolla-openvswitch_vswitchd-container.service
      - kolla-proxysql-container.service

  tasks:
    - name: Check the unit files to be repaired
      ansible.builtin.stat:
        path: "/etc/systemd/system/{{ item }}"
      loop: "{{ unit_files }}"
      register: result

    - name: Repair unit file
      become: true
      ansible.builtin.lineinfile:
        path: "/etc/systemd/system/{{ item.item }}"
        insertafter: "^RestartSec="
        line: "SuccessExitStatus=143"
      loop: "{{ result.results }}"
      when: item["stat"].exists | bool

    - name: Reload systemd daemon
      become: true
      ansible.builtin.systemd:
        daemon_reload: true
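Before starting the Kolla upgrade, the result can be verified on a node. A minimal check might look like this (illustrative only, not part of the fix play itself; it fails if the line is missing):

- name: Verify fix-973 has been applied
  hosts: "{{ hosts_fix_973|default('common') }}"
  tasks:
    - name: Check that SuccessExitStatus=143 is present in the cron unit
      ansible.builtin.command: grep '^SuccessExitStatus=143' /etc/systemd/system/kolla-cron-container.service
      changed_when: false

Since lineinfile only adds the line when it is not already present, rerunning the fix play on an already patched node is safe.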
Keep open until everything is tested.
I have tested the patch successfully! With the patched systemd unit files the container restarts are as fast as expected and we no longer see the long downtime! I tested the upgrade on two hosts in the same environment, one patched and one unpatched.
On the unpatched host we could clearly see the downtime of around two minutes for each affected container, which leads to a huge amount of packet loss for the running instances.
see:
867aa91865b6 quay.io/osism/openvswitch-vswitchd:3.1.2.20230919 "dumb-init --single-…" 5 months ago Exited (143) 2 minutes ago openvswitch_vswitchd
On the previously patched host (with osism apply fix-973 -l NODE) there was no visible downtime; the containers were restarting immediately.
thanks a lot for the quick fix!
Also tested. Looks good.
We recognized during the last upgrade to OSISM 7 (but maybe already with the previous update) that a lot of Kolla containers have a very long downtime during the upgrade process. We saw this behavior in several roles, for example in the common role. That would not be a big deal, but these long downtimes also occur with openvswitch and the ovn-controllers, which leads to a partial network connectivity outage for the instances for an (unnecessarily?) long time. All Docker images are pre-pulled, so that should not be the reason for the long downtime.
see here:
2024-04-15 09:35:41,458 p=12672 u=dragon n=ansible | TASK [openvswitch : Flush Handlers]
2024-04-15 09:35:42,953 p=12672 u=dragon n=ansible | RUNNING HANDLER [openvswitch : Restart openvswitch-db-server container]
2024-04-15 09:37:45,581 p=12672 u=dragon n=ansible | STILL ALIVE [task 'openvswitch : Restart openvswitch-db-server container' is running]
I do not know what happens in the meantime, but from my perspective the openvswitch-db container was down for nearly two minutes. I will try to collect more info on that, but I wanted to ask whether anybody else has faced this issue. For now I could not reproduce it in our dev environment. I will record some more information during the next productive region upgrade.
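One way to measure the actual container downtime (as opposed to the runtime of the Ansible task) would be to watch the Docker event stream on the affected node during the upgrade, for example (container name and time window are only examples, adjust as needed):

docker events \
  --filter 'container=openvswitch_vswitchd' \
  --filter 'event=die' \
  --filter 'event=stop' \
  --filter 'event=start' \
  --since 1h

The gap between the die/stop events and the next start event shows the real unavailability of the container, independent of how long the Ansible handler runs.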