osism / issues

This repository is used for bug reports that are cross-project or not bound to a specific repository (or to an unknown repository).
https://www.osism.tech

kolla containers are down for a lot of time during upgrade #973

Closed: maliblatt closed this issue 6 months ago

maliblatt commented 6 months ago

We recognized during the last upgrade to OSISM 7 (but maybe already with the last update) that a lot of kolla containers have a very long downtime during the upgrade process. We saw this behavior in several roles, for example in the common role. That alone would not be a big deal, but these long downtimes also occur with openvswitch and the ovn-controllers, which leads to a partial network connectivity outage for the instances for an (unnecessarily?) long time. All Docker images are pre-pulled, so that should not be the reason for the long downtime.

see here:

2024-04-15 09:35:41,458 p=12672 u=dragon n=ansible | TASK [openvswitch : Flush Handlers]
2024-04-15 09:35:42,953 p=12672 u=dragon n=ansible | RUNNING HANDLER [openvswitch : Restart openvswitch-db-server container]
2024-04-15 09:37:45,581 p=12672 u=dragon n=ansible | STILL ALIVE [task 'openvswitch : Restart openvswitch-db-server container' is running]

I do not know what happens in the meantime, but from my perspective the openvswitch-db container was down for nearly two minutes. I will try to collect more information on that, but I wanted to ask if anybody else has faced this issue. So far I could not reproduce it in our dev environment. I will record some more information during the next upgrade of a production region.

berendt commented 6 months ago

This is the related code: https://github.com/openstack/kolla-ansible/blob/master/ansible/module_utils/kolla_docker_worker.py#L360-L382

            self.stop_container()
            self.remove_container()
            self.start_container()

The process is made up of three phases: stop, remove, start. The runtime of the task itself therefore does not allow any specific conclusion about how long the container was actually unavailable. We would first have to find out whether the problem is caused by a certain number of nodes being processed simultaneously and whether the use of throttling improves the runtime on the manager itself.
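
One way to get an actual number for the unavailability, independent of the Ansible task runtime, would be to watch the Docker events for the container on an affected node while the handler runs. This is only a sketch; the container name openvswitch_db and the time window are examples.

# Prints a timestamped line for each stop/destroy/create/start of the container;
# the gap between "die"/"destroy" and the following "start" is the real downtime.
docker events --since 15m \
  --filter 'container=openvswitch_db' \
  --filter 'event=die' --filter 'event=stop' --filter 'event=destroy' \
  --filter 'event=create' --filter 'event=start'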

berendt commented 6 months ago

Maybe related: https://bugs.launchpad.net/kolla-ansible/+bug/2048130

# cat kolla-cron-container.service
# kolla-cron-container.service
# autogenerated by Kolla-Ansible

[Unit]
Description=docker kolla-cron-container.service
After=docker.service
Requires=docker.service
StartLimitIntervalSec=120
StartLimitBurst=10

[Service]
ExecStart=/usr/bin/docker start -a cron
ExecStop=/usr/bin/docker stop cron -t 60
Restart=always
RestartSec=11

[Install]
WantedBy=multi-user.target

The list of affected services matches our observations:

kolla-cron-container.service > Active: failed (Result: exit-code) since Thu 2024-01-04 19:02:06 UTC; 30s ago
kolla-designate_producer-container.service > Active: failed (Result: exit-code) since Thu 2024-01-04 19:06:32 UTC; 30s ago
kolla-keystone_fernet-container.service > Active: failed (Result: exit-code) since Thu 2024-01-04 19:15:44 UTC; 30s ago
kolla-letsencrypt_lego-container.service > Active: failed (Result: exit-code) since Thu 2024-01-04 19:17:46 UTC; 30s ago
kolla-magnum_api-container.service > Active: failed (Result: exit-code) since Thu 2024-01-04 19:19:07 UTC; 30s ago
kolla-mariadb_clustercheck-container.service > Active: failed (Result: exit-code) since Thu 2024-01-04 19:21:18 UTC; 30s ago
kolla-neutron_l3_agent-container.service > Active: failed (Result: exit-code) since Thu 2024-01-04 19:24:08 UTC; 30s ago
kolla-openvswitch_db-container.service > Active: failed (Result: exit-code) since Thu 2024-01-04 19:34:26 UTC; 30s ago
kolla-openvswitch_vswitchd-container.service > Active: failed (Result: exit-code) since Thu 2024-01-04 19:35:06 UTC; 30s ago
kolla-proxysql-container.service > Active: failed (Result: exit-code) since Thu 2024-01-04 19:36:28 UTC; 30s ago

berendt commented 6 months ago

The patch (https://review.opendev.org/c/openstack/kolla-ansible/+/904805) has been backported to stable/2023.1 and stable/2023.2, but it is not part of OSISM 6.0.2.

Possible solution: write a play that adds SuccessExitStatus=143 to all relevant units and reloads systemd. Since docker stop terminates the container's main process with SIGTERM, the docker start -a process wrapped by the unit exits with status 143 (128 + 15); without SuccessExitStatus=143, systemd treats this exit as a failure. The play has to run before the Kolla upgrade.
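
For a quick manual check of this idea on a single node, the same change could be made by hand before automating it in a play. A minimal sketch (run as root; GNU sed; the unit name is just one of the affected units):

# Append SuccessExitStatus=143 after the RestartSec line, then reload systemd,
# which is what the play in the next comment automates via lineinfile.
sed -i '/^RestartSec=/a SuccessExitStatus=143' /etc/systemd/system/kolla-openvswitch_db-container.service
systemctl daemon-reload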

berendt commented 6 months ago

@maliblatt Can you add this play to environments/custom/playbook-fix-973.yml and test it on a single node with osism apply fix-973 -l NODE? This will add the bug fix from https://bugs.launchpad.net/kolla-ansible/+bug/2048130 to all required unit files and will reload systemd afterwards. This should fix the issue. I will run a test in the evening.

Play will be included in 7.0.3.

---
- name: Fix for osism/issues#973
  hosts: "{{ hosts_fix_973|default('common') }}"

  vars:
    unit_files:
      - kolla-cron-container.service
      - kolla-designate_producer-container.service
      - kolla-keystone_fernet-container.service
      - kolla-letsencrypt_lego-container.service
      - kolla-magnum_api-container.service
      - kolla-mariadb_clustercheck-container.service
      - kolla-neutron_l3_agent-container.service
      - kolla-openvswitch_db-container.service
      - kolla-openvswitch_vswitchd-container.service
      - kolla-proxysql-container.service

  tasks:
    - name: Check the unit files to be repaired
      ansible.builtin.stat:
        path: "/etc/systemd/system/{{ item }}"
      loop: "{{ unit_files }}"
      register: result

    - name: Repair unit file
      become: true
      ansible.builtin.lineinfile:
        path: "/etc/systemd/system/{{ item.item }}"
        insertafter: "^RestartSec="
        line: "SuccessExitStatus=143"
      loop: "{{ result.results }}"
      when: item["stat"].exists|bool

    - name: Reload systemd daemon
      become: true
      ansible.builtin.systemd:
        daemon_reload: true
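
One way to verify on a node that a unit was actually patched and that systemd picked up the change (a spot-check sketch using one of the unit names from the vars list above):

# The line should appear in the unit file and in the loaded unit properties.
grep SuccessExitStatus /etc/systemd/system/kolla-openvswitch_db-container.service
systemctl show kolla-openvswitch_db-container.service -p SuccessExitStatus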

berendt commented 6 months ago

Keep open until everything is tested.

maliblatt commented 6 months ago

I have tested the patch successfully! With the patched systemd files the container restarts are as fast as expected and we do not see the long downtime any longer! I have tested the upgrade on two hosts in the same environment, one patched and one non-patched.

On the non-patched host we could clearly see the downtime of around 2 minutes for each affected container, which leads to a huge amount of packet loss for the running instances. See:

867aa91865b6 quay.io/osism/openvswitch-vswitchd:3.1.2.20230919 "dumb-init --single-…" 5 months ago Exited (143) 2 minutes ago openvswitch_vswitchd

On the previously patched host (with osism apply fix-973 -l NODE) there was no visible downtime; the containers were restarted immediately.

Thanks a lot for the quick fix!

berendt commented 6 months ago

Also tested. Looks good.