lxc / lxc

LXC - Linux Containers
https://linuxcontainers.org/lxc

lxc profile missing config after physical host was rebooted #4132

Open mullumaus opened 2 years ago

mullumaus commented 2 years ago

Required information

Issue description

We used Juju to deploy the ovn-chassis charm in an LXD container. An lxc profile was created for the ovn-chassis container:

$ lxc profile show juju-openstack-octavia-ovn-chassis-14
config:
  linux.kernel_modules: openvswitch
description: ""
devices: {}
name: juju-openstack-octavia-ovn-chassis-14
used_by:
- /1.0/containers/juju-a79b06-5-lxd-16

After the physical host was rebooted, the config in the lxc profile was missing and the container did not load the kernel module 'openvswitch':

$ lxc profile show juju-openstack-octavia-ovn-chassis-14
config: {}
description: ""
devices: {}
name: juju-openstack-octavia-ovn-chassis-14
used_by:
- /1.0/containers/juju-a79b06-5-lxd-16
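One way to catch this kind of loss soon after a reboot is to save the profile beforehand and compare it afterwards. The following is a minimal sketch and not part of the original report: `fetch_profile` and `dropped_config_keys` are hypothetical helper names, and `fetch_profile` assumes that `lxc query /1.0/profiles/<name>` prints the profile as JSON on the installed LXD version.

```python
import json
import subprocess


def fetch_profile(name):
    """Return a profile as a dict (assumes `lxc query` prints JSON)."""
    out = subprocess.run(
        ["lxc", "query", f"/1.0/profiles/{name}"],
        check=True, capture_output=True, text=True,
    ).stdout
    return json.loads(out)


def dropped_config_keys(before, after):
    """List config keys from `before` that are missing or changed in `after`."""
    old = before.get("config") or {}
    new = after.get("config") or {}
    return sorted(k for k, v in old.items() if new.get(k) != v)
```

Applied to the two outputs above, with `{"linux.kernel_modules": "openvswitch"}` as the saved config and `{}` as the config after the reboot, `dropped_config_keys` returns `["linux.kernel_modules"]`.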

Steps to reproduce

We cannot reproduce the issue reliably, although we have run into it more than once.

Information to attach

stgraber commented 2 years ago

@mullumaus you're using LXD 3.0.x which is only supported for security fixes at this point. Can you upgrade to LXD 5.0.x so you are on a version that we actively provide bugfixes for?

stgraber commented 2 years ago

(Note that you can't upgrade to LXD 5.0 directly from 3.0; you'll need to upgrade through 4.0 first.)

stgraber commented 2 years ago

It's possible that there's an issue with Dqlite in LXD 3.0 (the old Go implementation of dqlite) which, combined with an unclean termination of LXD on reboot, could cause recent DB transactions to be lost. We've not seen other reports of this, but it's certainly possible.

You may want to look at your systemd log or console output to see whether the machine appears to hang while waiting for LXD to exit. If it does, chances are the stop times out and systemd kills LXD, potentially causing database issues.
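A rough way to run that check is to pull the previous boot's journal for the LXD unit and filter it for lines hinting at a stop timeout. This is only a sketch under assumptions: the unit name `lxd.service` is a guess (snap installs use a different unit), the phrases in `SUSPECT_PHRASES` are heuristics rather than exact systemd messages, and the journal must be persistent for `--boot=-1` to return anything.

```python
import subprocess

# Heuristic substrings only; the exact wording of systemd's messages is
# not taken from this thread and may differ between versions.
SUSPECT_PHRASES = ("timed out", "killing", "sigkill")


def previous_boot_journal_cmd(unit):
    """Build a journalctl command for one unit during the previous boot."""
    return ["journalctl", "--boot=-1", "--unit", unit, "--no-pager"]


def suspicious_lines(journal_text):
    """Return journal lines that hint at a stop timeout or a forced kill."""
    return [
        line for line in journal_text.splitlines()
        if any(phrase in line.lower() for phrase in SUSPECT_PHRASES)
    ]


def check_unit(unit="lxd.service"):  # unit name is an assumption
    """Run journalctl for the previous boot and return the suspect lines."""
    out = subprocess.run(previous_boot_journal_cmd(unit),
                         capture_output=True, text=True).stdout
    return suspicious_lines(out)
```

An empty result from `check_unit()` does not rule out an unclean shutdown; it only means none of the guessed phrases appeared in that unit's log.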

LXD 4.0 has a completely different implementation of dqlite (in C) which uses a completely different way of storing things on disk. Combined with longer timeouts in the systemd units and reworked shutdown handling in LXD, we've never seen a report of the database somehow reverting itself there.

It could also not be an LXD issue at all and instead be Juju somehow reverting the profile. Even then, getting on a recent LXD will give you better tools to find that out, as you'll get lifecycle events, which let you easily monitor all changes made to LXD, including changes to profiles.
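As an illustration of how lifecycle events could be used for this, the sketch below filters a stream of JSON events for ones that concern a profile. The event shape is an assumption, not something confirmed in this thread: a `type` of `lifecycle`, plus `metadata.action` values starting with `profile-` and a `metadata.source` URL such as `/1.0/profiles/<name>`. The helper names are hypothetical.

```python
import json


def is_profile_change(event, profile=None):
    """True if `event` looks like a lifecycle event about a profile."""
    if event.get("type") != "lifecycle":
        return False
    meta = event.get("metadata") or {}
    if not str(meta.get("action", "")).startswith("profile-"):
        return False
    # Drop any query string (e.g. a project parameter) before matching.
    source = str(meta.get("source", "")).split("?", 1)[0]
    return profile is None or source.endswith("/profiles/" + profile)


def profile_changes(lines, profile=None):
    """Yield parsed events from JSON lines that concern a profile."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        if is_profile_change(event, profile):
            yield event
```

The input lines could come from the stdout of `lxc monitor --type=lifecycle`, assuming the installed client can emit one JSON event per line. Logging each yielded event with its timestamp would show whether something rewrote the profile before the reboot.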