okozachenko1203 closed this issue 10 months ago
@mnaser what is your opinion?
@okozachenko1203 I think there is a CSI driver for cert-manager, but I don't believe this would work well because we need the pod name in order to be able to generate the certificate.
Can cert-manager give us a dynamic certificate based on the pod name using the CSI module? If so, we can get rid of the initContainer...
The bad thing is the cert-manager csi-driver doesn't support POD_IP:
Warning FailedMount 35s (x8 over 100s) kubelet MountVolume.SetUp failed for volume "api-tls" : rpc error: code = Unknown desc = generating certificate signing request: "csi.cert-manager.io/common-name": undefined variable "POD_IP", known variables: [POD_NAME POD_NAMESPACE POD_UID SERVICE_ACCOUNT_NAME]
@okozachenko1203 do we need the IP address for the cert? could we start using hostname instead?
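For reference, a minimal sketch of what a hostname-based inline CSI volume could look like, assuming the cert-manager csi-driver is installed; the issuer name (`libvirt-ca`), its kind, and the `openstack` namespace are hypothetical placeholders. Per the error above, only POD_NAME, POD_NAMESPACE, POD_UID and SERVICE_ACCOUNT_NAME are available as substitution variables, so the common name is keyed on the pod name rather than the pod IP:

```bash
# Sketch only: request a per-pod certificate keyed on the pod name via the
# cert-manager CSI driver. The heredoc delimiter is quoted so the shell does
# not expand ${POD_NAME} etc.; the csi-driver substitutes them at mount time.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: csi-tls-demo
  namespace: openstack
spec:
  containers:
    - name: demo
      image: busybox
      command: ["sleep", "3600"]
      volumeMounts:
        - name: api-tls
          mountPath: /etc/pki/demo
          readOnly: true
  volumes:
    - name: api-tls
      csi:
        driver: csi.cert-manager.io
        readOnly: true
        volumeAttributes:
          csi.cert-manager.io/issuer-name: libvirt-ca
          csi.cert-manager.io/issuer-kind: ClusterIssuer
          csi.cert-manager.io/common-name: "${POD_NAME}.${POD_NAMESPACE}"
          csi.cert-manager.io/dns-names: "${POD_NAME},${POD_NAME}.${POD_NAMESPACE}"
EOF
```

If I read the csi-driver docs correctly, it also re-issues the certificate in the mounted volume before expiry, though libvirt/QEMU would still need to re-read the new files.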
Hi -- I would prioritize this. We just got a cloud where no VMs were being created due to this.
The errors:
| fault | {'code': 500, 'created': '2024-01-07T07:37:33Z', 'message': 'Exceeded maximum number of retries. Exceeded max scheduling attempts 3 for instance d873e0c5-6ba3-4f84-972f-b62041a873ba. Last exception: internal error: process exited while connecting to monitor: 2024-01-07T07:37:27.940999Z qemu-system-x86_64: The serve', 'details': 'Traceback (most recent call last):\n File "/var/lib/openstack/lib/python3.10/site-packages/nova/conductor/manager.py", line 688, in build_instances\n scheduler_utils.populate_retry(\n File "/var/lib/openstack/lib/python3.10/site-packages/nova/scheduler/utils.py", line 998, in populate_retry\n raise exception.MaxRetriesExceeded(reason=msg)\nnova.exception.MaxRetriesExceeded: Exceeded maximum number of retries. Exceeded max scheduling attempts 3 for instance d873e0c5-6ba3-4f84-972f-b62041a873ba. Last exception: internal error: process exited while connecting to monitor: 2024-01-07T07:37:27.940999Z qemu-system-x86_64: The server certificate /etc/pki/libvirt-vnc/server-cert.pem has expired\n'} |
Context
We use TLS for VNC session encryption and auth, and for the libvirt API. All of these TLS certificates are generated per libvirt pod (issued inside the initContainer) and managed by cert-manager. Cert-manager does its job well, so all certs are renewed automatically, but the renewed certs are not reflected to libvirt. So after 90 days (the default certificate duration in cert-manager), all libvirt pods are stuck with expired certificates. This is a pretty serious incident because we cannot perform any actions on the hypervisors while it lasts.
Reason
We fetch the cert content inside the initContainer of the libvirt pods using these scripts: https://github.com/vexxhost/atmosphere/blob/main/charts/libvirt/values.yaml#L276-L321 https://github.com/vexxhost/atmosphere/blob/main/charts/libvirt/templates/bin/_libvirt.sh.tpl#L21-L42 As you can see, the cert contents are fetched only once, at generation time, and then copied to the desired paths, so the renewed certs are never reflected (see the sketch below).
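To illustrate the point, here is a rough sketch of the effect only, not the exact script from the chart (the secret and file names below are made up):

```bash
# Runs once in the initContainer: read the issued cert/key out of the
# cert-manager Secret and write them to the paths QEMU/libvirt expect.
kubectl -n openstack get secret libvirt-vnc-example \
  -o jsonpath='{.data.tls\.crt}' | base64 -d > /etc/pki/libvirt-vnc/server-cert.pem
kubectl -n openstack get secret libvirt-vnc-example \
  -o jsonpath='{.data.tls\.key}' | base64 -d > /etc/pki/libvirt-vnc/server-key.pem

# Nothing runs again after init, so when cert-manager later renews the Secret,
# the files above keep their original (eventually expired) contents.
```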
Workaround for now
Just roll out libvirt so that new pods are created and new certs are issued for them, e.g. with the command below.
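For example, assuming libvirt runs as a DaemonSet named `libvirt` in the `openstack` namespace (names may differ per deployment):

```bash
# Recreate the libvirt pods so the init step runs again and fresh
# certificates are issued, then wait for the rollout to finish.
kubectl -n openstack rollout restart daemonset/libvirt
kubectl -n openstack rollout status daemonset/libvirt
```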
We need to find a way to reflect the renewed certs.
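One conceivable direction, purely as a sketch (the paths, the Secret mount, and the assumption that QEMU re-reads the files when a new instance starts are mine, not something the chart does today): a small loop, e.g. in a sidecar, that re-copies the certificate whenever cert-manager renews the mounted Secret.

```bash
# Hypothetical watcher: whenever the mounted Secret content changes
# (i.e. cert-manager has renewed it), copy the new files into place.
SRC=/mnt/libvirt-vnc-secret     # Secret mounted as a volume (assumed path)
DST=/etc/pki/libvirt-vnc        # path QEMU reads the VNC server cert from

while true; do
  if ! cmp -s "${SRC}/tls.crt" "${DST}/server-cert.pem"; then
    cp "${SRC}/tls.crt" "${DST}/server-cert.pem"
    cp "${SRC}/tls.key" "${DST}/server-key.pem"
    cp "${SRC}/ca.crt"  "${DST}/ca-cert.pem"
  fi
  sleep 60
done
```

Whether already-running libvirt/QEMU processes pick the new files up without a reload is a separate question that would still need answering.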