Fatal glibc error: CPU does not support x86-64-v2

kvaps commented 1 year ago

What happened: CDI does not work on a Common KVM processor

What you expected to happen: CDI should be able to run in any environment

How to reproduce it (as minimally and precisely as possible):

Create a VM with the kvm64 processor. This is the default mode in many hypervisors, e.g. Proxmox.
Try to build and install CDI.

Environment:

CDI version (use kubectl get deployments cdi-deployment -o yaml): 1.55.2
Kubernetes version (use kubectl version): v1.23.16
DV specification: N/A
Cloud provider or hardware configuration: qemu-kvm virtual machines
OS (e.g. from /etc/os-release): Ubuntu 22.04.1 LTS
Kernel (e.g. uname -a): 5.15.0-67-generic
Install tools: deckhouse
Others: N/A

alromeros commented 1 year ago

Hi @kvaps, thanks for opening this issue

I've been doing some research, and this seems to be a fairly common issue on virtualization platforms when the guest OS has higher CPU requirements (usually x86-64-v2 compatibility).

Quoting a case with the same error in a RHEL9 guest:

The Virtualization Platform is likely hiding CPU features to enable live migration compatibility between different hypervisor machines with different CPU models, it needs to be raised to a newer minimum set of features that is compatible with x86-64-v2.

The Virtualization platforms often allow configuring a minimum denominator CPU model, which is a subset or equal to all physical hosts actual CPUs. This is in order to hide features from the VMs so that they can live migrate between hosts with different CPU models, so that the VM does not suddenly misses CPU features when migrating. This setting is likely set to a value too low, below the minimum requirements of RHEL9.

I've seen several other people having this issue with kvm64. The problem seems to be solved by changing the CPU type to match the host:

https://forum.proxmox.com/threads/kernel-panic-when-creating-vms-centos-9-stream-iso.104656/
https://github.com/ansible/awx/issues/11879
https://forums.centos.org/viewtopic.php?t=78733

Knowing this, I would not consider this a bug in CDI. It can probably be solved by changing that CPU setting in Proxmox.
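For anyone who wants to confirm from inside the guest which features are hidden, here is a minimal sketch in Python. It assumes a Linux guest and the flag names the kernel prints in /proc/cpuinfo (`pni` is the kernel's name for SSE3). The function name `missing_v2_flags` is made up for illustration, and the check covers only the features that x86-64-v2 adds on top of baseline x86-64.

```python
# Features that x86-64-v2 adds on top of baseline x86-64, written the way
# the Linux kernel names them in /proc/cpuinfo ("pni" is SSE3).
X86_64_V2_FLAGS = frozenset(
    {"cx16", "lahf_lm", "popcnt", "pni", "sse4_1", "sse4_2", "ssse3"}
)


def missing_v2_flags(cpuinfo_text):
    """Return the sorted x86-64-v2 flags absent from the first 'flags' line.

    Hypothetical helper for illustration; it is not part of CDI or KubeVirt.
    """
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            present = set(value.split())
            return sorted(X86_64_V2_FLAGS - present)
    # No 'flags' line found (for example, not an x86 Linux system):
    # report every flag as missing.
    return sorted(X86_64_V2_FLAGS)
```

Passing it the contents of /proc/cpuinfo from the affected VM should list the features the hypervisor is hiding; an empty list means the additions checked here are all visible to the guest.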

Hope this helps! Let's see if some other people with more expertise in this topic can help too.

kvaps commented 1 year ago

This is not only a Proxmox problem; I have the same problem with OpenStack VMs. CDI does not work in virtual machines if the CPU is not set to host-cpu.

alromeros commented 1 year ago

The problem is common across several virtualization platforms, including OpenStack, supposedly because they restrict CPU features to improve live migration between hosts with different CPU models.

In the case of OpenStack Virtualization, general CPU options are set in the [libvirt] group in /etc/nova/nova.conf. It seems that you can find a workaround by tweaking cpu_mode, cpu_model and cpu_model_extra_flags (documentation), but I've yet to find the exact values required. Setting cpu_mode to host-model seems to be the way to go in most cases.
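As a rough way to see what a compute node is currently configured with, the sketch below reads `cpu_mode` from the `[libvirt]` group of nova.conf-style text. It assumes the file parses with Python's standard `configparser`, which should hold for simple files but is not guaranteed for every nova.conf. The helper names are hypothetical, and an unset option is reported as `None` rather than guessing nova's default.

```python
import configparser

# cpu_mode values under which the guest CPU is derived from the host CPU.
HOST_DERIVED_MODES = {"host-model", "host-passthrough"}


def libvirt_cpu_mode(nova_conf_text):
    """Return [libvirt]/cpu_mode from nova.conf-style text, or None if unset."""
    # Interpolation is disabled so literal '%' characters in values are kept.
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(nova_conf_text)
    return parser.get("libvirt", "cpu_mode", fallback=None)


def uses_host_cpu(nova_conf_text):
    """True only if cpu_mode is explicitly set to a host-derived mode."""
    return libvirt_cpu_mode(nova_conf_text) in HOST_DERIVED_MODES
```

For example, text containing `cpu_mode = custom` together with `cpu_model = kvm64` makes `uses_host_cpu` return `False`, which is the kind of configuration where the glibc error would be expected.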

Also, from qemu's documentation: https://www.qemu.org/docs/master/system/i386/cpu.html#other-non-recommended-x86-cpus

kvaps commented 1 year ago

It seems to be a common problem with RHEL 9: https://access.redhat.com/solutions/6833751

alromeros commented 1 year ago

Right... it seems to be a problem in every OS that requires x86-64-v2 support. Since all the workarounds suggest changing the CPU to host, it looks like most virtualization platforms simply can't run an OS with these CPU requirements on kvm64. That said, I don't think we can do much there.

kvaps commented 1 year ago

Why do other KubeVirt workloads work, then? AFAIK they are also based on CentOS Stream 9. What is different? A static build? glibc?

alromeros commented 1 year ago

Interesting, I'll do some research about that. Can you specify the workloads and their versions?

kvaps commented 1 year ago

Sure, here is my testing workflow:

Create the VM:

apiVersion: kubevirt.io/v1
kind: VirtualMachineInstance
metadata:
  name: testvmi-nocloud
spec:
  terminationGracePeriodSeconds: 30
  domain:
    cpu:
      model: kvm64
    resources:
      requests:
        memory: 1024M
    devices:
      disks:
      - name: containerdisk
        disk:
          bus: virtio
      - disk:
          bus: virtio
        name: cloudinitdisk
  volumes:
  - name: containerdisk
    containerDisk:
      image: kubevirt/fedora-cloud-container-disk-demo:latest
  - name: cloudinitdisk
    cloudInitNoCloud:
      userData: |-
        #cloud-config
        password: fedora
        chpasswd: { expire: False }

Enable kvm64 processor:

kubectl label node --all cpu-model.node.kubevirt.io/kvm64=true

Log in to the VM:
```
virtctl console testvmi-nocloud
```
Install docker

Try to run cdi-operator and virt-operator with the latest versions:

# docker run -ti --rm quay.io/kubevirt/cdi-operator:v1.56.0
Fatal glibc error: CPU does not support x86-64-v2
# docker run -ti --rm quay.io/kubevirt/virt-operator:v0.59.0
{"component":"virt-operator","level":"info","msg":"cannot find virt-operator's image\n","pos":"application.go:148","timestamp":"2023-04-10T11:55:12.530095Z"}

However, virt-handler does not work either:

# docker run -ti --rm quay.io/kubevirt/virt-handler:v0.59.0
Fatal glibc error: CPU does not support x86-64-v2

mhenriks commented 1 year ago

@kvaps

Yeah, it looks like kubevirt is setting static = "on", like here:

https://github.com/kubevirt/kubevirt/blob/main/cmd/virt-operator/BUILD.bazel#L19

We can maybe do that as well for some containers

akalenyu commented 1 year ago

I think this tails all the way down to QEMU (where it got fixed and backported): https://github.com/containers/podman/issues/15456#issuecomment-1398083220 https://gitlab.com/qemu-project/qemu/-/commit/d135f781405f7c78153aa65e0327b05a4aa72e50

Of course ignore this if qemu in your environment already contains this

alromeros commented 1 year ago

I think this tails all the way down to QEMU (where it got fixed and backported): containers/podman#15456 (comment) https://gitlab.com/qemu-project/qemu/-/commit/d135f781405f7c78153aa65e0327b05a4aa72e50

Of course ignore this if qemu in your environment already contains this

@kvaps did this solve the issue? Did you check if your qemu contains this fix?

alromeros commented 11 months ago

I think it's safe to close the issue since the fix has been addressed in the qemu layer (https://github.com/kubevirt/containerized-data-importer/issues/2652#issuecomment-1502426833). Feel free to reopen if necessary, thanks!
