harvester / harvester

Open source hyperconverged infrastructure (HCI) software
https://harvesterhci.io/
Apache License 2.0

[Question] Nested virtualization within Harvester #5243

Closed: mjj29 closed this issue 3 months ago

mjj29 commented 6 months ago

Hi there,

Do you support nested virtualization? Can I create a VM within Harvester and then run Hyper-V/KVM/VMware Workstation in that VM?

I tried, and it looks like the guest CPU doesn't have the vmx/svm flags, and /dev/kvm doesn't exist.

I'm looking for a replacement for VMware ESXi, but I need nested virtualization support.

Thanks, Matt

mjj29 commented 6 months ago

Looking at https://github.com/harvester/harvester/issues/3617, I checked /sys/module/kvm_amd/parameters/nested on my node and it's set to 1, so I should be able to get nested virtualization working, but it doesn't seem to be exposed inside the VM.
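
For reference, the checks behind that observation look roughly like this (run on the node and inside the guest respectively, results as described above):

# on the Harvester node
cat /sys/module/kvm_amd/parameters/nested    # prints 1
# inside the Debian guest
grep -c -E 'svm|vmx' /proc/cpuinfo           # prints 0, no virtualization flag exposed
ls -l /dev/kvm                               # No such file or directory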

connorkuehl commented 6 months ago

What's your host CPU, and could you also please share the VM configuration:

Harvester dashboard > Virtual machines > Select the 3 vertical dots on the far right of the VM table > Download YAML or Edit YAML + copy/paste here

edit: also what is the guest OS that you're running?

mjj29 commented 6 months ago

Guest OS is Debian 12.
uname: Linux testlnx2 6.1.0-18-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.76-1 (2024-02-01) x86_64 GNU/Linux
cpuinfo from inside the VM:

vendor_id       : AuthenticAMD
model name      : AMD EPYC-Milan Processor
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm rep_good nopl cpuid extd_apicid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core invpcid_single ssbd ibrs ibpb stibp vmmcall fsgsbase tsc_adjust bmi1 avx2 smep bmi2 invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr wbnoinvd arat umip pku ospke vaes vpclmulqdq rdpid arch_capabilities
bugs            : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass srso

cpuinfo from the host: AMD EPYC 7413

VM config:

apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  annotations:
    harvesterhci.io/vmRunStrategy: RerunOnFailure
    harvesterhci.io/volumeClaimTemplates: >-
      [{"metadata":{"name":"testlnx2-disk-0-wwlvm","annotations":{"harvesterhci.io/imageId":"default/image-t7cv6"}},"spec":{"accessModes":["ReadWriteMany"],"resources":{"requests":{"storage":"10Gi"}},"volumeMode":"Block","storageClassName":"longhorn-image-t7cv6"}}]
    kubevirt.io/latest-observed-api-version: v1
    kubevirt.io/storage-observed-api-version: v1
    network.harvesterhci.io/ips: '[]'
  creationTimestamp: '2024-02-28T13:20:20Z'
  finalizers:
    - kubevirt.io/virtualMachineControllerFinalize
    - harvesterhci.io/VMController.UnsetOwnerOfPVCs
  generation: 2
  labels:
    harvesterhci.io/creator: harvester
    harvesterhci.io/os: debian
  managedFields:
    - apiVersion: kubevirt.io/v1
      fieldsType: FieldsV1
      fieldsV1:
        f:metadata:
          f:annotations:
            f:kubevirt.io/latest-observed-api-version: {}
            f:kubevirt.io/storage-observed-api-version: {}
          f:finalizers:
            .: {}
            v:"kubevirt.io/virtualMachineControllerFinalize": {}
      manager: Go-http-client
      operation: Update
      time: '2024-02-28T13:20:20Z'
    - apiVersion: kubevirt.io/v1
      fieldsType: FieldsV1
      fieldsV1:
        f:metadata:
          f:annotations:
            .: {}
            f:harvesterhci.io/vmRunStrategy: {}
            f:harvesterhci.io/volumeClaimTemplates: {}
            f:network.harvesterhci.io/ips: {}
          f:finalizers:
            v:"harvesterhci.io/VMController.UnsetOwnerOfPVCs": {}
          f:labels:
            .: {}
            f:harvesterhci.io/creator: {}
            f:harvesterhci.io/os: {}
        f:spec:
          .: {}
          f:runStrategy: {}
          f:template:
            .: {}
            f:metadata:
              .: {}
              f:annotations:
                .: {}
                f:harvesterhci.io/sshNames: {}
              f:labels:
                .: {}
                f:harvesterhci.io/vmName: {}
            f:spec:
              .: {}
              f:accessCredentials: {}
              f:affinity: {}
              f:domain:
                .: {}
                f:cpu:
                  .: {}
                  f:cores: {}
                  f:sockets: {}
                  f:threads: {}
                f:devices:
                  .: {}
                  f:disks: {}
                  f:inputs: {}
                  f:interfaces: {}
                f:features:
                  .: {}
                  f:acpi:
                    .: {}
                    f:enabled: {}
                f:machine:
                  .: {}
                  f:type: {}
                f:resources:
                  .: {}
                  f:limits:
                    .: {}
                    f:cpu: {}
                    f:memory: {}
              f:evictionStrategy: {}
              f:hostname: {}
              f:networks: {}
              f:terminationGracePeriodSeconds: {}
              f:volumes: {}
      manager: harvester
      operation: Update
      time: '2024-02-28T13:20:37Z'
    - apiVersion: kubevirt.io/v1
      fieldsType: FieldsV1
      fieldsV1:
        f:status:
          .: {}
          f:conditions: {}
          f:created: {}
          f:desiredGeneration: {}
          f:observedGeneration: {}
          f:printableStatus: {}
          f:ready: {}
          f:volumeSnapshotStatuses: {}
      manager: Go-http-client
      operation: Update
      subresource: status
      time: '2024-02-28T13:20:59Z'
  name: testlnx2
  namespace: default
  resourceVersion: '31716'
  uid: e19d7675-922e-4e36-9288-13e2e22b51f0
spec:
  runStrategy: RerunOnFailure
  template:
    metadata:
      annotations:
        harvesterhci.io/sshNames: '["default/testmatj"]'
      creationTimestamp: null
      labels:
        harvesterhci.io/vmName: testlnx2
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: network.harvesterhci.io/mgmt
                    operator: In
                    values:
                      - 'true'
      architecture: amd64
      domain:
        cpu:
          cores: 4
          sockets: 1
          threads: 1
        devices:
          disks:
            - bootOrder: 1
              disk:
                bus: virtio
              name: disk-0
            - disk:
                bus: virtio
              name: cloudinitdisk
          inputs:
            - bus: usb
              name: tablet
              type: tablet
          interfaces:
            - bridge: {}
              macAddress: e2:9a:cc:3c:52:e7
              model: virtio
              name: default
        features:
          acpi:
            enabled: true
        machine:
          type: q35
        memory:
          guest: 8092Mi
        resources:
          limits:
            cpu: '4'
            memory: 8Gi
          requests:
            cpu: 250m
            memory: 5461Mi
      evictionStrategy: LiveMigrate
      hostname: testlnx2
      networks:
        - multus:
            networkName: default/defaultnet
          name: default
      terminationGracePeriodSeconds: 120
      volumes:
        - name: disk-0
          persistentVolumeClaim:
            claimName: testlnx2-disk-0-wwlvm
        - cloudInitNoCloud:
            networkDataSecretRef:
              name: testlnx2-nfqg4
            secretRef:
              name: testlnx2-nfqg4
          name: cloudinitdisk
status:
  conditions:
    - lastProbeTime: null
      lastTransitionTime: '2024-02-28T13:20:35Z'
      status: 'True'
      type: Ready
    - lastProbeTime: null
      lastTransitionTime: null
      status: 'True'
      type: LiveMigratable
    - lastProbeTime: '2024-02-28T13:20:59Z'
      lastTransitionTime: null
      status: 'True'
      type: AgentConnected
  created: true
  desiredGeneration: 2
  observedGeneration: 2
  printableStatus: Running
  ready: true
  volumeSnapshotStatuses:
    - enabled: false
      name: disk-0
      reason: 2 matching VolumeSnapshotClasses for longhorn-image-t7cv6
    - enabled: false
      name: cloudinitdisk
      reason: Snapshot is not supported for this volumeSource type [cloudinitdisk]
connorkuehl commented 6 months ago

Hmm, sorry, for some reason I thought that would give us the VirtualMachineInstance info if it was running.

Can you SSH to a node and collect the output of this:

$ kubectl get vmi testlnx2 -n default -o yaml

That should tell us which CPU model libvirt defined the domain with.

mjj29 commented 6 months ago
kind: VirtualMachineInstance
metadata:
  annotations:
    harvesterhci.io/sshNames: '["default/testmatj"]'
    kubevirt.io/latest-observed-api-version: v1
    kubevirt.io/storage-observed-api-version: v1
    kubevirt.io/vm-generation: "2"
  creationTimestamp: "2024-02-28T13:20:20Z"
  finalizers:
  - kubevirt.io/virtualMachineControllerFinalize
  - foregroundDeleteVirtualMachine
  - wrangler.cattle.io/VMIController.UnsetOwnerOfPVCs
  generation: 13
  labels:
    harvesterhci.io/vmName: testlnx2
    kubevirt.io/nodeName: iotperf09
  name: testlnx2
  namespace: default
  ownerReferences:
  - apiVersion: kubevirt.io/v1
    blockOwnerDeletion: true
    controller: true
    kind: VirtualMachine
    name: testlnx2
    uid: e19d7675-922e-4e36-9288-13e2e22b51f0
  resourceVersion: "34187"
  uid: 93faaa57-90ae-4f5d-b7bf-9d0a24d34170
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: network.harvesterhci.io/mgmt
            operator: In
            values:
            - "true"
  architecture: amd64
  domain:
    cpu:
      cores: 4
      model: host-model
      sockets: 1
      threads: 1
    devices:
      disks:
      - bootOrder: 1
        disk:
          bus: virtio
        name: disk-0
      - disk:
          bus: virtio
        name: cloudinitdisk
      inputs:
      - bus: usb
        name: tablet
        type: tablet
      interfaces:
      - bridge: {}
        model: virtio
        name: default
    features:
      acpi:
        enabled: true
    firmware:
      uuid: 27a0266a-862a-5f25-9cdc-f0465094dab5
    machine:
      type: q35
    memory:
      guest: 8092Mi
    resources:
      limits:
        cpu: "4"
        memory: 8Gi
      requests:
        cpu: 250m
        memory: 5461Mi
  evictionStrategy: LiveMigrate
  hostname: testlnx2
  networks:
  - multus:
      networkName: default/defaultnet
    name: default
  terminationGracePeriodSeconds: 120
  volumes:
  - name: disk-0
    persistentVolumeClaim:
      claimName: testlnx2-disk-0-wwlvm
  - cloudInitNoCloud:
      networkDataSecretRef:
        name: testlnx2-nfqg4
      secretRef:
        name: testlnx2-nfqg4
    name: cloudinitdisk
status:
  activePods:
    e473b8a8-69b0-4aa0-8bf1-ff669ad7e163: iotperf09
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2024-02-28T13:20:35Z"
    status: "True"
mjj29 commented 6 months ago

Hi there, any updates on this? It sounds like you think it should work, but it definitely doesn't for us.

connorkuehl commented 6 months ago

Yeah, I am expecting this to work.

harv1:~ # kubectl get vmi test2 -o yaml -n harvester-public | yq .spec.domain.cpu.model
host-model

and then in the guest, I have:

root@test2:~# kvm-ok
INFO: /dev/kvm exists
KVM acceleration can be used

I suppose one more useful data point could be seeing if it still doesn't work with host-passthrough.

Would you be willing to shut down your VM, then on the Virtual Machines page, click the 3 dots again, click "Edit YAML", and add "model: host-passthrough"?

The model field is probably missing under .spec.template.spec.domain.cpu, but it'll look something like this when you're done:

spec:
  runStrategy: RerunOnFailure
  template:
    metadata:
      annotations:
        harvesterhci.io/sshNames: '["harvester-public/ckuehl-suselaptop"]'
      creationTimestamp: null
      labels:
        harvesterhci.io/vmName: test2
    spec:
      affinity: {}
      domain:
        cpu:
          cores: 1
          model: host-passthrough    # <-------
          sockets: 1
          threads: 1
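
Once it's back up, a quick sanity check inside the guest would be something along these lines (kvm-ok comes from Debian's cpu-checker package, if you have it installed):

$ ls -l /dev/kvm
$ grep -c -E 'svm|vmx' /proc/cpuinfo
$ kvm-ok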
mjj29 commented 6 months ago

Yes, it looks like that's worked. I have /dev/kvm, svm is listed in my /proc/cpuinfo and libvirtd is started up.

Should I be able to set that in the UI, or will I need to edit all the relevant VMs by hand?

connorkuehl commented 6 months ago

No, the host-passthrough change was strictly a debugging/fact-finding exercise. It can cause a lot of strange and hard-to-diagnose bugs if the machine is live-migrated within the cluster, unless every machine in the cluster is identical in terms of CPU, microcode, etc. For that reason, I don't recommend using it.

At first I thought it was because the component responsible for generating vCPU definitions (i.e., host-model) didn't have a complete definition for Milan EPYC processors, but it seems to have the definition and it includes the svm and npt flags. I'm continuing to investigate.
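
(As a side note, one rough way to see which CPU features the cluster thinks your node can expose is to look at the KubeVirt node-labeller labels. Assuming CPU node discovery is enabled, something like this, with your node name from the VMI output above, should show an svm feature label:)

$ kubectl get node iotperf09 --show-labels | tr ',' '\n' | grep -iE 'cpu-feature.*svm|cpu-model'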

When using the default/host-model vCPU, could you collect the output of these commands?

The values below will be different on your system. For example, my virt-launcher pod is called virt-launcher-test3-s72ww, but you can find yours with kubectl get pods -n default | grep testlnx2.

And the value default_test3 from my examples would be default_testlnx2 for you.

harv1:~ # kubectl exec -it virt-launcher-test3-s72ww -n default -- virsh dumpxml default_test3

and

harv1:~ # kubectl exec -it virt-launcher-test3-s72ww -n default -- cat /var/log/libvirt/qemu/default_test3.log
mjj29 commented 6 months ago
<domain type='kvm' id='1'>
  <name>default_testlnx2</name>
  <uuid>27a0266a-862a-5f25-9cdc-f0465094dab5</uuid>
  <metadata>
    <kubevirt xmlns="http://kubevirt.io">
      <uid/>
    </kubevirt>
  </metadata>
  <memory unit='KiB'>8286208</memory>
  <currentMemory unit='KiB'>8286208</currentMemory>
  <vcpu placement='static'>4</vcpu>
  <iothreads>1</iothreads>
  <sysinfo type='smbios'>
    <system>
      <entry name='manufacturer'>KubeVirt</entry>
      <entry name='product'>None</entry>
      <entry name='uuid'>27a0266a-862a-5f25-9cdc-f0465094dab5</entry>
      <entry name='family'>KubeVirt</entry>
    </system>
  </sysinfo>
  <os>
    <type arch='x86_64' machine='pc-q35-7.1'>hvm</type>
    <smbios mode='sysinfo'/>
  </os>
  <features>
    <acpi/>
  </features>
  <cpu mode='custom' match='exact' check='full'>
    <model fallback='forbid'>EPYC-Milan</model>
    <vendor>AMD</vendor>
    <topology sockets='1' dies='1' cores='4' threads='1'/>
    <feature policy='require' name='x2apic'/>
    <feature policy='require' name='tsc-deadline'/>
    <feature policy='require' name='hypervisor'/>
    <feature policy='require' name='tsc_adjust'/>
    <feature policy='require' name='vaes'/>
    <feature policy='require' name='vpclmulqdq'/>
    <feature policy='require' name='spec-ctrl'/>
    <feature policy='require' name='stibp'/>
    <feature policy='require' name='arch-capabilities'/>
    <feature policy='require' name='ssbd'/>
    <feature policy='require' name='cmp_legacy'/>
    <feature policy='require' name='virt-ssbd'/>
    <feature policy='require' name='rdctl-no'/>
    <feature policy='require' name='skip-l1dfl-vmentry'/>
    <feature policy='require' name='mds-no'/>
    <feature policy='require' name='pschange-mc-no'/>
    <feature policy='disable' name='erms'/>
    <feature policy='disable' name='fsrm'/>
    <feature policy='disable' name='svm'/>
    <feature policy='require' name='topoext'/>
    <feature policy='disable' name='npt'/>
    <feature policy='disable' name='nrip-save'/>
    <feature policy='disable' name='svme-addr-chk'/>
  </cpu>
  <clock offset='utc'/>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>destroy</on_crash>
  <devices>
    <emulator>/usr/bin/qemu-system-x86_64</emulator>
    <disk type='block' device='disk' model='virtio-non-transitional'>
      <driver name='qemu' type='raw' cache='none' error_policy='stop' io='native' discard='unmap'/>
      <source dev='/dev/disk-0' index='2'/>
      <backingStore/>
      <target dev='vda' bus='virtio'/>
      <boot order='1'/>
      <alias name='ua-disk-0'/>
      <address type='pci' domain='0x0000' bus='0x08' slot='0x00' function='0x0'/>
    </disk>
    <disk type='file' device='disk' model='virtio-non-transitional'>
      <driver name='qemu' type='raw' cache='none' error_policy='stop' discard='unmap'/>
      <source file='/var/run/kubevirt-ephemeral-disks/cloud-init-data/default/testlnx2/noCloud.iso' index='1'/>
      <backingStore/>
      <target dev='vdb' bus='virtio'/>
      <alias name='ua-cloudinitdisk'/>
      <address type='pci' domain='0x0000' bus='0x09' slot='0x00' function='0x0'/>
    </disk>
    <controller type='usb' index='0' model='qemu-xhci'>
      <alias name='usb'/>
      <address type='pci' domain='0x0000' bus='0x05' slot='0x00' function='0x0'/>
    </controller>
    <controller type='scsi' index='0' model='virtio-non-transitional'>
      <alias name='scsi0'/>
      <address type='pci' domain='0x0000' bus='0x06' slot='0x00' function='0x0'/>
    </controller>
    <controller type='virtio-serial' index='0' model='virtio-non-transitional'>
      <alias name='virtio-serial0'/>
      <address type='pci' domain='0x0000' bus='0x07' slot='0x00' function='0x0'/>
    </controller>
    <controller type='sata' index='0'>
      <alias name='ide'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x1f' function='0x2'/>
    </controller>
    <controller type='pci' index='0' model='pcie-root'>
      <alias name='pcie.0'/>
    </controller>
    <controller type='pci' index='1' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='1' port='0x10'/>
      <alias name='pci.1'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0' multifunction='on'/>
    </controller>
    <controller type='pci' index='2' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='2' port='0x11'/>
      <alias name='pci.2'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x1'/>
    </controller>
    <controller type='pci' index='3' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='3' port='0x12'/>
      <alias name='pci.3'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x2'/>
    </controller>
    <controller type='pci' index='4' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='4' port='0x13'/>
      <alias name='pci.4'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x3'/>
    </controller>
    <controller type='pci' index='5' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='5' port='0x14'/>
      <alias name='pci.5'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x4'/>
    </controller>
    <controller type='pci' index='6' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='6' port='0x15'/>
      <alias name='pci.6'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x5'/>
    </controller>
    <controller type='pci' index='7' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='7' port='0x16'/>
      <alias name='pci.7'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x6'/>
    </controller>
    <controller type='pci' index='8' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='8' port='0x17'/>
      <alias name='pci.8'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x7'/>
    </controller>
    <controller type='pci' index='9' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='9' port='0x18'/>
      <alias name='pci.9'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0' multifunction='on'/>
    </controller>
    <controller type='pci' index='10' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='10' port='0x19'/>
      <alias name='pci.10'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x1'/>
    </controller>
    <controller type='pci' index='11' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='11' port='0x1a'/>
      <alias name='pci.11'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x2'/>
    </controller>
    <interface type='ethernet'>
      <mac address='e2:9a:cc:3c:52:e7'/>
      <target dev='tap37a8eec1ce1' managed='no'/>
      <model type='virtio-non-transitional'/>
      <mtu size='1500'/>
      <alias name='ua-default'/>
      <rom enabled='no'/>
      <address type='pci' domain='0x0000' bus='0x01' slot='0x00' function='0x0'/>
    </interface>
    <serial type='unix'>
      <source mode='bind' path='/var/run/kubevirt-private/8c40715f-c707-484b-b410-63a8004a7bcb/virt-serial0'/>
      <log file='/var/run/kubevirt-private/8c40715f-c707-484b-b410-63a8004a7bcb/virt-serial0-log' append='on'/>
      <target type='isa-serial' port='0'>
        <model name='isa-serial'/>
      </target>
      <alias name='serial0'/>
    </serial>
    <console type='unix'>
      <source mode='bind' path='/var/run/kubevirt-private/8c40715f-c707-484b-b410-63a8004a7bcb/virt-serial0'/>
      <log file='/var/run/kubevirt-private/8c40715f-c707-484b-b410-63a8004a7bcb/virt-serial0-log' append='on'/>
      <target type='serial' port='0'/>
      <alias name='serial0'/>
    </console>
    <channel type='unix'>
      <source mode='bind' path='/var/run/kubevirt-private/libvirt/qemu/channel/target/domain-1-default_testlnx2/org.qemu.guest_agent.0'/>
      <target type='virtio' name='org.qemu.guest_agent.0' state='connected'/>
      <alias name='channel0'/>
      <address type='virtio-serial' controller='0' bus='0' port='1'/>
    </channel>
    <input type='tablet' bus='usb'>
      <alias name='ua-tablet'/>
      <address type='usb' bus='0' port='1'/>
    </input>
    <input type='mouse' bus='ps2'>
      <alias name='input1'/>
    </input>
    <input type='keyboard' bus='ps2'>
      <alias name='input2'/>
    </input>
    <graphics type='vnc' socket='/var/run/kubevirt-private/8c40715f-c707-484b-b410-63a8004a7bcb/virt-vnc'>
      <listen type='socket' socket='/var/run/kubevirt-private/8c40715f-c707-484b-b410-63a8004a7bcb/virt-vnc'/>
    </graphics>
    <audio id='1' type='none'/>
    <video>
      <model type='vga' vram='16384' heads='1' primary='yes'/>
      <alias name='video0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x0'/>
    </video>
    <memballoon model='virtio-non-transitional' freePageReporting='on'>
      <stats period='10'/>
      <alias name='balloon0'/>
      <address type='pci' domain='0x0000' bus='0x0a' slot='0x00' function='0x0'/>
    </memballoon>
  </devices>
</domain>

But I get "permission denied" on the log file.

mjj29 commented 6 months ago

Hey, was that helpful?

connorkuehl commented 6 months ago

Yes, but since the domain XML seems to have the svm and npt flags, I'd really like to see the QEMU logs to see what's going on. The permission denied error is unexpected. Are you running these commands as root logged into a Harvester node?

Failing that, we could try to make do with the QEMU command line (don't forget to substitute your pod name instead of virt-launcher-abc-7rjn4):

harvester-zzz8w:~ # kubectl exec -it virt-launcher-abc-7rjn4 -n default -- pgrep qemu
78
harvester-zzz8w:~ # kubectl exec -it virt-launcher-abc-7rjn4 -n default -- cat /proc/78/cmdline
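
(/proc/<pid>/cmdline is NUL-separated, so if it comes out as one long run-together line, piping it through tr should make it readable; a sketch, with your pod name and PID substituted:)

$ kubectl exec -it virt-launcher-abc-7rjn4 -n default -- sh -c "tr '\0' '\n' < /proc/78/cmdline"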
mjj29 commented 6 months ago

Here's the command line; let's try to access the logs.

/usr/bin/qemu-system-x86_64
-name
guest=default_testlnx2,debug-threads=on
-S
-object
{"qom-type":"secret","id":"masterKey0","format":"raw","file":"/var/run/kubevirt-private/libvirt/qemu/lib/domain-1-default_testlnx2/master-key.aes"}
-machine
pc-q35-7.1,usb=off,dump-guest-core=off,memory-backend=pc.ram
-accel
kvm
-cpu
EPYC-Milan,x2apic=on,tsc-deadline=on,hypervisor=on,tsc-adjust=on,vaes=on,vpclmulqdq=on,spec-ctrl=on,stibp=on,arch-capabilities=on,ssbd=on,cmp-legacy=on,virt-ssbd=on,rdctl-no=on,skip-l1dfl-vmentry=on,mds-no=on,pschange-mc-no=on,erms=off,fsrm=off
-m
8092
-object
{"qom-type":"memory-backend-ram","id":"pc.ram","size":8485076992}
-overcommit
mem-lock=off
-smp
4,sockets=1,dies=1,cores=4,threads=1
-object
{"qom-type":"iothread","id":"iothread1"}
-uuid
27a0266a-862a-5f25-9cdc-f0465094dab5
-smbios
type=1,manufacturer=KubeVirt,product=None,uuid=27a0266a-862a-5f25-9cdc-f0465094dab5,family=KubeVirt
-no-user-config
-nodefaults
-chardev
socket,id=charmonitor,fd=20,server=on,wait=off
-mon
chardev=charmonitor,id=monitor,mode=control
-rtc
base=utc
-no-shutdown
-boot
strict=on
-device
{"driver":"pcie-root-port","port":16,"chassis":1,"id":"pci.1","bus":"pcie.0","multifunction":true,"addr":"0x2"}
-device
{"driver":"pcie-root-port","port":17,"chassis":2,"id":"pci.2","bus":"pcie.0","addr":"0x2.0x1"}
-device
{"driver":"pcie-root-port","port":18,"chassis":3,"id":"pci.3","bus":"pcie.0","addr":"0x2.0x2"}
-device
{"driver":"pcie-root-port","port":19,"chassis":4,"id":"pci.4","bus":"pcie.0","addr":"0x2.0x3"}
-device
{"driver":"pcie-root-port","port":20,"chassis":5,"id":"pci.5","bus":"pcie.0","addr":"0x2.0x4"}
-device
{"driver":"pcie-root-port","port":21,"chassis":6,"id":"pci.6","bus":"pcie.0","addr":"0x2.0x5"}
-device
{"driver":"pcie-root-port","port":22,"chassis":7,"id":"pci.7","bus":"pcie.0","addr":"0x2.0x6"}
-device
{"driver":"pcie-root-port","port":23,"chassis":8,"id":"pci.8","bus":"pcie.0","addr":"0x2.0x7"}
-device
{"driver":"pcie-root-port","port":24,"chassis":9,"id":"pci.9","bus":"pcie.0","multifunction":true,"addr":"0x3"}
-device
{"driver":"pcie-root-port","port":25,"chassis":10,"id":"pci.10","bus":"pcie.0","addr":"0x3.0x1"}
-device
{"driver":"pcie-root-port","port":26,"chassis":11,"id":"pci.11","bus":"pcie.0","addr":"0x3.0x2"}
-device
{"driver":"qemu-xhci","id":"usb","bus":"pci.5","addr":"0x0"}
-device
{"driver":"virtio-scsi-pci-non-transitional","id":"scsi0","bus":"pci.6","addr":"0x0"}
-device
{"driver":"virtio-serial-pci-non-transitional","id":"virtio-serial0","bus":"pci.7","addr":"0x0"}
-blockdev
{"driver":"host_device","filename":"/dev/disk-0","aio":"native","node-name":"libvirt-2-storage","cache":{"direct":true,"no-flush":false},"auto-read-only":true,"discard":"unmap"}
-blockdev
{"node-name":"libvirt-2-format","read-only":false,"discard":"unmap","cache":{"direct":true,"no-flush":false},"driver":"raw","file":"libvirt-2-storage"}
-device
{"driver":"virtio-blk-pci-non-transitional","bus":"pci.8","addr":"0x0","drive":"libvirt-2-format","id":"ua-disk-0","bootindex":1,"write-cache":"on","werror":"stop","rerror":"stop"}
-blockdev
{"driver":"file","filename":"/var/run/kubevirt-ephemeral-disks/cloud-init-data/default/testlnx2/noCloud.iso","node-name":"libvirt-1-storage","cache":{"direct":true,"no-flush":false},"auto-read-only":true,"discard":"unmap"}
-blockdev
{"node-name":"libvirt-1-format","read-only":false,"discard":"unmap","cache":{"direct":true,"no-flush":false},"driver":"raw","file":"libvirt-1-storage"}
-device
{"driver":"virtio-blk-pci-non-transitional","bus":"pci.9","addr":"0x0","drive":"libvirt-1-format","id":"ua-cloudinitdisk","write-cache":"on","werror":"stop","rerror":"stop"}
-netdev
tap,fd=21,vhost=on,vhostfd=23,id=hostua-default
-device
{"driver":"virtio-net-pci-non-transitional","host_mtu":1500,"netdev":"hostua-default","id":"ua-default","mac":"e2:9a:cc:3c:52:e7","bus":"pci.1","addr":"0x0","romfile":""}
-add-fd
set=0,fd=19,opaque=serial0-log
-chardev
socket,id=charserial0,fd=17,server=on,wait=off,logfile=/dev/fdset/0,logappend=on
-device
{"driver":"isa-serial","chardev":"charserial0","id":"serial0","index":0}
-chardev
socket,id=charchannel0,fd=18,server=on,wait=off
-device
{"driver":"virtserialport","bus":"virtio-serial0.0","nr":1,"chardev":"charchannel0","id":"channel0","name":"org.qemu.guest_agent.0"}
-device
{"driver":"usb-tablet","id":"ua-tablet","bus":"usb.0","port":"1"}
-audiodev
{"id":"audio1","driver":"none"}
-vnc
vnc=unix:/var/run/kubevirt-private/8c40715f-c707-484b-b410-63a8004a7bcb/virt-vnc,audiodev=audio1
-device
{"driver":"VGA","id":"video0","vgamem_mb":16,"bus":"pcie.0","addr":"0x1"}
-device
{"driver":"virtio-balloon-pci-non-transitional","id":"balloon0","free-page-reporting":true,"bus":"pci.10","addr":"0x0"}
-sandbox
on,obsolete=deny,elevateprivileges=deny,spawn=deny,resourcecontrol=deny
-msg
timestamp=on
mjj29 commented 6 months ago

For the logs, you gave me a kubectl exec command to run inside the pod, which gives permission denied no matter where kubectl is running. I do have root on the node, but where would the log file live outside of kubectl?

mjj29 commented 6 months ago

Hi there, any updates?

connorkuehl commented 6 months ago

Yes, something interesting is that the svm and npt flags are present in the libvirt domain XML, but are not included in the QEMU command line:

  <cpu mode='custom' match='exact' check='full'>
    <model fallback='forbid'>EPYC-Milan</model>
    <vendor>AMD</vendor>
    <topology sockets='1' dies='1' cores='4' threads='1'/>
    <feature policy='require' name='x2apic'/>
    <feature policy='require' name='tsc-deadline'/>
    <feature policy='require' name='hypervisor'/>
    <feature policy='require' name='tsc_adjust'/>
    <feature policy='require' name='vaes'/>
    <feature policy='require' name='vpclmulqdq'/>
    <feature policy='require' name='spec-ctrl'/>
    <feature policy='require' name='stibp'/>
    <feature policy='require' name='arch-capabilities'/>
    <feature policy='require' name='ssbd'/>
    <feature policy='require' name='cmp_legacy'/>
    <feature policy='require' name='virt-ssbd'/>
    <feature policy='require' name='rdctl-no'/>
    <feature policy='require' name='skip-l1dfl-vmentry'/>
    <feature policy='require' name='mds-no'/>
    <feature policy='require' name='pschange-mc-no'/>
    <feature policy='disable' name='erms'/>
    <feature policy='disable' name='fsrm'/>
    <feature policy='disable' name='svm'/>
    <feature policy='require' name='topoext'/>
    <feature policy='disable' name='npt'/>
    <feature policy='disable' name='nrip-save'/>
    <feature policy='disable' name='svme-addr-chk'/>
  </cpu>
EPYC-Milan,x2apic=on,tsc-deadline=on,hypervisor=on,tsc-adjust=on,vaes=on,vpclmulqdq=on,spec-ctrl=on,stibp=on,arch-capabilities=on,ssbd=on,cmp-legacy=on,virt-ssbd=on,rdctl-no=on,skip-l1dfl-vmentry=on,mds-no=on,pschange-mc-no=on,erms=off,fsrm=off

I haven't had a chance to investigate further, but I'll update here when I do.

Which version of Harvester are you using? If you have the appetite to set up a Harvester node on the recently released 1.3.0 and repeat the experiment, that would be a valuable data point, as it ships with newer versions of KubeVirt (+ libvirt and QEMU).

If it works out of the box, then I suspect there's possibly a bug in the libvirt that we ship in the KubeVirt operator container image. Or, if libvirt "negotiates" capabilities with the QEMU build, I wonder if there's some sort of disagreement before it renders the command line; but I'm just speculating at this point.
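
One way to probe that negotiation would be to ask libvirt inside the virt-launcher pod what it thinks this QEMU build supports for the guest CPU; a sketch, with your pod name substituted as before:

$ kubectl exec -it virt-launcher-abc-7rjn4 -n default -- virsh domcapabilities | grep -iE 'svm|npt|EPYC'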

mjj29 commented 6 months ago

I'm on the final preview release of 1.3 because the earlier releases wouldn't install on my hardware.

bk201 commented 6 months ago

@mjj29 nested virtualization is not supported due to performance issues.

mjj29 commented 6 months ago

@mjj29 nested virtualization is not supported due to performance issues.

That's a surprising comment, given that connorkuehl seems to have it working. Can you go into more detail? What about using host-passthrough, where it does seem to work for me? (I have a cluster of 10 identical 96-core machines to run this on, so there are no worries about different CPU microcode or anything.)

bk201 commented 6 months ago

Hi @mjj29, sorry if it wasn't clear. Nested guests work, and we use them in development, but we can't recommend the setup because of performance issues. I see you have beefy machines, but not everyone has them :).

Harvester uses SLES as the base OS along with its packages; from the enterprise-service perspective (please check the limitations), we can't support this use case.

mjj29 commented 6 months ago

OK, that's clearer. We do have beefy machines, and we do need nested virtualization. Given that your colleague seems to believe it should work (and I assume you're at least somewhat interested in it working), can you give me any help working out why it isn't working for me? I'll definitely bear the support status in mind. This will be part of our decision on whether we migrate from VMware to Harvester, or to Xen (which does support it).

github-actions[bot] commented 4 months ago

This repository uses an automated workflow to automatically label issues which have not had any activity (commit/comment/label) for 60 days. This helps us manage the community issues better. If the issue is still relevant, please add a comment to the issue so the workflow can remove the label and we know it is still valid. If it is no longer relevant (or possibly fixed in the latest release), the workflow will automatically close the issue in 14 days. Thank you for your contributions.