kubevirt / kubevirt

Kubernetes Virtualization API and runtime in order to define and manage virtual machines.
https://kubevirt.io
Apache License 2.0

CPU stuck on 100% #8443

Closed: talcoh2x closed this issue 1 year ago

talcoh2x commented 2 years ago

After updating KubeVirt from version 0.47.1 we can't work at all. Creating VMs gives us, either immediately or after about 1 minute, "watchdog: BUG: soft lockup - CPU#75 stuck".

I did a couple of tests: 0.47.1 works well, and it starts to fail from 0.48.1. My suspect is https://github.com/kubevirt/kubevirt/pull/6162. We run VMs spanning dual NUMA nodes, so maybe we get "starvation" in such cases?

Note: I tested with k8s 1.19, 1.20, 1.21, and 1.23, with the same results.

Server configuration: 1 TB memory, 156 CPUs, and ~500 GB assigned to VMs.

(two screenshots attached)

Environment:

usrbinkat commented 2 years ago

Can you provide the KubeVirt VM YAML, and possibly the feature gate configuration that you are running as well?

talcoh2x commented 2 years ago

Name:         tacohen-habana-nnwm-c06-vm
Namespace:    habana
Labels:       habana.ai/is-vmi=true
              habana.ai/user=tacohen
              kubevirt.io/os=linux
Annotations:  container.apparmor.security.beta.kubernetes.io/compute: unconfined
              kubevirt.io/latest-observed-api-version: v1
              kubevirt.io/storage-observed-api-version: v1alpha3
API Version:  kubevirt.io/v1
Kind:         VirtualMachine
Metadata:
  Creation Timestamp:  2022-09-13T16:42:35Z
  Generation:          1
  Managed Fields:
    API Version:  kubevirt.io/v1alpha3
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          f:kubevirt.io/latest-observed-api-version:
          f:kubevirt.io/storage-observed-api-version:
      f:status:
        .:
        f:volumeSnapshotStatuses:
    Manager:      Go-http-client
    Operation:    Update
    Time:         2022-09-13T16:42:36Z
    API Version:  kubevirt.io/v1alpha3
    Fields Type:  FieldsV1
    fieldsV1:
      f:status:
        f:conditions:
        f:created:
        f:printableStatus:
        f:ready:
    Manager:      Go-http-client
    Operation:    Update
    Subresource:  status
    Time:         2022-09-13T16:43:32Z
  Resource Version:  99877899
  UID:               8dcd1d32-3fc0-4f5e-9353-29ca8ffc7dcb
Spec:
  Run Strategy:  RerunOnFailure
  Template:
    Metadata:
      Annotations:
        container.apparmor.security.beta.kubernetes.io/compute:  unconfined
        habana.ai/hlctl-version:                                  1.2.0
        habana.ai/qa.nightly:                                     false
        habana.ai/schedulable:                                    false
        pod-reaper/max-duration:                                  4h
      Creation Timestamp:
      Labels:
        habana.ai/schedulable:  false
        habana.ai/user:         tacohen
        Service:                tacohen-habana-nnwm-c06-service
        Vmi:                    tacohen-habana-nnwm-c06-vm
    Spec:
      Affinity:
        Node Affinity:
          Required During Scheduling Ignored During Execution:
            Node Selector Terms:
              Match Expressions:
                Key:       kubernetes.io/hostname
                Operator:  In
                Values:
                  hls2-srv65-c06e-kfs
        Pod Anti Affinity:
          Required During Scheduling Ignored During Execution:
            Label Selector:
              Match Expressions:
                Key:       habana.ai/is-container
                Operator:  In
                Values:
                  true
            Topology Key:  kubernetes.io/hostname
      Domain:
        Cpu:
          Cores:                    39
          Dedicated Cpu Placement:  true
          Model:                    host-passthrough
          Sockets:                  2
          Threads:                  2
        Devices:
          Block Multi Queue:  true
          Disks:
            Dedicated IO Thread:  true
            Disk:
              Bus:  virtio
            Name:   localdisk
            Disk:
              Bus:  virtio
            Name:   cloud-init
            Disk:
              Bus:   virtio
            Name:    app-config-disk
            Serial:  kubedisk
          Filesystems:
            Name:  disk0
            Virtiofs:
            Name:  disk1
            Virtiofs:
          Gpus:
            Device Name:  habana.ai/gaudi
            Name:         gpu0
            Device Name:  habana.ai/gaudi
            Name:         gpu1
            Device Name:  habana.ai/gaudi
            Name:         gpu2
            Device Name:  habana.ai/gaudi
            Name:         gpu3
            Device Name:  habana.ai/gaudi
            Name:         gpu4
            Device Name:  habana.ai/gaudi
            Name:         gpu5
            Device Name:  habana.ai/gaudi
            Name:         gpu6
            Device Name:  habana.ai/gaudi
            Name:         gpu7
          Interfaces:
            Name:  sriov-net
            Sriov:
            Name:  sriov-net1
            Sriov:
            Name:  sriov-net2
            Sriov:
            Name:  sriov-net3
            Sriov:
            Name:  sriov-net4
            Sriov:
          Network Interface Multiqueue:  true
        Io Threads Policy:  auto
        Machine:
          Type:  q35
        Memory:
          Guest:  480Gi
        Resources:
          Requests:
            Memory:  520Gi
      Networks:
        Multus:
          Network Name:  habana/sriov-net
        Name:            sriov-net
        Multus:
          Network Name:  habana/sriov-net-x5
        Name:            sriov-net1
        Multus:
          Network Name:  habana/sriov-net-x5
        Name:            sriov-net2
        Multus:
          Network Name:  habana/sriov-net-x5
        Name:            sriov-net3
        Multus:
          Network Name:  habana/sriov-net-x5
        Name:            sriov-net4
      Node Selector:
        habana.ai/qa.nightly:   false
        habana.ai/schedulable:  false
      Scheduler Name:                    most-allocated-scheduler
      Termination Grace Period Seconds:  0
      Volumes:
        Config Map:
          Name:  kube-config
        Name:    app-config-disk
        Name:    localdisk
        Persistent Volume Claim:
          Claim Name:  tacohen-habana-nnwm-c06-pvc
        Name:          disk0
        Persistent Volume Claim:
          Claim Name:  ccache-volume-pvc
        Name:          disk1
        Persistent Volume Claim:
          Claim Name:  hostname-volume-pvc
        Cloud Init No Cloud:
          User Data:  #cloud-config
                      hostname: tacohen-habana-nnwm-c06-vm

talcoh2x commented 2 years ago

apiVersion: kubevirt.io/v1
kind: KubeVirt
metadata:
  name: kubevirt
  namespace: kubevirt
spec:
  certificateRotateStrategy: {}
  configuration:
    developerConfiguration:
      featureGates:
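Purely for illustration, since the reporter's actual featureGates list is cut off above: a KubeVirt CR of this shape with gates filled in might look roughly like the following. The gates shown (CPUManager and NUMA, which relate to dedicated CPU placement and NUMA passthrough) are an assumption for the example, not the reporter's configuration.

apiVersion: kubevirt.io/v1
kind: KubeVirt
metadata:
  name: kubevirt
  namespace: kubevirt
spec:
  certificateRotateStrategy: {}
  configuration:
    developerConfiguration:
      featureGates:
        # illustrative values only; not the reporter's actual feature gates
        - CPUManager
        - NUMA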

vladikr commented 2 years ago

@talcoh2x can you please try running the same with

spec:
  domain:
    cpu:
      dedicatedCpuPlacement: true
      isolateEmulatorThread: true

This will isolate the vCPUs from the rest of the processes in the compute container, so they should not be interrupted.
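For a VirtualMachine object like the one posted above, these fields sit under spec.template.spec.domain.cpu. A minimal sketch (most fields omitted; the names and CPU topology are copied from the reporter's spec):

apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: tacohen-habana-nnwm-c06-vm
  namespace: habana
spec:
  runStrategy: RerunOnFailure
  template:
    spec:
      domain:
        cpu:
          cores: 39
          sockets: 2
          threads: 2
          model: host-passthrough
          dedicatedCpuPlacement: true
          # suggested addition: pin the QEMU emulator thread to its own dedicated CPU
          # so it does not compete with (and interrupt) the pinned vCPUs
          isolateEmulatorThread: true

If I read the dedicated-CPU docs correctly, isolateEmulatorThread requests one additional dedicated host CPU per VM on top of the vCPUs, so the node needs that extra headroom.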

kubevirt-bot commented 1 year ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

/lifecycle stale

kubevirt-bot commented 1 year ago

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

/lifecycle rotten

kubevirt-bot commented 1 year ago

Rotten issues close after 30d of inactivity. Reopen the issue with /reopen. Mark the issue as fresh with /remove-lifecycle rotten.

/close

kubevirt-bot commented 1 year ago

@kubevirt-bot: Closing this issue.

In response to [this](https://github.com/kubevirt/kubevirt/issues/8443#issuecomment-1431139283):

> Rotten issues close after 30d of inactivity.
> Reopen the issue with `/reopen`.
> Mark the issue as fresh with `/remove-lifecycle rotten`.
>
> /close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.