Can you provide the kubevirt vm yaml, and possibly the feature gate configuration that you are running as well?
Name:         tacohen-habana-nnwm-c06-vm
Namespace:    habana
Labels:       habana.ai/is-vmi=true
              habana.ai/user=tacohen
              kubevirt.io/os=linux
Annotations:  container.apparmor.security.beta.kubernetes.io/compute: unconfined
              kubevirt.io/latest-observed-api-version: v1
              kubevirt.io/storage-observed-api-version: v1alpha3
API Version:  kubevirt.io/v1
Kind:         VirtualMachine
Metadata:
  Creation Timestamp:  2022-09-13T16:42:35Z
  Generation:          1
  Managed Fields:
    API Version:  kubevirt.io/v1alpha3
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          f:kubevirt.io/latest-observed-api-version:
          f:kubevirt.io/storage-observed-api-version:
      f:status:
        .:
        f:volumeSnapshotStatuses:
    Manager:      Go-http-client
    Operation:    Update
    Time:         2022-09-13T16:42:36Z
    API Version:  kubevirt.io/v1alpha3
    Fields Type:  FieldsV1
    fieldsV1:
      f:status:
        f:conditions:
        f:created:
        f:printableStatus:
        f:ready:
    Manager:      Go-http-client
    Operation:    Update
    Subresource:  status
    Time:         2022-09-13T16:43:32Z
  Resource Version:  99877899
  UID:               8dcd1d32-3fc0-4f5e-9353-29ca8ffc7dcb
Spec:
  Run Strategy:  RerunOnFailure
  Template:
    Metadata:
      Annotations:
        container.apparmor.security.beta.kubernetes.io/compute: unconfined
        habana.ai/hlctl-version: 1.2.0
        habana.ai/qa.nightly: false
        habana.ai/schedulable: false
        pod-reaper/max-duration: 4h
      Creation Timestamp:
apiVersion: kubevirt.io/v1
kind: KubeVirt
metadata:
  name: kubevirt
  namespace: kubevirt
spec:
  certificateRotateStrategy: {}
  configuration:
    developerConfiguration:
      featureGates:
    permittedHostDevices:
      pciHostDevices:
  infra:
    nodePlacement:
      nodeSelector:
        habana.ai/services: "true"
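The featureGates and pciHostDevices lists were truncated in the paste above. For reference only, a minimal sketch of how those parts of the KubeVirt CR are usually filled in might look like the following; the gate names, PCI vendor/device ID, and resource name are illustrative assumptions, not the reporter's actual values.

  # Hypothetical example only: gate names and PCI IDs are placeholders,
  # not the configuration from this report.
  spec:
    configuration:
      developerConfiguration:
        featureGates:
          - CPUManager        # needed for dedicatedCpuPlacement
          - NUMA              # needed for guest NUMA mapping
      permittedHostDevices:
        pciHostDevices:
          - pciVendorSelector: "1da3:1020"   # placeholder vendor:device ID
            resourceName: habana.ai/gaudi    # placeholder resource name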
@talcoh2x can you please try running the same with

spec:
  domain:
    cpu:
      dedicatedCpuPlacement: true
      isolateEmulatorThread: true

This will isolate the vCPUs from the rest of the processes in the compute container, so they should not be interrupted.
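For context, a minimal VirtualMachineInstance sketch with those two fields in place might look like the example below. The name, core count, and memory request are assumptions for illustration; dedicatedCpuPlacement also relies on the nodes running the kubelet CPU Manager with the static policy and the CPUManager feature gate being enabled.

  # Illustrative sketch only; name, cores, and memory are placeholder values.
  apiVersion: kubevirt.io/v1
  kind: VirtualMachineInstance
  metadata:
    name: pinned-cpu-example
  spec:
    domain:
      cpu:
        cores: 8
        dedicatedCpuPlacement: true   # pin vCPUs to dedicated host CPUs
        isolateEmulatorThread: true   # give the QEMU emulator thread its own pinned CPU
      resources:
        requests:
          memory: 16Gi
      devices: {}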
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
/lifecycle rotten
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
/close
@kubevirt-bot: Closing this issue.
After updating KubeVirt from version 0.47.1 we can't work at all: creating VMs fails, either immediately or after about a minute, with "watchdog: BUG: soft lockup - CPU#75 stuck".
I did a couple of tests: 0.47.1 works well, and it starts to fail from 0.48.1. My suspect is https://github.com/kubevirt/kubevirt/pull/6162. We run VMs that span two NUMA nodes, so maybe we have "starvation" in such cases?
Note: I tested with Kubernetes 1.19, 1.20, 1.21, and 1.23, with the same results.
Server configuration: 1 TB of memory, 156 CPUs, and roughly 500 GB of it assigned to VMs.
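Since the report mentions VMs spanning two NUMA nodes, a sketch of how such a guest is typically described in KubeVirt may help frame the discussion. The core count, hugepage size, and memory below are assumptions, not the reporter's actual configuration; guest NUMA mapping requires dedicatedCpuPlacement and hugepages.

  # Hypothetical dual-NUMA guest; CPU count, hugepage size, and memory are placeholders.
  apiVersion: kubevirt.io/v1
  kind: VirtualMachineInstance
  metadata:
    name: numa-passthrough-example
  spec:
    domain:
      cpu:
        cores: 64
        dedicatedCpuPlacement: true      # required for NUMA passthrough
        numa:
          guestMappingPassthrough: {}    # mirror the host NUMA topology in the guest
      memory:
        hugepages:
          pageSize: 1Gi                  # NUMA passthrough requires hugepages
      resources:
        requests:
          memory: 256Gi
      devices: {}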
Environment:
KubeVirt version (use virtctl version): N/A
Kubernetes version (use kubectl version): N/A
Kernel (uname -a): N/A