maheshd2 opened this issue 3 years ago
Can you provide your manifest configuration (or a link to your git repo)? Also, can you share the error logs that reflect that?
@yrobla I have captured the issue details in the following git issue: https://github.com/shashiksingh/rhocp-castor/issues/143
I get a 404 error there.
Please note that vfio_pci should have been loaded if the policy's deviceType is vfio-pci. We suspect you have an incorrectly configured SriovNetworkNodePolicy.
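For comparison, here is a minimal sketch of a SriovNetworkNodePolicy with deviceType set to vfio-pci. The resourceName, nodeSelector, numVfs, and pfNames values are placeholders for illustration, not values from this cluster:

# Minimal sketch of a vfio-pci policy; all selector values below are
# placeholders and must be adapted to the actual NICs and nodes.
cat <<'EOF' | oc apply -f -
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: policy-intel-vfio
  namespace: openshift-sriov-network-operator
spec:
  resourceName: intel_vfio
  nodeSelector:
    feature.node.kubernetes.io/network-sriov.capable: "true"
  numVfs: 4
  nicSelector:
    pfNames: ["ens1f0"]        # placeholder PF name
  deviceType: vfio-pci         # should cause the operator to load vfio_pci
EOF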
The repository you linked to is private, so not everybody has access. Please also help us by not opening the same issue in two different places.
@novacain1 I will have a look at the configurations and try again. Thanks.
For the record, I'm adding the failure messages here.
We see this issue when we deploy the CU workload manifest (with a dummy OS image instead of the actual image). Currently the error occurs with the Intel cards; earlier we saw the same issue with the Mellanox cards, after which we switched to the Intel cards.
2021-03-11T05:37:42.843485209+00:00 stderr F {"component":"virt-launcher","level":"error","msg":"unsupported configuration: host doesn't support passthrough of host PCI devices","pos":"qemuHostdevPreparePCIDevicesCheckSupport:187","subcomponent":"libvirt","thread":"22","timestamp":"2021-03-11T05:37:42.843000Z"}
2021-03-11T05:37:42.843853049+00:00 stderr F {"component":"virt-launcher","level":"error","msg":"unsupported configuration: pci backend driver 'default' is not supported","pos":"virHostdevGetPCIHostDevice:253","subcomponent":"libvirt","thread":"22","timestamp":"2021-03-11T05:37:42.843000Z"}
2021-03-11T05:37:42.843860735+00:00 stderr F {"component":"virt-launcher","level":"error","msg":"Failed to allocate PCI device list: unsupported configuration: pci backend driver 'default' is not supported","pos":"virHostdevReAttachPCIDevices:1089","subcomponent":"libvirt","thread":"22","timestamp":"2021-03-11T05:37:42.843000Z"}
2021-03-11T05:37:42.844278935+00:00 stderr F {"component":"virt-launcher","kind":"","level":"error","msg":"Starting the VirtualMachineInstance failed.","name":"vm-ldc1-vcu1-mdn","namespace":"altiostar-4g-cu-ldc1","pos":"manager.go:1446","reason":"virError(Code=67, Domain=10, Message='unsupported configuration: host doesn't support passthrough of host PCI devices')","timestamp":"2021-03-11T05:37:42.844230Z","uid":"f1e58550-f941-4afd-a4ca-a5a3582472c2"}
2021-03-11T05:37:42.844312955+00:00 stderr F {"component":"virt-launcher","kind":"","level":"error","msg":"Failed to sync vmi","name":"vm-ldc1-vcu1-mdn","namespace":"altiostar-4g-cu-ldc1","pos":"server.go:161","reason":"virError(Code=67, Domain=10, Message='unsupported configuration: host doesn't support passthrough of host PCI devices')","timestamp":"2021-03-11T05:37:42.844287Z","uid":"f1e58550-f941-4afd-a4ca-a5a3582472c2"}
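The "pci backend driver 'default' is not supported" message from libvirt typically means the VF handed to the VM is not bound to vfio-pci on the host. A quick way to check this on the worker node (the PCI address below is a placeholder for one of your VFs):

# Is the vfio_pci module loaded at all?
lsmod | grep vfio
# Which driver is the VF actually bound to? 0000:3b:02.0 is a placeholder.
lspci -nnk -s 0000:3b:02.0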
I can see that virtualization is enabled:
[root@cudutwo-worker-1 containers]# dmesg | grep -e DMAR -e IOMMU
[ 0.000000] ACPI: DMAR 0x000000006FC0E000 000260 (v01 DELLOE DELLOSE 00000001 DELL 00000001)
[ 0.000000] DMAR: IOMMU enabled
[ 0.002004] DMAR: Host address width 46
[ 0.003003] DMAR: DRHD base: 0x000000d37fc000 flags: 0x0
[ 0.004007] DMAR: dmar0: reg_base_addr d37fc000 ver 1:0 cap 8d2078c106f0466 ecap f020df
[ 0.005003] DMAR: DRHD base: 0x000000e0ffc000 flags: 0x0
[ 0.006006] DMAR: dmar1: reg_base_addr e0ffc000 ver 1:0 cap 8d2078c106f0466 ecap f020df
[ 0.007003] DMAR: DRHD base: 0x000000ee7fc000 flags: 0x0
[ 0.008005] DMAR: dmar2: reg_base_addr ee7fc000 ver 1:0 cap 8d2078c106f0466 ecap f020df
[ 0.009003] DMAR: DRHD base: 0x000000fbffc000 flags: 0x0
[ 0.010005] DMAR: dmar3: reg_base_addr fbffc000 ver 1:0 cap 8d2078c106f0466 ecap f020df
[ 0.011003] DMAR: DRHD base: 0x000000aaffc000 flags: 0x0
[ 0.012007] DMAR: dmar4: reg_base_addr aaffc000 ver 1:0 cap 8d2078c106f0466 ecap f020df
[ 0.013003] DMAR: DRHD base: 0x000000b87fc000 flags: 0x0
[ 0.014005] DMAR: dmar5: reg_base_addr b87fc000 ver 1:0 cap 8d2078c106f0466 ecap f020df
[ 0.015003] DMAR: DRHD base: 0x000000c5ffc000 flags: 0x0
[ 0.016005] DMAR: dmar6: reg_base_addr c5ffc000 ver 1:0 cap 8d2078c106f0466 ecap f020df
[ 0.017003] DMAR: DRHD base: 0x0000009d7fc000 flags: 0x1
[ 0.018005] DMAR: dmar7: reg_base_addr 9d7fc000 ver 1:0 cap 8d2078c106f0466 ecap f020df
[ 0.019003] DMAR: RMRR base: 0x0000006a6f9000 end: 0x0000006abf8fff
[ 0.020003] DMAR: RMRR base: 0x0000006ef5f000 end: 0x0000006ef61fff
[ 0.021003] DMAR: ATSR flags: 0x0
[ 0.022003] DMAR: ATSR flags: 0x0
[ 0.023004] DMAR-IR: IOAPIC id 12 under DRHD base 0xc5ffc000 IOMMU 6
[ 0.024003] DMAR-IR: IOAPIC id 11 under DRHD base 0xb87fc000 IOMMU 5
[ 0.025003] DMAR-IR: IOAPIC id 10 under DRHD base 0xaaffc000 IOMMU 4
[ 0.026003] DMAR-IR: IOAPIC id 18 under DRHD base 0xfbffc000 IOMMU 3
[ 0.027003] DMAR-IR: IOAPIC id 17 under DRHD base 0xee7fc000 IOMMU 2
[ 0.028003] DMAR-IR: IOAPIC id 16 under DRHD base 0xe0ffc000 IOMMU 1
[ 0.029003] DMAR-IR: IOAPIC id 15 under DRHD base 0xd37fc000 IOMMU 0
[ 0.030003] DMAR-IR: IOAPIC id 8 under DRHD base 0x9d7fc000 IOMMU 7
[ 0.031003] DMAR-IR: IOAPIC id 9 under DRHD base 0x9d7fc000 IOMMU 7
[ 0.033003] DMAR-IR: HPET id 0 under DRHD base 0x9d7fc000
[ 0.034003] DMAR-IR: Queued invalidation will be enabled to support x2apic and Intr-remapping.
[ 0.037209] DMAR-IR: Enabled IRQ remapping in x2apic mode
[ 9.947562] DMAR: dmar6: Using Queued invalidation
[ 9.952665] DMAR: dmar5: Using Queued invalidation
[ 9.957766] DMAR: dmar4: Using Queued invalidation
[ 9.962867] DMAR: dmar3: Using Queued invalidation
[ 9.967965] DMAR: dmar2: Using Queued invalidation
[ 9.973069] DMAR: dmar1: Using Queued invalidation
[ 9.978171] DMAR: dmar7: Using Queued invalidation
[ 11.620354] DMAR: Intel(R) Virtualization Technology for Directed I/O
[root@cudutwo-worker-1 containers]# cat /proc/cmdline
BOOT_IMAGE=(hd0,gpt1)/ostree/rhcos-7a643d53f758c7980ee36d1e171c8396546f82ff80e3e22b1af4b9b36ed24f61/vmlinuz-4.18.0-193.41.1.el8_2.x86_64 rhcos.root=crypt_rootfs random.trust_cpu=on console=tty0 console=ttyS0,115200n8 ignition.platform.id=metal rd.luks.options=discard ostree=/ostree/boot.1/rhcos/7a643d53f758c7980ee36d1e171c8396546f82ff80e3e22b1af4b9b36ed24f61/0 skew_tick=1 nohz=on rcu_nocbs=1-19,21-39,41-59,61-79 tuned.non_isolcpus=10000100,00100001 intel_pstate=disable nosoftlockup tsc=nowatchdog intel_iommu=on iommu=pt isolcpus=managed_irq,1-19,21-39,41-59,61-79 systemd.cpu_affinity=0,40,20,60 default_hugepagesz=1G +
[root@cudutwo-worker-1 containers]# lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 80
On-line CPU(s) list: 0-79
Thread(s) per core: 2
Core(s) per socket: 20
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Gold 6230N CPU @ 2.30GHz
Stepping: 7
CPU MHz: 1000.015
BogoMIPS: 4600.00
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 28160K
NUMA node0 CPU(s): 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62,64,66,68,70,72,74,76,78
NUMA node1 CPU(s): 1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47,49,51,53,55,57,59,61,63,65,67,69,71,73,75,77,79
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single intel_ppin ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts pku ospke avx512_vnni md_clear flush_l1d arch_capabilities
[root@cudutwo-worker-1 containers]# dmesg | grep -i vfio
[44924.544631] VFIO - User Level meta-driver version: 0.3
[root@cudutwo-worker-1 containers]# ls /sys/kernel/iommu_groups/
0 102 107 111 116 120 125 13 134 16 20 25 3 34 39 43 48 52 57 61 66 70 75 8 84 89 93 98
1 103 108 112 117 121 126 130 135 17 21 26 30 35 4 44 49 53 58 62 67 71 76 80 85 9 94 99
10 104 109 113 118 122 127 131 136 18 22 27 31 36 40 45 5 54 59 63 68 72 77 81 86 90 95
100 105 11 114 119 123 128 132 14 19 23 28 32 37 41 46 50 55 6 64 69 73 78 82 87 91 96
101 106 110 115 12 124 129 133 15 2 24 29 33 38 42 47 51 56 60 65 7 74 79 83 88 92 97
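Note that the dmesg output above only shows the VFIO meta-driver, which is consistent with vfio_pci never having been loaded. As a quick test (not a fix), the module can be loaded by hand before retrying the VM start:

# Load vfio-pci manually and confirm both vfio and vfio_pci appear
modprobe vfio-pci
lsmod | grep vfio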
Update: the SriovNetworkNodePolicy's deviceType was already set to vfio-pci in our environment. I will share another update after I redeploy the cluster from scratch.
After applying the SR-IOV manifest files on the cluster deployed using the ZTP playbooks, we found that the vfio-pci driver module was not loaded on the worker nodes, and hence the CU workload VM deployment failed.
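In case it helps others hitting this: one way to force the module to load at boot, independently of the SR-IOV operator, is a MachineConfig that drops a modules-load.d entry on the workers. This is a sketch of the generic OpenShift kernel-module-loading pattern, not a config taken from the ZTP playbooks:

cat <<'EOF' | oc apply -f -
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-worker-load-vfio-pci
  labels:
    machineconfiguration.openshift.io/role: worker
spec:
  config:
    ignition:
      version: 3.1.0
    storage:
      files:
        # systemd-modules-load reads this file at boot and loads vfio-pci
        - path: /etc/modules-load.d/vfio-pci.conf
          mode: 420            # octal 0644
          contents:
            source: data:,vfio-pci
EOF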