[ibm-telco] vfio-pci module was not loaded on the worker nodes.

maheshd2 commented 3 years ago

After applying the SR-IOV manifest files on the cluster deployed with the ZTP playbooks, we found that the vfio-pci driver module was not loaded on the worker nodes, so the CU workload VM deployment failed.

yrobla commented 3 years ago

Can you provide your manifest configuration (or a link to your git repo)? Also, can you show the error logs that reflect that?

maheshd2 commented 3 years ago

@yrobla I have captured the issue details in the following git issue: https://github.com/shashiksingh/rhocp-castor/issues/143

yrobla commented 3 years ago

I get a 404 error there.

novacain1 commented 3 years ago

Please note that vfio_pci should have been loaded if the policy's deviceType is vfio-pci. I suspect you have an incorrectly configured SriovNetworkNodePolicy.

The repository you linked to is private, so not everybody has access. Please help us by not opening the same issue in two different places.
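
One way to check this on a worker node is to look for `vfio_pci` in `/proc/modules`. The following is a minimal Python sketch, not something taken from the SR-IOV operator or the ZTP playbooks. The function names are hypothetical, and it assumes the usual `/proc/modules` layout in which the module name is the first field of each line.

```python
from pathlib import Path


def loaded_modules(proc_modules_text):
    """Return the set of module names listed in /proc/modules-style text."""
    return {
        line.split()[0]
        for line in proc_modules_text.splitlines()
        if line.strip()
    }


def is_module_loaded(name, proc_modules_text):
    """Check for a module, accepting either vfio-pci or vfio_pci spelling."""
    # The kernel lists module names with underscores, so normalize hyphens.
    return name.replace("-", "_") in loaded_modules(proc_modules_text)


if __name__ == "__main__":
    proc_modules = Path("/proc/modules")
    # Only read the file where it exists (Linux hosts).
    if proc_modules.exists():
        text = proc_modules.read_text()
        print("vfio_pci loaded:", is_module_loaded("vfio-pci", text))
```

If this reports `False` on a node whose policy uses deviceType vfio-pci, the policy and its node selector are a reasonable place to look next.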

maheshd2 commented 3 years ago

@novacain1 I will have a look at the configurations and try again. Thanks.

For the record, I'm adding the failure messages here.

We see this issue when we deploy the CU workload manifest (with a dummy OS image instead of the actual image). Currently we see the error on the Intel cards. We first saw it on the Mellanox cards and then tried the Intel cards.

2021-03-11T05:37:42.843485209+00:00 stderr F {"component":"virt-launcher","level":"error","msg":"unsupported configuration: host doesn't support passthrough of host PCI devices","pos":"qemuHostdevPreparePCIDevicesCheckSupport:187","subcomponent":"libvirt","thread":"22","timestamp":"2021-03-11T05:37:42.843000Z"}
2021-03-11T05:37:42.843853049+00:00 stderr F {"component":"virt-launcher","level":"error","msg":"unsupported configuration: pci backend driver 'default' is not supported","pos":"virHostdevGetPCIHostDevice:253","subcomponent":"libvirt","thread":"22","timestamp":"2021-03-11T05:37:42.843000Z"}
2021-03-11T05:37:42.843860735+00:00 stderr F {"component":"virt-launcher","level":"error","msg":"Failed to allocate PCI device list: unsupported configuration: pci backend driver 'default' is not supported","pos":"virHostdevReAttachPCIDevices:1089","subcomponent":"libvirt","thread":"22","timestamp":"2021-03-11T05:37:42.843000Z"}
2021-03-11T05:37:42.844278935+00:00 stderr F {"component":"virt-launcher","kind":"","level":"error","msg":"Starting the VirtualMachineInstance failed.","name":"vm-ldc1-vcu1-mdn","namespace":"altiostar-4g-cu-ldc1","pos":"manager.go:1446","reason":"virError(Code=67, Domain=10, Message='unsupported configuration: host doesn't support passthrough of host PCI devices')","timestamp":"2021-03-11T05:37:42.844230Z","uid":"f1e58550-f941-4afd-a4ca-a5a3582472c2"}
2021-03-11T05:37:42.844312955+00:00 stderr F {"component":"virt-launcher","kind":"","level":"error","msg":"Failed to sync vmi","name":"vm-ldc1-vcu1-mdn","namespace":"altiostar-4g-cu-ldc1","pos":"server.go:161","reason":"virError(Code=67, Domain=10, Message='unsupported configuration: host doesn't support passthrough of host PCI devices')","timestamp":"2021-03-11T05:37:42.844287Z","uid":"f1e58550-f941-4afd-a4ca-a5a3582472c2"}
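
Log lines like the ones above carry a JSON payload after the timestamp and stream prefix, so the distinct error messages can be pulled out with a short script. This is a rough Python sketch with hypothetical helper names. It assumes every record of interest is a JSON object that starts at the first `{` on the line and has `level` and `msg` fields, as in the lines shown here.

```python
import json


def parse_log_line(line):
    """Return the JSON record embedded in a log line, or None if absent."""
    start = line.find("{")
    if start == -1:
        return None
    try:
        return json.loads(line[start:])
    except json.JSONDecodeError:
        return None


def error_messages(lines):
    """Collect distinct 'msg' values from error-level records, in order."""
    seen = []
    for line in lines:
        record = parse_log_line(line)
        if record and record.get("level") == "error":
            msg = record.get("msg", "")
            if msg not in seen:
                seen.append(msg)
    return seen
```

Feeding it the saved virt-launcher log, for example `error_messages(open(path))` with `path` pointing at a local copy, gives a short list of unique failures instead of repeated lines.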

I can see that virtualization is enabled:

[root@cudutwo-worker-1 containers]#
[root@cudutwo-worker-1 containers]# dmesg | grep -e DMAR -e IOMMU
[    0.000000] ACPI: DMAR 0x000000006FC0E000 000260 (v01 DELLOE DELLOSE  00000001 DELL 00000001)
[    0.000000] DMAR: IOMMU enabled
[    0.002004] DMAR: Host address width 46
[    0.003003] DMAR: DRHD base: 0x000000d37fc000 flags: 0x0
[    0.004007] DMAR: dmar0: reg_base_addr d37fc000 ver 1:0 cap 8d2078c106f0466 ecap f020df
[    0.005003] DMAR: DRHD base: 0x000000e0ffc000 flags: 0x0
[    0.006006] DMAR: dmar1: reg_base_addr e0ffc000 ver 1:0 cap 8d2078c106f0466 ecap f020df
[    0.007003] DMAR: DRHD base: 0x000000ee7fc000 flags: 0x0
[    0.008005] DMAR: dmar2: reg_base_addr ee7fc000 ver 1:0 cap 8d2078c106f0466 ecap f020df
[    0.009003] DMAR: DRHD base: 0x000000fbffc000 flags: 0x0
[    0.010005] DMAR: dmar3: reg_base_addr fbffc000 ver 1:0 cap 8d2078c106f0466 ecap f020df
[    0.011003] DMAR: DRHD base: 0x000000aaffc000 flags: 0x0
[    0.012007] DMAR: dmar4: reg_base_addr aaffc000 ver 1:0 cap 8d2078c106f0466 ecap f020df
[    0.013003] DMAR: DRHD base: 0x000000b87fc000 flags: 0x0
[    0.014005] DMAR: dmar5: reg_base_addr b87fc000 ver 1:0 cap 8d2078c106f0466 ecap f020df
[    0.015003] DMAR: DRHD base: 0x000000c5ffc000 flags: 0x0
[    0.016005] DMAR: dmar6: reg_base_addr c5ffc000 ver 1:0 cap 8d2078c106f0466 ecap f020df
[    0.017003] DMAR: DRHD base: 0x0000009d7fc000 flags: 0x1
[    0.018005] DMAR: dmar7: reg_base_addr 9d7fc000 ver 1:0 cap 8d2078c106f0466 ecap f020df
[    0.019003] DMAR: RMRR base: 0x0000006a6f9000 end: 0x0000006abf8fff
[    0.020003] DMAR: RMRR base: 0x0000006ef5f000 end: 0x0000006ef61fff
[    0.021003] DMAR: ATSR flags: 0x0
[    0.022003] DMAR: ATSR flags: 0x0
[    0.023004] DMAR-IR: IOAPIC id 12 under DRHD base  0xc5ffc000 IOMMU 6
[    0.024003] DMAR-IR: IOAPIC id 11 under DRHD base  0xb87fc000 IOMMU 5
[    0.025003] DMAR-IR: IOAPIC id 10 under DRHD base  0xaaffc000 IOMMU 4
[    0.026003] DMAR-IR: IOAPIC id 18 under DRHD base  0xfbffc000 IOMMU 3
[    0.027003] DMAR-IR: IOAPIC id 17 under DRHD base  0xee7fc000 IOMMU 2
[    0.028003] DMAR-IR: IOAPIC id 16 under DRHD base  0xe0ffc000 IOMMU 1
[    0.029003] DMAR-IR: IOAPIC id 15 under DRHD base  0xd37fc000 IOMMU 0
[    0.030003] DMAR-IR: IOAPIC id 8 under DRHD base  0x9d7fc000 IOMMU 7
[    0.031003] DMAR-IR: IOAPIC id 9 under DRHD base  0x9d7fc000 IOMMU 7
[    0.033003] DMAR-IR: HPET id 0 under DRHD base 0x9d7fc000
[    0.034003] DMAR-IR: Queued invalidation will be enabled to support x2apic and Intr-remapping.
[    0.037209] DMAR-IR: Enabled IRQ remapping in x2apic mode
[    9.947562] DMAR: dmar6: Using Queued invalidation
[    9.952665] DMAR: dmar5: Using Queued invalidation
[    9.957766] DMAR: dmar4: Using Queued invalidation
[    9.962867] DMAR: dmar3: Using Queued invalidation
[    9.967965] DMAR: dmar2: Using Queued invalidation
[    9.973069] DMAR: dmar1: Using Queued invalidation
[    9.978171] DMAR: dmar7: Using Queued invalidation
[   11.620354] DMAR: Intel(R) Virtualization Technology for Directed I/O

[root@cudutwo-worker-1 containers]# cat /proc/cmdline
BOOT_IMAGE=(hd0,gpt1)/ostree/rhcos-7a643d53f758c7980ee36d1e171c8396546f82ff80e3e22b1af4b9b36ed24f61/vmlinuz-4.18.0-193.41.1.el8_2.x86_64 rhcos.root=crypt_rootfs random.trust_cpu=on console=tty0 console=ttyS0,115200n8 ignition.platform.id=metal rd.luks.options=discard ostree=/ostree/boot.1/rhcos/7a643d53f758c7980ee36d1e171c8396546f82ff80e3e22b1af4b9b36ed24f61/0 skew_tick=1 nohz=on rcu_nocbs=1-19,21-39,41-59,61-79 tuned.non_isolcpus=10000100,00100001 intel_pstate=disable nosoftlockup tsc=nowatchdog intel_iommu=on iommu=pt isolcpus=managed_irq,1-19,21-39,41-59,61-79 systemd.cpu_affinity=0,40,20,60 default_hugepagesz=1G +
[root@cudutwo-worker-1 containers]#
[root@cudutwo-worker-1 containers]# lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              80
On-line CPU(s) list: 0-79
Thread(s) per core:  2
Core(s) per socket:  20
Socket(s):           2
NUMA node(s):        2
Vendor ID:           GenuineIntel
CPU family:          6
Model:               85
Model name:          Intel(R) Xeon(R) Gold 6230N CPU @ 2.30GHz
Stepping:            7
CPU MHz:             1000.015
BogoMIPS:            4600.00
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            1024K
L3 cache:            28160K
NUMA node0 CPU(s):   0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62,64,66,68,70,72,74,76,78
NUMA node1 CPU(s):   1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47,49,51,53,55,57,59,61,63,65,67,69,71,73,75,77,79
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single intel_ppin ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts pku ospke avx512_vnni md_clear flush_l1d arch_capabilities

[root@cudutwo-worker-1 containers]# dmesg | grep -i vfio
[44924.544631] VFIO - User Level meta-driver version: 0.3

[root@cudutwo-worker-1 containers]# ls /sys/kernel/iommu_groups/
0    102  107  111  116  120  125  13   134  16  20  25  3   34  39  43  48  52  57  61  66  70  75  8   84  89  93  98
1    103  108  112  117  121  126  130  135  17  21  26  30  35  4   44  49  53  58  62  67  71  76  80  85  9   94  99
10   104  109  113  118  122  127  131  136  18  22  27  31  36  40  45  5   54  59  63  68  72  77  81  86  90  95
100  105  11   114  119  123  128  132  14   19  23  28  32  37  41  46  50  55  6   64  69  73  78  82  87  91  96
101  106  110  115  12   124  129  133  15   2   24  29  33  38  42  47  51  56  60  65  7   74  79  83  88  92  97
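
The kernel command line shown earlier can also be checked programmatically, for example to confirm that `intel_iommu=on` and `iommu=pt` are present on every worker. The sketch below is an illustrative Python helper with hypothetical function names. It assumes simple whitespace-separated `key=value` tokens and keeps only the last value when a key repeats (as `console=` does in the output above).

```python
from pathlib import Path


def parse_cmdline(text):
    """Split a kernel command line into a dict of parameters.

    Flags without '=' map to None; repeated keys keep the last value.
    """
    params = {}
    for token in text.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


def iommu_requested(text):
    """Return True if the command line contains intel_iommu=on."""
    return parse_cmdline(text).get("intel_iommu") == "on"


if __name__ == "__main__":
    cmdline = Path("/proc/cmdline")
    # Only read the file where it exists (Linux hosts).
    if cmdline.exists():
        text = cmdline.read_text()
        print("intel_iommu=on:", iommu_requested(text))
        print("iommu:", parse_cmdline(text).get("iommu"))
```

This only shows what was requested at boot. Whether the IOMMU is actually active is what the `dmesg` output and the populated `/sys/kernel/iommu_groups/` directory indicate.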

maheshd2 commented 3 years ago

Update: the SriovNetworkNodePolicy's deviceType was vfio-pci in our environment. I will share another update after I freshly redeploy the cluster.

redhat-ztp / ztp-cluster-deploy

[ibm-telco] vfio-pci module was not loaded on the worker nodes. #85