openshift / cluster-api-provider-libvirt


After update to RHEL 8.5 + latest virt:av from 8.5, libvirt IPI no longer works #231

Closed ElCoyote27 closed 2 years ago

ElCoyote27 commented 2 years ago

This is a copy of https://github.com/openshift/installer/issues/5401#

ElCoyote27 commented 2 years ago

Hi,

I have been using the Ansible-based ocp_libvirt_ipi role for some time on RHEL (7.8, then 7.9, 8.2 and 8.3). The role leverages the following code from the OpenShift installer source tree:

      shell: |
        cd {{ kvm_workdir }}/go/src/github.com/openshift/installer/
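
For context, the role essentially rebuilds the installer from that checkout with libvirt support enabled. A rough sketch of that step (following the installer's libvirt development notes; kvm_workdir is the role variable above):

    $ cd {{ kvm_workdir }}/go/src/github.com/openshift/installer/
    $ TAGS=libvirt hack/build.sh    # produces bin/openshift-install with the libvirt platform enabled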

Ever since patching my RHEL 8.5 hypervisors to the latest libvirt* packages from the 'virt:av' stream, ocp_libvirt_ipi has been unable to deploy successfully: Terraform works, but the freshly installed set of masters is unable to spawn 'workers'.

A broken cluster looks like this:

[root@palanthas ~]# virsh list
 Id   Name                   State
--------------------------------------
 1    dc03                   running
 2    dc02                   running
 3    ocp4p-pmrl7-master-2   running
 5    ocp4p-pmrl7-master-0   running
 6    ocp4p-pmrl7-master-1   running

A working cluster looks like this (for me):

 Id   Name                         State
--------------------------------------------
 1    dc02                         running
 2    dc03                         running
 10   ocp4p-wtvsg-master-2         running
 11   ocp4p-wtvsg-master-0         running
 12   ocp4p-wtvsg-master-1         running
 13   ocp4p-wtvsg-worker-0-wv8ds   running
 14   ocp4p-wtvsg-worker-0-xbdrc   running
 15   ocp4p-wtvsg-worker-0-9trjv   running
 19   ocp4p-wtvsg-infra-0-52qwp    running
 20   ocp4p-wtvsg-infra-0-92mv5    running
 21   ocp4p-wtvsg-infra-0-lkmgc    running

I have reproduced this with the code from OCP 4.6, 4.7 and 4.8, and the results are the same.

The issue started occurring when the libvirt packages on my RHEL 8.5 hypervisors were updated from:

ElCoyote27 commented 2 years ago

On a fresh cluster which failed to launch the workers, I see this:

[root@daltigoth ~]# oc get events|grep worker
3h2m        Warning   FailedCreate             machine/ocp4d-nrgq6-worker-0-thlbw                  CreateError
178m        Warning   FailedCreate             machine/ocp4d-nrgq6-worker-0-thlbw                  CreateError
3m26s       Warning   FailedCreate             machine/ocp4d-nrgq6-worker-0-thlbw                  CreateError
3h2m        Warning   FailedCreate             machine/ocp4d-nrgq6-worker-0-xf8w4                  CreateError
178m        Warning   FailedCreate             machine/ocp4d-nrgq6-worker-0-xf8w4                  CreateError
3m32s       Warning   FailedCreate             machine/ocp4d-nrgq6-worker-0-xf8w4                  CreateError
3h2m        Warning   FailedCreate             machine/ocp4d-nrgq6-worker-0-zw4dx                  CreateError
178m        Warning   FailedCreate             machine/ocp4d-nrgq6-worker-0-zw4dx                  CreateError
3m28s       Warning   FailedCreate             machine/ocp4d-nrgq6-worker-0-zw4dx                  CreateError
ElCoyote27 commented 2 years ago
# oc logs  machine/ocp4d-nrgq6-worker-0-thlbw     
error: no kind "Machine" is registered for version "machine.openshift.io/v1beta1" in scheme "k8s.io/kubectl/pkg/scheme/scheme.go:28"
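
(Side note: the machine object itself has no pod logs; the provider-side errors end up in the machine-api controller logs instead, e.g. something along the lines of:)

$ oc logs -n openshift-machine-api deployment/machine-api-controllers -c machine-controller
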
ElCoyote27 commented 2 years ago
NAMESPACE               NAME                         PHASE          TYPE   REGION   ZONE   AGE
openshift-machine-api   ocp4d-nrgq6-master-0         Running                               3h37m
openshift-machine-api   ocp4d-nrgq6-master-1         Running                               3h37m
openshift-machine-api   ocp4d-nrgq6-master-2         Running                               3h37m
openshift-machine-api   ocp4d-nrgq6-worker-0-thlbw   Provisioning                          3h34m
openshift-machine-api   ocp4d-nrgq6-worker-0-xf8w4   Provisioning                          3h34m
openshift-machine-api   ocp4d-nrgq6-worker-0-zw4dx   Provisioning                          3h34m

And:

 Name                                  Path
--------------------------------------------------------------------------------------------------------------------------
 ocp4d-nrgq6-base                      /var/lib/libvirt/openshift-images/ocp4d-nrgq6/ocp4d-nrgq6-base
 ocp4d-nrgq6-master-0                  /var/lib/libvirt/openshift-images/ocp4d-nrgq6/ocp4d-nrgq6-master-0
 ocp4d-nrgq6-master-1                  /var/lib/libvirt/openshift-images/ocp4d-nrgq6/ocp4d-nrgq6-master-1
 ocp4d-nrgq6-master-2                  /var/lib/libvirt/openshift-images/ocp4d-nrgq6/ocp4d-nrgq6-master-2
 ocp4d-nrgq6-master.ign                /var/lib/libvirt/openshift-images/ocp4d-nrgq6/ocp4d-nrgq6-master.ign
 ocp4d-nrgq6-worker-0-thlbw.ignition   /var/lib/libvirt/openshift-images/ocp4d-nrgq6/ocp4d-nrgq6-worker-0-thlbw.ignition
 ocp4d-nrgq6-worker-0-xf8w4            /var/lib/libvirt/openshift-images/ocp4d-nrgq6/ocp4d-nrgq6-worker-0-xf8w4
 ocp4d-nrgq6-worker-0-xf8w4.ignition   /var/lib/libvirt/openshift-images/ocp4d-nrgq6/ocp4d-nrgq6-worker-0-xf8w4.ignition
 ocp4d-nrgq6-worker-0-zw4dx.ignition   /var/lib/libvirt/openshift-images/ocp4d-nrgq6/ocp4d-nrgq6-worker-0-zw4dx.ignition

And then:

{
  "ignition": {
    "config": {
      "merge": [
        {
          "source": "https://api-int.ocp4d.openshift.lasthome.solace.krynn:22623/config/worker"
        }
      ]
    },
    "security": {
      "tls": {
        "certificateAuthorities": [
          {
            "source": "data:text/plain;charset=utf-8;base64,LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSURFRENDQWZpZ0F3SUJBZ0lJTC9KNm5tUGx6Tk13RFFZSktvWklodmNOQVFFTEJRQXdKakVTTUJBR0ExVUUKQ3hNSmIzQmxibk5vYVdaME1SQXdEZ1lEVlFRR
EV3ZHliMjkwTFdOaE1CNFhEVEl4TVRFeE9URTJOREl3TUZvWApEVE14TVRFeE56RTJOREl3TUZvd0pqRVNNQkFHQTFVRUN4TUpiM0JsYm5Ob2FXWjBNUkF3RGdZRFZRUURFd2R5CmIyOTBMV05oTUlJQklqQU5CZ2txaGtpRzl3MEJBUUVGQUFPQ0FROEFNSUlCQ2dLQ0FRRUF1SjU1RWxYUC82Uz
kKTTkvWWViUC9neGxQeEFlNFJVYzhpbGdDaWdKV0ZLZFp5QnhTOXQzQ1AyU0d5REx5S2FoT0pNRTNSSWZKK3FhbAo3YjI2ZEpUVXQ4a0daYUhpbTQyOGRHYXdiOTRzQjJWWWJUNGpPNGl3enM3Tm9udHdjY1NQQllRdk9XcmsxekhPCmtyQXRBcGFtbWp1bU1yTTRhNFRCMVZWVG5NQnA3eVZZcUV
kZ3l0Z29PZnA3VVN6aTVKK3p5VHhxKzdEN2FXejcKY0Z4RC9PN0x1ZFBMNEpkNVlnWGlIbjg5NHREdTFLQVl5RHVYYnNrZCtITUFFZGQycVFPZUZNNW1TZlpWQ3pZMwowSU1aemxMRDdvL0ZJOFRqS2hkN2NjSFZocks5YXVCdHhBbVREMjZBVUxQSEdQbWhyNHdqaVc0bWZZR3pyVmo0Cmtjb3JK
SEsyMndJREFRQUJvMEl3UURBT0JnTlZIUThCQWY4RUJBTUNBcVF3RHdZRFZSMFRBUUgvQkFVd0F3RUIKL3pBZEJnTlZIUTRFRmdRVW5PcTZMRC9ycXRxMlZWMnZtdjNwdFZQVmZ1OHdEUVlKS29aSWh2Y05BUUVMQlFBRApnZ0VCQUZ1WlRJTnVVR01KTDBZSkRDT3h1ZS8rbnM4QUhtR1l2dDA4Q
XZyNmJQQkNIZlJ0L1lBbHJHRzkzS1RVClZVUFJIeVdFVlNNSWU3bEt4bWlvMERHSFFMNDBYaWxuQjBaVExOdE5yUkVLN3JEM1M2NFRXTjd5YTNIYVhtQnUKSTFZNFFsUGFacUFqR3R1YmJveGY2N1NUaHFsL09IcVNGdkxzcUo2NFAwQW0yQ3hGb3N3N2VpSW9uMWJkNEErMgpsdlJqdlJQMWYxQ0
xrWTlTREJoRlVUVmwyTGNKMmlIUXVVc1cvNEJWM1owWmp1dmREbDFVWnRFZFVPUWxOdkliCnBITUFieXIwQXpxdWZwN2taODZkUHQzNm80dDJTeVpDY1VpY3RwTmZTYzhyWFZzbUU0S1NjcGdxSEd6KzFybVgKK255WmNDTE8rM0dwK2w5RTBjRWMyYTEyQ0lzPQotLS0tLUVORCBDRVJUSUZJQ0F
URS0tLS0tCg=="
          }
        ]
      }
    },
    "version": "3.2.0"
  }
}
ElCoyote27 commented 2 years ago

@staebler Yes, there are errors, let me get them to you..

This is really strange, as I can set LIBVIRT_DEFAULT_URI to the same value I set in my install-config, and 'virsh start/stop/shutdown/whatever' works fine:

$ grep URI ocp-config/install-config-daltigoth.yaml 
    URI: qemu+tcp://172.21.122.1/system

virt-OCP]$ export LIBVIRT_DEFAULT_URI=qemu+tcp://172.21.122.1/system
virt-OCP]$ virsh list
 Id   Name       State
--------------------------
 1    idm        running
 2    vom7       running
 3    registry   running
ElCoyote27 commented 2 years ago

It first starts like this:

# oc get machines -A
NAMESPACE               NAME                         PHASE          TYPE   REGION   ZONE   AGE
openshift-machine-api   ocp4d-c5tvf-master-0         Running                               15m
openshift-machine-api   ocp4d-c5tvf-master-1         Running                               15m
openshift-machine-api   ocp4d-c5tvf-master-2         Running                               15m
openshift-machine-api   ocp4d-c5tvf-worker-0-d7dzc   Provisioning                          12m
openshift-machine-api   ocp4d-c5tvf-worker-0-s8m7x   Provisioning                          12m
openshift-machine-api   ocp4d-c5tvf-worker-0-skd9j   Provisioning                          12m

At that time, I am getting the following YAML:

apiVersion: v1
items:
- apiVersion: machine.openshift.io/v1beta1
  kind: Machine
  metadata:
    creationTimestamp: "2021-11-25T21:20:19Z"
    finalizers:
    - machine.machine.openshift.io
    generation: 1
    labels:
      machine.openshift.io/cluster-api-cluster: ocp4d-c5tvf
      machine.openshift.io/cluster-api-machine-role: master
      machine.openshift.io/cluster-api-machine-type: master
    name: ocp4d-c5tvf-master-0
    namespace: openshift-machine-api
    resourceVersion: "23000"
    uid: 082eb65f-ef9d-4962-a94d-b5fa54dbd100
  spec:
    metadata: {}
    providerSpec:
      value:
        apiVersion: libvirtproviderconfig.openshift.io/v1beta1
        autostart: false
        cloudInit: null
        domainMemory: 24576
        domainVcpu: 8
        ignKey: ""
        ignition:
          userDataSecret: master-user-data
        kind: LibvirtMachineProviderConfig
        networkInterfaceAddress: 192.168.126.0/24
        networkInterfaceHostname: ""
        networkInterfaceName: ocp4d-c5tvf
        networkUUID: ""
        uri: qemu+tcp://172.21.122.1/system
        volume:
          baseVolumeID: ocp4d-c5tvf-base
          poolName: ocp4d-c5tvf
          volumeName: ""
          volumeSize: 274877906944
  status:
    addresses:
    - address: 192.168.126.11
      type: InternalIP
    - address: ocp4d-c5tvf-master-0
      type: Hostname
    - address: ocp4d-c5tvf-master-0
      type: InternalDNS
    lastUpdated: "2021-11-25T21:36:47Z"
    nodeRef:
      kind: Node
      name: ocp4d-c5tvf-master-0
      uid: e5c4d2dc-c60d-4e6c-84c6-534e32627c9f
    phase: Running
    providerStatus:
      apiVersion: libvirtproviderconfig.openshift.io/v1beta1
      conditions: null
      instanceID: 7512f14c-cecc-4a8a-a648-24c3640ccec6
      instanceState: Running
      kind: LibvirtMachineProviderStatus
- apiVersion: machine.openshift.io/v1beta1
  kind: Machine
  metadata:
    creationTimestamp: "2021-11-25T21:20:18Z"
    finalizers:
    - machine.machine.openshift.io
    generation: 1
    labels:
      machine.openshift.io/cluster-api-cluster: ocp4d-c5tvf
      machine.openshift.io/cluster-api-machine-role: master
      machine.openshift.io/cluster-api-machine-type: master
    name: ocp4d-c5tvf-master-1
    namespace: openshift-machine-api
    resourceVersion: "23002"
    uid: 32a9e031-659d-4574-a40f-b74e4ff75aa2
  spec:
    metadata: {}
    providerSpec:
      value:
        apiVersion: libvirtproviderconfig.openshift.io/v1beta1
        autostart: false
        cloudInit: null
        domainMemory: 8192
        domainVcpu: 4
        ignKey: ""
        ignition:
          userDataSecret: master-user-data
        kind: LibvirtMachineProviderConfig
        networkInterfaceAddress: 192.168.126.0/24
        networkInterfaceHostname: ""
        networkInterfaceName: ocp4d-c5tvf
        networkUUID: ""
        uri: qemu+tcp://172.21.122.1/system
        volume:
          baseVolumeID: ocp4d-c5tvf-base
          poolName: ocp4d-c5tvf
          volumeName: ""
  status:
    addresses:
    - address: 192.168.126.12
      type: InternalIP
    - address: ocp4d-c5tvf-master-1
      type: Hostname
    - address: ocp4d-c5tvf-master-1
      type: InternalDNS
    lastUpdated: "2021-11-25T21:36:47Z"
    nodeRef:
      kind: Node
      name: ocp4d-c5tvf-master-1
      uid: 17ef4804-c5d5-401b-978c-fef91f715169
    phase: Running
    providerStatus:
      apiVersion: libvirtproviderconfig.openshift.io/v1beta1
      conditions: null
      instanceID: 8dddfb58-7e38-4105-a98a-6cb085fff45e
      instanceState: Running
      kind: LibvirtMachineProviderStatus
- apiVersion: machine.openshift.io/v1beta1
  kind: Machine
  metadata:
    creationTimestamp: "2021-11-25T21:20:18Z"
    finalizers:
    - machine.machine.openshift.io
    generation: 1
    labels:
      machine.openshift.io/cluster-api-cluster: ocp4d-c5tvf
      machine.openshift.io/cluster-api-machine-role: master
      machine.openshift.io/cluster-api-machine-type: master
    name: ocp4d-c5tvf-master-2
    namespace: openshift-machine-api
    resourceVersion: "23004"
    uid: a40f806d-09fd-4409-90c2-05d3deafbd02
  spec:
    metadata: {}
    providerSpec:
      value:
        apiVersion: libvirtproviderconfig.openshift.io/v1beta1
        autostart: false
        cloudInit: null
        domainMemory: 8192
        domainVcpu: 4
        ignKey: ""
        ignition:
          userDataSecret: master-user-data
        kind: LibvirtMachineProviderConfig
        networkInterfaceAddress: 192.168.126.0/24
        networkInterfaceHostname: ""
        networkInterfaceName: ocp4d-c5tvf
        networkUUID: ""
        uri: qemu+tcp://172.21.122.1/system
        volume:
          baseVolumeID: ocp4d-c5tvf-base
          poolName: ocp4d-c5tvf
          volumeName: ""
  status:
    addresses:
    - address: 192.168.126.13
      type: InternalIP
    - address: ocp4d-c5tvf-master-2
      type: Hostname
    - address: ocp4d-c5tvf-master-2
      type: InternalDNS
    lastUpdated: "2021-11-25T21:36:47Z"
    nodeRef:
      kind: Node
      name: ocp4d-c5tvf-master-2
      uid: a4f8fbec-17f6-4f3a-9579-458d0febb950
    phase: Running
    providerStatus:
      apiVersion: libvirtproviderconfig.openshift.io/v1beta1
      conditions: null
      instanceID: 8518304c-7e8b-46b0-b18c-1a86a7ba6f9b
      instanceState: Running
      kind: LibvirtMachineProviderStatus
- apiVersion: machine.openshift.io/v1beta1
  kind: Machine
  metadata:
    creationTimestamp: "2021-11-25T21:23:24Z"
    finalizers:
    - machine.machine.openshift.io
    generateName: ocp4d-c5tvf-worker-0-
    generation: 1
    labels:
      machine.openshift.io/cluster-api-cluster: ocp4d-c5tvf
      machine.openshift.io/cluster-api-machine-role: worker
      machine.openshift.io/cluster-api-machine-type: worker
      machine.openshift.io/cluster-api-machineset: ocp4d-c5tvf-worker-0
    name: ocp4d-c5tvf-worker-0-d7dzc
    namespace: openshift-machine-api
    ownerReferences:
    - apiVersion: machine.openshift.io/v1beta1
      blockOwnerDeletion: true
      controller: true
      kind: MachineSet
      name: ocp4d-c5tvf-worker-0
      uid: 05270d70-da6f-4b45-84f7-fc519e388f6b
    resourceVersion: "8467"
    uid: 7212d72a-50f6-42ef-a0df-555128dbd226
  spec:
    metadata: {}
    providerSpec:
      value:
        apiVersion: libvirtproviderconfig.openshift.io/v1beta1
        autostart: false
        cloudInit: null
        domainMemory: 49152
        domainVcpu: 8
        ignKey: ""
        ignition:
          userDataSecret: worker-user-data
        kind: LibvirtMachineProviderConfig
        networkInterfaceAddress: 192.168.126.0/24
        networkInterfaceHostname: ""
        networkInterfaceName: ocp4d-c5tvf
        networkUUID: ""
        uri: qemu+tcp://172.21.122.1/system
        volume:
          baseVolumeID: ocp4d-c5tvf-base
          poolName: ocp4d-c5tvf
          volumeName: ""
          volumeSize: 274877906944
  status:
    lastUpdated: "2021-11-25T21:23:27Z"
    phase: Provisioning
- apiVersion: machine.openshift.io/v1beta1
  kind: Machine
  metadata:
    creationTimestamp: "2021-11-25T21:23:24Z"
    finalizers:
    - machine.machine.openshift.io
    generateName: ocp4d-c5tvf-worker-0-
    generation: 1
    labels:
      machine.openshift.io/cluster-api-cluster: ocp4d-c5tvf
      machine.openshift.io/cluster-api-machine-role: worker
      machine.openshift.io/cluster-api-machine-type: worker
      machine.openshift.io/cluster-api-machineset: ocp4d-c5tvf-worker-0
    name: ocp4d-c5tvf-worker-0-s8m7x
    namespace: openshift-machine-api
    ownerReferences:
    - apiVersion: machine.openshift.io/v1beta1
      blockOwnerDeletion: true
      controller: true
      kind: MachineSet
      name: ocp4d-c5tvf-worker-0
      uid: 05270d70-da6f-4b45-84f7-fc519e388f6b
    resourceVersion: "8367"
    uid: c2228b56-78a6-475f-af59-01dc1670b9ad
  spec:
    metadata: {}
    providerSpec:
      value:
        apiVersion: libvirtproviderconfig.openshift.io/v1beta1
        autostart: false
        cloudInit: null
        domainMemory: 49152
        domainVcpu: 8
        ignKey: ""
        ignition:
          userDataSecret: worker-user-data
        kind: LibvirtMachineProviderConfig
        networkInterfaceAddress: 192.168.126.0/24
        networkInterfaceHostname: ""
        networkInterfaceName: ocp4d-c5tvf
        networkUUID: ""
        uri: qemu+tcp://172.21.122.1/system
        volume:
          baseVolumeID: ocp4d-c5tvf-base
          poolName: ocp4d-c5tvf
          volumeName: ""
          volumeSize: 274877906944
  status:
    lastUpdated: "2021-11-25T21:23:26Z"
    phase: Provisioning
- apiVersion: machine.openshift.io/v1beta1
  kind: Machine
  metadata:
    creationTimestamp: "2021-11-25T21:23:24Z"
    finalizers:
    - machine.machine.openshift.io
    generateName: ocp4d-c5tvf-worker-0-
    generation: 1
    labels:
      machine.openshift.io/cluster-api-cluster: ocp4d-c5tvf
      machine.openshift.io/cluster-api-machine-role: worker
      machine.openshift.io/cluster-api-machine-type: worker
      machine.openshift.io/cluster-api-machineset: ocp4d-c5tvf-worker-0
    name: ocp4d-c5tvf-worker-0-skd9j
    namespace: openshift-machine-api
    ownerReferences:
    - apiVersion: machine.openshift.io/v1beta1
      blockOwnerDeletion: true
      controller: true
      kind: MachineSet
      name: ocp4d-c5tvf-worker-0
      uid: 05270d70-da6f-4b45-84f7-fc519e388f6b
    resourceVersion: "8528"
    uid: 89562158-3bb6-4245-9640-ce8d337d3219
  spec:
    metadata: {}
    providerSpec:
      value:
        apiVersion: libvirtproviderconfig.openshift.io/v1beta1
        autostart: false
        cloudInit: null
        domainMemory: 49152
        domainVcpu: 8
        ignKey: ""
        ignition:
          userDataSecret: worker-user-data
        kind: LibvirtMachineProviderConfig
        networkInterfaceAddress: 192.168.126.0/24
        networkInterfaceHostname: ""
        networkInterfaceName: ocp4d-c5tvf
        networkUUID: ""
        uri: qemu+tcp://172.21.122.1/system
        volume:
          baseVolumeID: ocp4d-c5tvf-base
          poolName: ocp4d-c5tvf
          volumeName: ""
          volumeSize: 274877906944
  status:
    lastUpdated: "2021-11-25T21:23:29Z"
    phase: Provisioning
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

ElCoyote27 commented 2 years ago
# oc describe machine ocp4d-c5tvf-worker-0-d7dzc|tail -10
        Volume Size:     274877906944
Status:
  Last Updated:  2021-11-25T21:23:27Z
  Phase:         Provisioning
Events:
  Type     Reason        Age                   From                Message
  ----     ------        ----                  ----                -------
  Warning  FailedCreate  21m (x2 over 21m)     libvirt-controller  CreateError
  Warning  FailedCreate  17m (x15 over 18m)    libvirt-controller  CreateError
  Warning  FailedCreate  2m42s (x19 over 14m)  libvirt-controller  CreateError
ElCoyote27 commented 2 years ago

There's also one machineset (workers only):

# oc get machinesets 
NAME                   DESIRED   CURRENT   READY   AVAILABLE   AGE
ocp4d-c5tvf-worker-0   3         3                             28m
ElCoyote27 commented 2 years ago

I'm also seeing these messages in the system log:

Nov 25 16:52:36 daltigoth libvirtd[1099001]: Operation not supported: can't update 'bridge' section of network 'ocp4d-c5tvf'
Nov 25 16:52:39 daltigoth libvirtd[1099001]: Operation not supported: can't update 'bridge' section of network 'ocp4d-c5tvf'
Nov 25 16:52:40 daltigoth libvirtd[1099001]: Operation not supported: can't update 'bridge' section of network 'ocp4d-c5tvf'
Nov 25 16:52:41 daltigoth libvirtd[1099001]: Operation not supported: can't update 'bridge' section of network 'ocp4d-c5tvf'
Nov 25 16:52:49 daltigoth libvirtd[1099001]: Operation not supported: can't update 'bridge' section of network 'ocp4d-c5tvf'
Nov 25 16:52:50 daltigoth libvirtd[1099001]: Operation not supported: can't update 'bridge' section of network 'ocp4d-c5tvf'
Nov 25 16:52:51 daltigoth libvirtd[1099001]: Operation not supported: can't update 'bridge' section of network 'ocp4d-c5tvf'
Nov 25 16:53:10 daltigoth libvirtd[1099001]: Operation not supported: can't update 'bridge' section of network 'ocp4d-c5tvf'
Nov 25 16:53:11 daltigoth libvirtd[1099001]: Operation not supported: can't update 'bridge' section of network 'ocp4d-c5tvf'
Nov 25 16:53:12 daltigoth libvirtd[1099001]: Operation not supported: can't update 'bridge' section of network 'ocp4d-c5tvf'
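
These can also be followed live on the hypervisor while a deployment is in progress, for example (assuming the monolithic libvirtd service, as used here):

$ journalctl -u libvirtd -f | grep "can't update"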

ElCoyote27 commented 2 years ago

This seems somewhat similar to: https://github.com/digitalocean/go-libvirt/issues/87

The network created by libvirt IPI looks like this:

  <name>ocp4d-c5tvf</name>
  <uuid>18882b7e-9ac7-4089-bb28-2316ecbd2dbb</uuid>
  <forward mode='nat'>
    <nat>
      <port start='1024' end='65535'/>
    </nat>
  </forward>
  <bridge name='tt0' stp='on' delay='0'/>
  <mac address='52:54:00:27:65:df'/>
  <domain name='ocp4d.openshift.lasthome.solace.krynn' localOnly='yes'/>
  <dns enable='yes'>
    <forwarder domain='apps.ocp4d.openshift.lasthome.solace.krynn' addr='192.168.122.1'/>
    <host ip='192.168.126.12'>
      <hostname>api.ocp4d.openshift.lasthome.solace.krynn</hostname>
      <hostname>api-int.ocp4d.openshift.lasthome.solace.krynn</hostname>
    </host>
    <host ip='192.168.126.13'>
      <hostname>api.ocp4d.openshift.lasthome.solace.krynn</hostname>
      <hostname>api-int.ocp4d.openshift.lasthome.solace.krynn</hostname>
    </host>
    <host ip='192.168.126.11'>
      <hostname>api.ocp4d.openshift.lasthome.solace.krynn</hostname>
      <hostname>api-int.ocp4d.openshift.lasthome.solace.krynn</hostname>
    </host>
  </dns>
  <ip family='ipv4' address='192.168.126.1' prefix='24'>
    <dhcp>
      <range start='192.168.126.2' end='192.168.126.254'/>
      <host mac='52:54:00:b2:fb:a2' name='ocp4d-c5tvf-master-1.ocp4d.openshift.lasthome.solace.krynn' ip='192.168.126.12'/>
      <host mac='52:54:00:d2:10:54' name='ocp4d-c5tvf-bootstrap.ocp4d.openshift.lasthome.solace.krynn' ip='192.168.126.10'/>
      <host mac='52:54:00:32:97:83' name='ocp4d-c5tvf-master-2.ocp4d.openshift.lasthome.solace.krynn' ip='192.168.126.13'/>
      <host mac='52:54:00:4a:ce:da' name='ocp4d-c5tvf-master-0.ocp4d.openshift.lasthome.solace.krynn' ip='192.168.126.11'/>
    </dhcp>
  </ip>
</network>
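
As a side note, the same kind of dhcp-host update can be tried by hand from the hypervisor with virsh (the MAC/name/IP below are made-up values), which at least shows whether the host-side client and daemon agree on the net-update arguments:

$ virsh net-update ocp4d-c5tvf add ip-dhcp-host \
    "<host mac='52:54:00:00:00:01' name='net-update-test' ip='192.168.126.200'/>" \
    --live --config
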
ElCoyote27 commented 2 years ago

I'm sorry for the long dump of information... I'd dig deeper, but I am unsure where to look (I am still somewhat of an OpenShift newbie).

cfergeau commented 2 years ago

this seems somewhat similar to: digitalocean/go-libvirt#87

At first this seemed like a promising lead; however, the only user of DigitalOcean's go-libvirt is terraform-provider-libvirt, and the openshift installer uses an older version of terraform-provider-libvirt that was based not on DigitalOcean's libvirt implementation but on libvirt/libvirt-go.

I can also reproduce this. The logs from oc logs -n openshift-machine-api pods/machine-api-controllers-b8dd7845d-4knkf -c machine-controller were interesting, in that worker node creation fails in a loop because the storage volume already exists:

W1201 19:21:51.315400       1 controller.go:316] teuf-j44lc-worker-0-pwxsl: failed to create machine: teuf-j44lc-worker-0-pwxsl: error creating libvirt machine: error creating volume storage volume 'teuf-j44lc-worker-0-pwxsl' already exists                               
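
While debugging, one way to get past that loop is to remove the stale volume by hand so the next reconcile can retry the create (a sketch, assuming the pool is named after the cluster id as in the listings above):

$ virsh vol-list --pool teuf-j44lc
$ virsh vol-delete --pool teuf-j44lc teuf-j44lc-worker-0-pwxsl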

I think I've managed to capture the initial error which leaves the system in an inconsistent state:

E1201 19:11:12.624353       1 controller.go:280] teuf-j44lc-master-0: error updating machine: teuf-j44lc-master-0: error updating machine status: etcdserver: request timed out
I1201 19:11:13.625374       1 controller.go:170] teuf-j44lc-worker-0-pwxsl: reconciling Machine
I1201 19:11:13.625405       1 actuator.go:220] Checking if machine teuf-j44lc-worker-0-pwxsl exists.
I1201 19:11:13.627205       1 client.go:142] Created libvirt connection: 0xc0004b1370
I1201 19:11:13.627417       1 client.go:317] Check if "teuf-j44lc-worker-0-pwxsl" domain exists
I1201 19:11:13.627633       1 client.go:158] Freeing the client pool
I1201 19:11:13.627653       1 client.go:164] Closing libvirt connection: 0xc0004b1370
I1201 19:11:13.627946       1 controller.go:314] teuf-j44lc-worker-0-pwxsl: reconciling machine triggers idempotent create
I1201 19:11:13.627965       1 actuator.go:113] Creating machine "teuf-j44lc-worker-0-pwxsl"
I1201 19:11:13.629488       1 client.go:142] Created libvirt connection: 0xc0004b1620
I1201 19:11:13.629699       1 client.go:384] Create a libvirt volume with name teuf-j44lc-worker-0-pwxsl for pool teuf-j44lc from the base volume teuf-j44lc-base
I1201 19:11:13.797843       1 client.go:490] Volume ID: /var/lib/libvirt/openshift-images/teuf-j44lc/teuf-j44lc-worker-0-pwxsl
I1201 19:11:13.797876       1 client.go:181] Create resource libvirt_domain
I1201 19:11:13.841798       1 domain.go:155] Capabilities of host
 {XMLName:{Space: Local:capabilities} Host:{UUID:00000000-0000-0000-0000-6c626de9bb1f CPU:0xc0004cfd60 PowerManagement:0xc00058e990 IOMMU:0xc00077c310 MigrationFeatures:0xc00077c350 NUMA:0xc0003e2648 Cache:0xc0001537a0 MemoryBandwidth:<nil> SecModel:[{Name:selinux DOI:0 Labels:[{Type:kvm Value:system_u:system_r:svirt_t:s0} {Type:qemu Value:system_u:system_r:svirt_tcg_t:s0}]} {Name:dac DOI:0 Labels:[{Type:kvm Value:+107:+107} {Type:qemu Value:+107:+107}]}]} Guests:[{OSType:hvm Arch:{Name:i686 WordSize:32 Emulator:/usr/libexec/qemu-kvm Loader: Machines:[{Name:pc-i440fx-rhel7.6.0 MaxCPUs:240 Canonical:} {Name:pc MaxCPUs:240 Canonical:pc-i440fx-rhel7.6.0} {Name:pc-i440fx-rhel7.0.0 MaxCPUs:240 Canonical:} {Name:pc-i440fx-rhel7.5.0 MaxCPUs:240 Canonical:} {Name:pc-i440fx-4.2 MaxCPUs:240 Canonical:} {Name:pc-q35-rhel8.5.0 MaxCPUs:710 Canonical:} {Name:q35 MaxCPUs:710 Canonical:pc-q35-rhel8.5.0} {Name:pc-i440fx-rhel7.3.0 MaxCPUs:240 Canonical:} {Name:pc-q35-rhel8.3.0 MaxCPUs:710 Canonical:} {Name:pc-q35-rhel7.6.0 MaxCPUs:710 Canonical:} {Name:pc-i440fx-rhel7.1.0 MaxCPUs:240 Canonical:} {Name:pc-q35-rhel8.1.0 MaxCPUs:710 Canonical:} {Name:pc-q35-rhel7.4.0 MaxCPUs:710 Canonical:} {Name:pc-q35-rhel8.4.0 MaxCPUs:710 Canonical:} {Name:pc-i440fx-2.11 MaxCPUs:240 Canonical:} {Name:pc-i440fx-rhel7.4.0 MaxCPUs:240 Canonical:} {Name:pc-q35-rhel8.2.0 MaxCPUs:710 Canonical:} {Name:pc-q35-rhel7.5.0 MaxCPUs:710 Canonical:} {Name:pc-i440fx-rhel7.2.0 MaxCPUs:240 Canonical:} {Name:pc-q35-rhel8.0.0 MaxCPUs:710 Canonical:} {Name:pc-q35-rhel7.3.0 MaxCPUs:255 Canonical:}] Domains:[{Type:qemu Emulator: Machines:[]} {Type:kvm Emulator: Machines:[]}]} Features:0xc000568cc0} {OSType:hvm Arch:{Name:x86_64 WordSize:64 Emulator:/usr/libexec/qemu-kvm Loader: Machines:[{Name:pc-i440fx-rhel7.6.0 MaxCPUs:240 Canonical:} {Name:pc MaxCPUs:240 Canonical:pc-i440fx-rhel7.6.0} {Name:pc-i440fx-rhel7.0.0 MaxCPUs:240 Canonical:} {Name:pc-i440fx-rhel7.5.0 MaxCPUs:240 Canonical:} {Name:pc-i440fx-4.2 MaxCPUs:240 Canonical:} {Name:pc-q35-rhel8.5.0 MaxCPUs:710 Canonical:} {Name:q35 MaxCPUs:710 Canonical:pc-q35-rhel8.5.0} {Name:pc-i440fx-rhel7.3.0 MaxCPUs:240 Canonical:} {Name:pc-q35-rhel8.3.0 MaxCPUs:710 Canonical:} {Name:pc-q35-rhel7.6.0 MaxCPUs:710 Canonical:} {Name:pc-i440fx-rhel7.1.0 MaxCPUs:240 Canonical:} {Name:pc-q35-rhel8.1.0 MaxCPUs:710 Canonical:} {Name:pc-q35-rhel7.4.0 MaxCPUs:710 Canonical:} {Name:pc-q35-rhel8.4.0 MaxCPUs:710 Canonical:} {Name:pc-i440fx-2.11 MaxCPUs:240 Canonical:} {Name:pc-i440fx-rhel7.4.0 MaxCPUs:240 Canonical:} {Name:pc-q35-rhel8.2.0 MaxCPUs:710 Canonical:} {Name:pc-q35-rhel7.5.0 MaxCPUs:710 Canonical:} {Name:pc-i440fx-rhel7.2.0 MaxCPUs:240 Canonical:} {Name:pc-q35-rhel8.0.0 MaxCPUs:710 Canonical:} {Name:pc-q35-rhel7.3.0 MaxCPUs:255 Canonical:}] Domains:[{Type:qemu Emulator: Machines:[]} {Type:kvm Emulator: Machines:[]}]} Features:0xc0000eebc0}]}
I1201 19:11:13.841898       1 domain.go:161] Checking for x86_64/hvm against i686/hvm
I1201 19:11:13.841905       1 domain.go:161] Checking for x86_64/hvm against x86_64/hvm
I1201 19:11:13.841919       1 domain.go:163] Found 21 machines in guest for x86_64/hvm
I1201 19:11:13.841925       1 domain.go:171] Get machine name
I1201 19:11:13.841929       1 domain.go:161] Checking for x86_64/hvm against i686/hvm
I1201 19:11:13.841933       1 domain.go:161] Checking for x86_64/hvm against x86_64/hvm
I1201 19:11:13.841936       1 domain.go:163] Found 21 machines in guest for x86_64/hvm
I1201 19:11:13.857638       1 client.go:199] Create volume
I1201 19:11:13.858038       1 domain.go:309] Getting disk volume
I1201 19:11:13.858360       1 domain.go:315] Constructing domain disk source
I1201 19:11:13.858378       1 client.go:209] Create ignition configuration
I1201 19:11:13.858384       1 ignition.go:20] Creating ignition file
E1201 19:11:17.465411       1 leaderelection.go:325] error retrieving resource lock openshift-machine-api/cluster-api-provider-libvirt-leader: Get "https://172.30.0.1:443/api/v1/namespaces/openshift-machine-api/configmaps/cluster-api-provider-libvirt-leader": context deadline exceeded
I1201 19:11:17.465492       1 leaderelection.go:278] failed to renew lease openshift-machine-api/cluster-api-provider-libvirt-leader: timed out waiting for the condition
F1201 19:11:17.465540       1 main.go:127] leader election lost

Not sure why the leader election is lost/why the requests to etcdserver time out.

cfergeau commented 2 years ago

The above trace might be a different, possibly unrelated issue. I upgraded the installer to github.com/dmacvicar/terraform-provider-libvirt v0.6.9, and the errors I now get in the logs are more useful:

I1203 15:46:42.970874       1 domain.go:161] Checking for x86_64/hvm against i686/hvm
I1203 15:46:42.970887       1 domain.go:161] Checking for x86_64/hvm against x86_64/hvm
I1203 15:46:42.970892       1 domain.go:163] Found 21 machines in guest for x86_64/hvm
I1203 15:46:42.970897       1 domain.go:171] Get machine name
I1203 15:46:42.970902       1 domain.go:161] Checking for x86_64/hvm against i686/hvm
I1203 15:46:42.970906       1 domain.go:161] Checking for x86_64/hvm against x86_64/hvm
I1203 15:46:42.970910       1 domain.go:163] Found 21 machines in guest for x86_64/hvm
I1203 15:46:42.974066       1 client.go:199] Create volume
I1203 15:46:42.974363       1 domain.go:309] Getting disk volume
I1203 15:46:42.974571       1 domain.go:315] Constructing domain disk source
I1203 15:46:42.974593       1 client.go:209] Create ignition configuration
I1203 15:46:42.974617       1 ignition.go:20] Creating ignition file
I1203 15:46:42.977522       1 ignition.go:40] Ignition: {Name:teuf-lnswk-worker-0-75x94.ignition PoolName:teuf-lnswk Content:{"ignition":{"config":{"merge":[{"source":"https://api-int.teuf.tt.testing:22623/config/worker"}]},"security":{"tls":{"certificateAuthorities":[{"source":"data:text/plain;charset=utf-8;base64,LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSURFRENDQWZpZ0F3SUJBZ0lJVHJvQ3hoQThUb3d3RFFZSktvWklodmNOQVFFTEJRQXdKakVTTUJBR0ExVUUKQ3hNSmIzQmxibk5vYVdaME1SQXdEZ1lEVlFRREV3ZHliMjkwTFdOaE1CNFhEVEl4TVRJd016RTBNemd5TmxvWApEVE14TVRJd01URTBNemd5Tmxvd0pqRVNNQkFHQTFVRUN4TUpiM0JsYm5Ob2FXWjBNUkF3RGdZRFZRUURFd2R5CmIyOTBMV05oTUlJQklqQU5CZ2txaGtpRzl3MEJBUUVGQUFPQ0FROEFNSUlCQ2dLQ0FRRUExRWprL1VFWXdzVWEKczF2SElUTVdtWXNaaURINGhTMG5yTmN3QmsxWlNnMnEwZFRPcS9tdC9Bc2x4b0Y5dzVyZWFXUCtuaUlYdG00bQpLUWdUUVg1UXUxRzVmZjlKeVFNYXllNEN1S201bmY5cVlhVmtGQno1K0F0cHZhUUd3dUVrZUtWT3FBbFdnV1pDCmp4SWRpYTdxSEwzU1FNL0hoMmc4TFlRWXNRamMwdlAwa0FJSHVVYTUwR2YyQnJEZ1VUSURmMHpzdkFrQnhXeTMKZE0wcXd0cUZKMWtqcUhMb2grREpzRlBpUEg1OHpBMTY2Y3ZoTjA0VldFNTlkZ2k1ZkFweHpUaVBEV0p2OUEzVQo4NGhieWFPc3dpZGJnQlNSMDcyYmJzTGo5UXc2cy9rVHliTTBZWkdyTnRMcjJlaUdCbmtLalFHeGZPdjJKUEt1ClppcjVtNmtpSndJREFRQUJvMEl3UURBT0JnTlZIUThCQWY4RUJBTUNBcVF3RHdZRFZSMFRBUUgvQkFVd0F3RUIKL3pBZEJnTlZIUTRFRmdRVTZjbUE4UUZzMTZTQStjRHREQTlBbGkxNlBJb3dEUVlKS29aSWh2Y05BUUVMQlFBRApnZ0VCQUNhRHVSQ2k5R2ZJcTZ2am5xdnVsWmt2cVRyQkhIOU10K1pYalBhSGFIa0Z3M2FtaHFsU0N0WHUwYUcvCitPWmRSOUV5bDJFVkt2Tk4yeXQ3eVRuZTVkY2JTVVA0eFYvcUxSNStIdW1NUFdQbjY0ZnVGY0F6SU5RemR4YWQKMzNHOVpkVHR5eGdJc2ZZUzRQWTNiWEo2U2dLaEVKejJPMXh1UVpVYlNzQldweFljbDR0VFhna3RsVllrMlNGWgowdlZtZHJRQTZ4bllkdnd4L2tPMHplL1R3R1RxazNUZHJQOUNHN29xeUFickJ4V1dLU29TRG96aXQ2RndtNXpWCjhQL3dHeHRyNUN1NGpZN3lSNXVxcXNFaFFDc2hnMFUzTk13dzMyME9Hd2Nhdzl0WUxMaVBSdnQzcE16TjVVYnEKdGdTa1ZLWXVsVWpVN0N6MmVCN3FOYmZCbjRBPQotLS0tLUVORCBDRVJUSUZJQ0FURS0tLS0tCg=="}]}},"version":"3.2.0"}}}
I1203 15:46:42.977554       1 ignition.go:102] Creating Ignition temporary file
I1203 15:46:43.062709       1 volume.go:190] 1717 bytes uploaded
I1203 15:46:43.063125       1 volume.go:160] Volume ID: /var/lib/libvirt/openshift-images/teuf-lnswk/teuf-lnswk-worker-0-75x94.ignition
I1203 15:46:43.063300       1 client.go:237] Set up network interface
I1203 15:46:43.064095       1 domain.go:391] Networkaddress: 192.168.126.0/24
I1203 15:46:43.064132       1 domain.go:417] Adding IP/MAC/host=192.168.126.70/7e:75:1a:57:b9:5b/teuf-lnswk-worker-0-75x94 to teuf-lnswk
I1203 15:46:43.064183       1 network.go:99] Updating host with XML:
  <host mac="7e:75:1a:57:b9:5b" name="teuf-lnswk-worker-0-75x94" ip="192.168.126.70"></host>
I1203 15:46:43.065266       1 client.go:496] Check if "teuf-lnswk-worker-0-75x94" volume exists
I1203 15:46:43.065589       1 client.go:533] Deleting volume teuf-lnswk-worker-0-75x94
I1203 15:46:43.066846       1 client.go:496] Check if "teuf-lnswk-worker-0-75x94_cloud-init" volume exists
I1203 15:46:43.067312       1 client.go:530] Volume teuf-lnswk-worker-0-75x94_cloud-init does not exists
I1203 15:46:43.067327       1 client.go:496] Check if "teuf-lnswk-worker-0-75x94.ignition" volume exists
I1203 15:46:43.067550       1 client.go:533] Deleting volume teuf-lnswk-worker-0-75x94.ignition
E1203 15:46:43.068690       1 actuator.go:107] Machine error: error creating domain virError(Code=84, Domain=19, Message='Operation not supported: can't update 'bridge' section of network 'teuf-lnswk'')
E1203 15:46:43.068710       1 actuator.go:51] teuf-lnswk-worker-0-75x94: error creating libvirt machine: error creating domain virError(Code=84, Domain=19, Message='Operation not supported: can't update 'bridge' section of network 'teuf-lnswk'')
I1203 15:46:43.068717       1 client.go:158] Freeing the client pool
I1203 15:46:43.068727       1 client.go:164] Closing libvirt connection: 0xc0005b3bd0
W1203 15:46:43.069026       1 controller.go:316] teuf-lnswk-worker-0-75x94: failed to create machine: teuf-lnswk-worker-0-75x94: error creating libvirt machine: error creating domain virError(Code=84, Domain=19, Message='Operation not supported: can't update 'bridge' section of network 'teuf-lnswk'')

My main reason for upgrading the terraform libvirt provider was to make sure I get a version which is not impacted by https://github.com/digitalocean/go-libvirt/issues/87

This error in the log matches the messages mentioned earlier which show up in the system's logs:

Nov 25 16:52:36 daltigoth libvirtd[1099001]: Operation not supported: can't update 'bridge' section of network 'ocp4d-c5tvf'

And actually, your initial hunch that it's related to https://github.com/digitalocean/go-libvirt/issues/87 is most likely correct. The libvirt version in virt:av is

$ rpm -q libvirt-daemon
libvirt-daemon-7.6.0-6.module+el8.5.0+13051+7ddbe958.x86_64

What matters on the host is the daemon version. What I missed initially is that the libvirt client library involved in all of this is not the one running on the host, but the one used by the libvirt machine-api-controller, which runs in a container on the cluster!

$ oc rsh -c machine-controller -n openshift-machine-api pods/machine-api-controllers-b8dd7845d-b72b7
$ rpm -qa |grep libvirt
libvirt-libs-6.0.0-35.1.module+el8.4.0+11273+64eb94ef.x86_64

https://github.com/digitalocean/go-libvirt/issues/87 is related to https://listman.redhat.com/archives/libvir-list/2021-March/msg00760.html, which landed in:

$ git describe --contains ':/lib: Fix calling of virNetworkUpdate'
v7.2.0-rc1~9

The commit log says:

But, to maintain compatibility with older, yet unfixed, daemons new connection feature is introduced. The feature is detected just before calling the callback and allows client to pass arguments in correct order (talking to fixed daemon) or in reversed order (talking to older daemon).

Unfortunately, older client talking to newer daemon can't be fixed. Let's hope that it's less frequent scenario.

In our case, we are exactly in that scenario: the client library used by the libvirt machine-api-controller is older than this commit and the daemon is newer than this commit, so the arguments to virNetworkUpdate calls are going to be swapped, which corresponds exactly to what is being reported in the logs. This can be double-checked by looking at the cluster-api-provider-libvirt sources; the failing call is:

return n.Update(libvirt.NETWORK_UPDATE_COMMAND_MODIFY, libvirt.NETWORK_SECTION_IP_DHCP_HOST, -1, xmlDesc, libvirt.NETWORK_UPDATE_AFFECT_CURRENT)

The value of libvirt.NETWORK_UPDATE_COMMAND_MODIFY is 1, and so is the value of libvirt.NETWORK_SECTION_BRIDGE, so the error message Operation not supported: can't update 'bridge' section of network 'ocp4d-c5tvf' is consistent with this theory.
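
A quick way to double-check those enum values, assuming libvirt-devel is installed somewhere:

$ grep -E 'VIR_NETWORK_(UPDATE_COMMAND_MODIFY|SECTION_BRIDGE|SECTION_IP_DHCP_HOST)' /usr/include/libvirt/libvirt-network.h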

The conclusion is that we need a version of the libvirt client libraries with this fix (https://listman.redhat.com/archives/libvir-list/2021-March/msg00760.html) in our cluster-api-provider-libvirt container image. We need to check the status of this fix in the RHEL packages that will get picked up in newer builds of this image.
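
One way to check whether a given libvirt-libs build already carries the backport (assuming the fix shows up in the package changelog) is something like:

$ rpm -q --changelog libvirt-libs | grep -i -B2 -A2 'virNetworkUpdate'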

cfergeau commented 2 years ago

Filed https://bugzilla.redhat.com/show_bug.cgi?id=2029380 to try to get the fix backported to the RHEL libvirt client packages.

briantward commented 2 years ago

For those on a more recent hypervisor with libvirt 7.6 who want a hack to move forward while waiting for fixes to the older libvirt library still shipped in RHEL to make their way into the cluster-api-provider-libvirt image, here is a workaround (I'm not sure whether this would help if your hypervisor is RHEL with the virt:av stream).

I ran into this exact scenario using Fedora 35 as my hypervisor while attempting to install an OKD 4.9 cluster.

During the openshift-install process, you can replace the machine-controller image in your cluster build by doing the following, after the MAO has created the necessary resources:

  1. Scale down the CVO and the MAO to prevent your changes from being reverted.

    oc scale -n openshift-cluster-version deployments/cluster-version-operator --replicas=0 
    oc scale -n openshift-machine-api deployment/machine-api-operator --replicas=0
  2. Patch the machine-api-controllers deployment with a newer image that has the 7.6 libvirt-libs package. Note that I built this particular image from source earlier tonight, using the Fedora 35 base. You may wish to use your own build.

oc -n openshift-machine-api patch deployment machine-api-controllers --patch '{"spec": {"template": {"spec": {"containers": [{"name": "machine-controller","image": "quay.io/bward/bward-cluster-api-provider-libvirt"}]}}}}'

This gets the cluster moving and working. Obviously, if you want other features and are done with provisioning machines, you should re-enable the CVO.
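
For completeness, re-enabling the operators is just the reverse of the scale-down above; keep in mind that once the MAO reconciles it will revert the patched machine-api-controllers image:

    oc scale -n openshift-cluster-version deployments/cluster-version-operator --replicas=1
    oc scale -n openshift-machine-api deployment/machine-api-operator --replicas=1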

ElCoyote27 commented 2 years ago

This is a great tip, Brian. Thank you for sharing. Is there a Dockerfile somewhere for that modified quay.io/bward/bward-cluster-api-provider-libvirt image? (I'm sorry, I am still a n00b at OCP) :)

briantward commented 2 years ago

@ElCoyote27 the Dockerfile in this repo currently uses Fedora 35 as a base image, so it picks up the recent 7.6 libvirt libs. That is what I built from.

https://github.com/openshift/cluster-api-provider-libvirt/blob/master/Dockerfile#L6
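
If you would rather build and host your own copy instead of pulling mine, a minimal sketch (registry and tag are placeholders; the repo may also have Makefile targets for this):

$ git clone https://github.com/openshift/cluster-api-provider-libvirt
$ cd cluster-api-provider-libvirt
$ podman build -t quay.io/<your-org>/cluster-api-provider-libvirt:latest .
$ podman push quay.io/<your-org>/cluster-api-provider-libvirt:latest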

ElCoyote27 commented 2 years ago

Since using libvirt IPI requires rebuilding the installer (see https://github.com/luisarizmendi/ocp-libvirt-ipi-role ), perhaps we could simply patch go.mod and go.sum to reference a newer build instead:

https://github.com/openshift/installer/blob/master/go.mod#L45 and https://github.com/openshift/installer/blob/master/go.sum#L1080-L1081

This would allow us to keep using arbitrary older revisions of OCP and would not require intercepting the MAO and CVO when launching a new cluster.
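
A rough, untested sketch of what that could look like in the installer checkout (whether the newer provider is API-compatible with the older installer code is an open question):

$ cd {{ kvm_workdir }}/go/src/github.com/openshift/installer
$ go mod edit -require=github.com/dmacvicar/terraform-provider-libvirt@v0.6.9
$ go mod tidy && go mod vendor   # the installer vendors its dependencies
$ TAGS=libvirt hack/build.sh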

openshift-bot commented 2 years ago

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

openshift-bot commented 2 years ago

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity. Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten /remove-lifecycle stale

openshift-bot commented 2 years ago

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen. Mark the issue as fresh by commenting /remove-lifecycle rotten. Exclude this issue from closing again by commenting /lifecycle frozen.

/close

openshift-ci[bot] commented 2 years ago

@openshift-bot: Closing this issue.

In response to [this](https://github.com/openshift/cluster-api-provider-libvirt/issues/231#issuecomment-1206134957):

> Rotten issues close after 30d of inactivity.
>
> Reopen the issue by commenting `/reopen`.
> Mark the issue as fresh by commenting `/remove-lifecycle rotten`.
> Exclude this issue from closing again by commenting `/lifecycle frozen`.
>
> /close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.