okd-project / okd

The self-managing, auto-upgrading, Kubernetes distribution for everyone
https://okd.io
Apache License 2.0

4.6.0-0.okd-2020-12-21-142926 revision-pruner pods stuck in ContainerCreating state (clean bare metal installation) #451

Closed danielchristianschroeter closed 3 years ago

danielchristianschroeter commented 3 years ago

Describe the bug

Some revision-pruner pods in the namespaces openshift-etcd, openshift-kube-apiserver, openshift-kube-scheduler and openshift-kube-controller-manager are stuck in the ContainerCreating state after a clean bare metal installation with 4.6.0-0.okd-2020-12-21-142926 and FCOS 33.20201214.2.0.

I see these error events in the affected pods:

error while creating logical port openshift-kube-controller-manager_revision-pruner-6-k8s-master-2-01.okd.basedomain.com error: connection is shut down

Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_revision-pruner-6-k8s-master-2-01.okd.basedomain.com_openshift-kube-controller-manager_1bdb3378-d92a-49bc-9b70-8c7cb09f21cb_0(4be998a739eadbdcd57d6355b81a0d33440a4614242f58a9f0e6c36ae48c9def): [openshift-kube-controller-manager/revision-pruner-6-k8s-master-2-01.okd.basedomain.com:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[openshift-kube-controller-manager/revision-pruner-6-k8s-master-2-01.okd.basedomain.com] failed to configure pod interface: timed out waiting for pod flows for pod: revision-pruner-6-k8s-master-2-01.okd.basedomain.com, error: timed out waiting for the condition '

I also tried restarting master-2, but this did not change anything. All cluster operators show available = true in oc get co. For now I deleted the stuck pods. This seems to work as a workaround, but I don't think deleting them is a real solution...
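The workaround above (finding and deleting the stuck pods) could be scripted roughly as follows. This is only a sketch: `list_stuck_pods` is a hypothetical helper name, and it assumes the default column layout of `oc get pods -A --no-headers` (NAMESPACE, NAME, READY, STATUS, RESTARTS, AGE).

```shell
#!/usr/bin/env bash
# list_stuck_pods (hypothetical helper): reads the output of
# `oc get pods -A --no-headers` on stdin and prints "namespace pod-name"
# for every pod whose STATUS column (4th field) is ContainerCreating.
list_stuck_pods() {
  awk '$4 == "ContainerCreating" { print $1, $2 }'
}

# Usage (commented out so the sketch has no side effects when sourced):
#   oc get pods -A --no-headers | list_stuck_pods
#
# Deleting the reported pods, which is the workaround described above:
#   oc get pods -A --no-headers | list_stuck_pods |
#     while read -r ns pod; do oc delete pod -n "$ns" "$pod"; done
```

Deleting the pods only clears the symptom; it does not explain why the OVN logical port could not be created in the first place.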

Version

4.6.0-0.okd-2020-12-21-142926 with FCOS 33.20201214.2.0 (bare metal installation with VMs on VMware ESXi 6.7)

How reproducible

  1. oc adm release extract --tools registry.svc.ci.openshift.org/origin/release@sha256:068a04c84d0ef8d6325a37497da3d69152104ea357db29498378ad44760042f5

  2. Create install-config.yaml in install_dir

    apiVersion: v1
    baseDomain: basedomain.com
    proxy:
      httpProxy: http://proxy-01.***:3128
      httpsProxy: http://proxy-01.***:3128
      noProxy: localhost,127.0.0.0/8,::1/128,***,basedomain.com,okd.basedomain.com,10.1.232.0/24
    additionalTrustBundle: |
      -----BEGIN CERTIFICATE-----
      ***
      -----END CERTIFICATE-----
    compute:
    - hyperthreading: Enabled
      name: worker
      replicas: 0
    controlPlane:
      hyperthreading: Enabled
      name: master
      replicas: 3
    metadata:
      name: okd
    networking:
      clusterNetwork:
      - cidr: 10.200.0.0/16
        hostPrefix: 21
      networkType: OVNKubernetes
      serviceNetwork:
      - 172.30.0.0/16
    platform:
      none: {}
    fips: false
    pullSecret: '{"auths":{"fake":{"auth": "bar"}}}'
    sshKey: 'ssh-ed25519 AAAA***
  3. Create the ignition files, then upload them to an HTTP server

    ./openshift-install create manifests --dir=install_dir/
    ./openshift-install create ignition-configs --dir=install_dir/

  4. Create new VMs (1x bootstrap, 3x master) and boot from .iso https://builds.coreos.fedoraproject.org/prod/streams/testing/builds/33.20201214.2.0/x86_64/fedora-coreos-33.20201214.2.0-live.x86_64.iso

  5. Verify that all DNS records exist and that the IPs of the bootstrap and master machines are added to the related load balancer pools for the required ports (bootstrap and master for ports 22623 and 6443; master for ports 443 and 80).

    api-int.okd.basedomain.com > LB-IP
    api.okd.basedomain.com > CNAME to api-int.okd.basedomain.com (LB)
    *.apps.okd.basedomain.com > CNAME to api-int.okd.basedomain.com (LB)

  6. Start coreos-installer with the following parameters (I added append-karg to work around issue #394)

    sudo coreos-installer install /dev/sda --insecure-ignition --copy-network --ignition-url http://httpserverdomain.com/bootstrap.ign --append-karg="ip=10.1.232.57::10.1.232.1:255.255.255.0:k8s-bootstrap-1-01.okd.basedomain.com:ens160:none:10.1.231.85:10.1.231.5"
    sudo coreos-installer install /dev/sda --insecure-ignition --copy-network --ignition-url http://httpserverdomain.com/master.ign --append-karg="ip=10.1.232.191::10.1.232.1:255.255.255.0:k8s-master-1-01.okd.basedomain.com:ens160:none:10.1.231.85:10.1.231.5"
    sudo coreos-installer install /dev/sda --insecure-ignition --copy-network --ignition-url http://httpserverdomain.com/master.ign --append-karg="ip=10.1.232.13::10.1.232.1:255.255.255.0:k8s-master-2-01.okd.basedomain.com:ens160:none:10.1.231.85:10.1.231.5"
    sudo coreos-installer install /dev/sda --insecure-ignition --copy-network --ignition-url http://httpserverdomain.com/master.ign --append-karg="ip=10.1.232.158::10.1.232.1:255.255.255.0:k8s-master-3-01.okd.basedomain.com:ens160:none:10.1.231.85:10.1.231.5"

  7. After ./openshift-install --dir=install_dir/ wait-for bootstrap-complete --log-level=info reports success, wait a few hours.

Log bundle

log-bundle and must-gather can be downloaded here: https://drive.google.com/file/d/1jU3XRf3Si-4Ro5i3Nqi6CiBfkBHkFSw4/view?usp=sharing

danielchristianschroeter commented 3 years ago

I finally reinstalled the OKD cluster with release 4.6.0-0.okd-2021-01-17-185703. The most important part of a bare metal installation is that you start your bootstrap and master machines at more or less the same time. If you have to type the coreos-installer command with append-karg by hand (because you are not able to copy and paste it), the installation process can take too long...
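One way to reduce the typing is to generate the per-node commands ahead of time. The sketch below prints the same coreos-installer commands as in the reproduction steps; `build_install_cmd` is a hypothetical helper, and the gateway, netmask, interface, DNS servers and ignition URL are simply the values from the original report, so they would need adjusting for another environment.

```shell
#!/usr/bin/env bash
# Network values copied from the reproduction steps (assumptions for any other setup).
GATEWAY="10.1.232.1"
NETMASK="255.255.255.0"
IFACE="ens160"
DNS1="10.1.231.85"
DNS2="10.1.231.5"
IGN_BASE="http://httpserverdomain.com"

# build_install_cmd <ip> <hostname> <ignition-file>  (hypothetical helper)
# Prints the installer command for one node; it does not run anything.
build_install_cmd() {
  local ip="$1" host="$2" ign="$3"
  # Kernel argument layout: ip=<ip>::<gateway>:<netmask>:<hostname>:<iface>:none:<dns1>:<dns2>
  local karg="ip=${ip}::${GATEWAY}:${NETMASK}:${host}:${IFACE}:none:${DNS1}:${DNS2}"
  printf 'sudo coreos-installer install /dev/sda --insecure-ignition --copy-network --ignition-url %s/%s --append-karg="%s"\n' \
    "$IGN_BASE" "$ign" "$karg"
}

# One line per machine, matching the four nodes in the report.
build_install_cmd 10.1.232.57  k8s-bootstrap-1-01.okd.basedomain.com bootstrap.ign
build_install_cmd 10.1.232.191 k8s-master-1-01.okd.basedomain.com    master.ign
build_install_cmd 10.1.232.13  k8s-master-2-01.okd.basedomain.com    master.ign
build_install_cmd 10.1.232.158 k8s-master-3-01.okd.basedomain.com    master.ign
```

The output could be saved to a file served from the same HTTP server, so that each console only needs a short download command instead of the full line.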