okd-project / okd

The self-managing, auto-upgrading, Kubernetes distribution for everyone
https://okd.io
Apache License 2.0

Installation with agent installer or assisted installer with UPI on baremetal fails for v4.16.0-0.okd-scos-2024-08-21-155613 #2018

Open titou10titou10 opened 3 weeks ago

titou10titou10 commented 3 weeks ago

Context

Trying to install a cluster (3 masters + 2 workers):

It is important to note that the install works perfectly well with the exact same agent and install config files for

Summary

It fails with the following error from the "release-image-pivot" service:

okd5-master1 bootstrap-pivot.sh[25771]: error: Remounting /sysroot read-write: Permission denied

The cause of the problem is the OS image used as bootstrap: fedora-coreos-39.20231101.3.0-live.x86_64.iso

Details

All the details, with debug info and configuration files, are described in this discussion. The logs and other details there are for v4.16.0-0.okd-scos-2024-08-01-132038, but they are the same for v4.16.0-0.okd-scos-2024-08-21-155613.

Workarounds

Overriding the bootstrap OS image with an RHCOS image makes the installation succeed.

I did not choose a random bootstrap OS image: this is the one specified for a v4.16 OCP installation via the ABI, as shown here: https://github.com/openshift/assisted-service/blob/d3324b06a7c7772f4619c3ab13dd8c0706e55fd9/deploy/podman/configmap.yml#L25

It's probably possible to use another RHCOS image, since during the install process the nodes upgrade to version 418.9.202408211033-0:

rpm-ostree status
State: idle
Deployments:
● ostree-unverified-registry:quay.io/okd/scos-content@sha256:3f4ca57e8ec68fb5a8ba5e2461c69162e211adba667dac299baf58ccf7923dad
                   Digest: sha256:3f4ca57e8ec68fb5a8ba5e2461c69162e211adba667dac299baf58ccf7923dad
                  Version: 418.9.202408211033-0 (2024-08-21T10:39:04Z)

Workaround for an Agent Installer (ABI) successful install:

Before building the ISO image, override the bootstrap OS image like this:

export OPENSHIFT_INSTALL_OS_IMAGE_OVERRIDE=https://mirror.openshift.com/pub/openshift-v4/x86_64/dependencies/rhcos/4.16/4.16.3/rhcos-4.16.3-x86_64-live.x86_64.iso
oc adm release extract --command=openshift-install quay.io/okd/scos-release:4.16.0-0.okd-scos-2024-08-21-155613
./openshift-install agent create image --dir install --log-level=debug
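Because the override is only an environment variable, it is easy to build the ISO from a shell where it is not set. A minimal guard could be run before `agent create image`. This is a sketch: `require_os_image_override` is a hypothetical helper name (not part of openshift-install), and it assumes the override is an http(s) URL ending in `.iso`, as in the example above.

```shell
# Hypothetical helper (not part of openshift-install): fail early when the
# bootstrap OS image override is missing or does not look like an ISO URL.
# Assumption: the override is an http(s) URL ending in ".iso".
require_os_image_override() {
  case "${OPENSHIFT_INSTALL_OS_IMAGE_OVERRIDE:-}" in
    "")
      echo "OPENSHIFT_INSTALL_OS_IMAGE_OVERRIDE is not set" >&2
      return 1
      ;;
    http://*.iso|https://*.iso)
      echo "using bootstrap OS image: ${OPENSHIFT_INSTALL_OS_IMAGE_OVERRIDE}"
      ;;
    *)
      echo "override does not look like an ISO URL: ${OPENSHIFT_INSTALL_OS_IMAGE_OVERRIDE}" >&2
      return 1
      ;;
  esac
}
```

It could then be chained in front of the build step, for example `require_os_image_override && ./openshift-install agent create image --dir install --log-level=debug`.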

Workaround for an Assisted Installer successful install:

The procedure is described here: https://github.com/openshift/assisted-service/tree/master/deploy/podman

In the okd-configmap.yml file, replace (at least) the following variables:

OS_IMAGES: '[{"openshift_version":"4.16","cpu_architecture":"x86_64","url":"https://mirror.openshift.com/pub/openshift-v4/x86_64/dependencies/rhcos/4.16/4.16.3/rhcos-4.16.3-x86_64-live.x86_64.iso","version":"416.94.202406251923-0"}]'
RELEASE_IMAGES: '[{"openshift_version":"4.16","cpu_architecture":"x86_64","cpu_architectures":["x86_64"],"url":"quay.io/okd/scos-release:4.16.0-0.okd-scos-2024-08-21-155613","version":"4.16.0-0.okd-scos-2024-08-21-155613","default":true,"support_level":"beta"}]'
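Both values are single-line JSON strings, so a typo in a URL is easy to miss. As a quick sanity check, the following sketch lists the `url` fields of such a string. `list_image_urls` is a hypothetical helper name, and it assumes compact JSON with no whitespace around the colon, as in the values above.

```shell
# Hypothetical helper: print every "url" value found in a compact JSON string.
# Assumes no whitespace around ':' and no escaped quotes inside the URLs.
list_image_urls() {
  printf '%s\n' "$1" | grep -o '"url":"[^"]*"' | cut -d'"' -f4
}

OS_IMAGES='[{"openshift_version":"4.16","cpu_architecture":"x86_64","url":"https://mirror.openshift.com/pub/openshift-v4/x86_64/dependencies/rhcos/4.16/4.16.3/rhcos-4.16.3-x86_64-live.x86_64.iso","version":"416.94.202406251923-0"}]'
list_image_urls "$OS_IMAGES"
```

The same function can be pointed at the RELEASE_IMAGES value to confirm which release image the service will use.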

0xHexE commented 3 weeks ago

Hi @titou10titou10,

I tried the workaround, but I think RHEL is missing zincati. The container quay.io/okd/scos-content@sha256:cb68498aceefa81f105c4ce6c74787c3e1281d141725b0e20df555aa549dc5aa exits with:

Error msg: error running preset on unit: Failed to preset unit: Unit file zincati.service does not exist.\n)\nI0825 06:38:53.624260    6508 file_writers.go:293] Writing systemd unit \"install-to-disk.service\"\n"

and the installation is stuck at "Installing: bootstrap". I even tried creating a dummy zincati.service, but it still fails.

0xHexE commented 3 weeks ago

I spoke too soon.

It took some hours for the progress to be reflected in the console. It turns out that zincati is not required.

The bootkube commands also take a while, and while they are running they do not produce any logs in systemctl or change status.

There was one issue, though: I had to run this code to fix the network (I am setting up a single-node installation):

cat << EOF | tee /etc/kubernetes/cni/net.d/10-containerd-net.conflist
{
 "cniVersion": "1.0.0",
 "name": "containerd-net",
 "plugins": [
   {
     "type": "bridge",
     "bridge": "cni0",
     "isGateway": true,
     "ipMasq": true,
     "promiscMode": true,
     "ipam": {
       "type": "host-local",
       "ranges": [
         [{
           "subnet": "10.128.0.0/14"
         }]
       ],
       "routes": [
         { "dst": "0.0.0.0/0" },
         { "dst": "::/0" }
       ]
     }
   },
   {
     "type": "portmap",
     "capabilities": {"portMappings": true},
     "externalSetMarkChain": "KUBE-MARK-MASQ"
   }
 ]
}
EOF

Ref: https://github.com/okd-project/okd/issues/1966

titou10titou10 commented 3 weeks ago

I'm not sure what exactly your code is doing, but maybe you are not aware that "extra" manifests can be added before the creation of the ISO image. Inside the directory where you put the install-config and agent-config files, create an "openshift" directory and add the additional manifests there:

Refs:

This page seems related to what you are doing, and maybe you can create a manifest with it and put it under the install/openshift directory?

In my install, I have this extra "network-03-config.yaml" manifest file in install/openshift:

apiVersion: operator.openshift.io/v1
kind: Network
metadata:
  name: cluster
spec:
  defaultNetwork:
    ovnKubernetesConfig:
      genevePort: 6082
      # not necessary, as OKD detects the underlying MTU and sets the value to 9000-100 by itself
      mtu: 8900
      ipsecConfig:
        mode: Disabled
      ipv4:
        internalJoinSubnet: 100.65.0.0/16
        internalTransitSwitchSubnet: 100.89.0.0/16
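The directory layout described in this comment can be sketched as follows. The `install` directory and the `network-03-config.yaml` file name come from the comment itself; `stage_extra_manifest` is a hypothetical helper, not an openshift-install command.

```shell
# Hypothetical helper: copy an extra manifest into <install-dir>/openshift,
# creating the directory if needed.
# Usage: stage_extra_manifest <install-dir> <manifest-file>
stage_extra_manifest() {
  install_dir=$1
  manifest=$2
  # Refuse to continue if the manifest file does not exist.
  [ -f "$manifest" ] || { echo "manifest not found: $manifest" >&2; return 1; }
  mkdir -p "$install_dir/openshift"
  cp "$manifest" "$install_dir/openshift/"
}
```

For example, `stage_extra_manifest install network-03-config.yaml` would be run before `./openshift-install agent create image --dir install`.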

0xHexE commented 3 weeks ago

When I booted the OKD control plane node for the first time, the network plugin was not configured. In journalctl I had a log saying "No CNI configuration file in /etc/kubernetes/cni/net.d/. Has your network provider started?", so I created that file manually. I had a dual-stack configuration; maybe that caused it. I am installing it again, so let's see if I get the same issue. I think this was caused by some bug.

After some time I restarted the server (a couple of times, actually), and after that OVN was not working at all, so I am trying to reinstall. I had some issues in my network, which I have resolved; let's see if it works or not.

0xHexE commented 3 weeks ago

I was too impatient: it took some time, and then the "No CNI configuration file in /etc/kubernetes/cni/net.d/. Has your network provider started?" message went away.
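Rather than writing a CNI configuration by hand, one option is simply to wait for the network provider to create it. A minimal sketch, assuming the directory from the log message and the `.conf`, `.conflist` and `.json` extensions; `wait_for_cni_config`, its default timeout and its polling interval are hypothetical choices:

```shell
# Hypothetical helper: poll a CNI config directory until a config file appears.
# Usage: wait_for_cni_config [dir] [timeout_seconds]
# Assumption: config files end in .conf, .conflist or .json.
wait_for_cni_config() {
  dir=${1:-/etc/kubernetes/cni/net.d}
  timeout=${2:-600}
  elapsed=0
  while [ "$elapsed" -lt "$timeout" ]; do
    for f in "$dir"/*.conf "$dir"/*.conflist "$dir"/*.json; do
      # An unmatched glob stays literal, so test that the file really exists.
      if [ -f "$f" ]; then
        echo "found CNI config: $f"
        return 0
      fi
    done
    sleep 5
    elapsed=$((elapsed + 5))
  done
  echo "no CNI config in $dir after ${timeout}s" >&2
  return 1
}
```

Called with no arguments, it checks /etc/kubernetes/cni/net.d every 5 seconds for up to 10 minutes and returns non-zero if nothing shows up.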

But @titou10titou10, thanks a lot for the investigation. It was a really big help and saved a ton of time.