kxr / ocp4_setup_upi_kvm

Script to Setup an OpenShift 4 UPI Cluster on KVM. Based on this guide: https://kxr.me/2019/08/17/openshift-4-upi-install-libvirt-kvm/

infinite loop operator #29

Closed jesusdevop closed 2 years ago

jesusdevop commented 2 years ago

Hello, the script enters a loop. I attach an example: it reached operator 25 in a process that lasted hours, and I aborted it:

--> Image Downloaded: quay.io/openshift-release-dev/ocp-v4.0-art-dev-68f1a254ca71
--> Image Downloaded: quay.io/openshift-release-dev/ocp-v4.0-art-dev-c13ef4e8416a
--> Phase Completed: ingress-operator-bootstrap
--> Phase Completed: kube-controller-manager-bootstrap
--> Phase Completed: kube-scheduler-bootstrap
--> Image Downloaded: quay.io/openshift-release-dev/ocp-v4.0-art-dev-9af9a3146176
--> Image Downloaded: quay.io/openshift-release-dev/ocp-v4.0-art-dev-99bafb4c397f
--> Image Downloaded: quay.io/openshift-release-dev/ocp-v4.0-art-dev-fee670561aab
--> Phase Completed: cco-bootstrap
--> Phase Completed: mco-bootstrap
--> Container: cloud-credential-operator 0 Exited
--> Container: machine-config-server 0 Running
--> Container: machine-config-controller 0 Exited
--> Image Downloaded: quay.io/openshift-release-dev/ocp-v4.0-art-dev-3acc45bb0d5d
--> Container: kube-apiserver-insecure-readyz 0 Running
--> Container: kube-apiserver 0 Running
--> Container: kube-scheduler 0 Running
--> Container: kube-controller-manager 0 Exited
--> Container: setup 0 Exited
--> Container: cluster-version-operator 0 Running
--> Container: cloud-credential-operator 1 Exited
==> Kubernetes API is Up
--> Container: kube-controller-manager 1 Running
--> Container: cloud-credential-operator 2 Exited
--> Container: cloud-credential-operator 3 Created
--> Container: cloud-credential-operator 3 Running
--> (bootkube.service is active, Kube API is Up)
--> (bootkube.service is active, Kube API is Up)

kxr commented 2 years ago

Looks good so far. This is not really a loop per se. These are just updates from the bootstrap node. If no new activity is detected, the script just reports the status of the bootkube service and of the API. On systems with shared or slower CPUs and/or slow internet connectivity, it can take some time. From what you have shown me, it looks fine and would probably have completed.

Once you see the "==> Kubernetes API is Up" message, you can open a separate shell session and start using the oc CLI to observe the activity in more detail. You will find the kubeconfig file in <setup-dir>/install_dir/auth/kubeconfig. Make a copy of it and point your KUBECONFIG environment variable at the copy. For example:

# If you don't have oc binary installed in your system, copy it from <setup-dir> to /usr/bin
cp <setup-dir>/oc /usr/bin/

# Make a copy of kubeconfig and set it in KUBECONFIG
cp <setup-dir>/install_dir/auth/kubeconfig /tmp/
export KUBECONFIG="/tmp/kubeconfig"

# Start monitoring the cluster progress
watch oc get clusterversion,co,nodes
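
To list only the cluster operators that are not yet available, a small filter along these lines might help. This is a rough sketch and not part of the script: `not_available` is a hypothetical helper name, and it assumes the jsonpath query shown in the comment, which prints one "name<TAB>status" pair per operator.

```shell
# Hypothetical helper (not part of ocp4_setup_upi_kvm.sh): print the names of
# operators whose Available condition is anything other than "True".
# Expects "name<TAB>status" lines on stdin.
not_available() {
    awk -F'\t' '$2 != "True" { print $1 }'
}

# Usage against a live cluster (assumes KUBECONFIG is exported as above):
# oc get co -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="Available")].status}{"\n"}{end}' | not_available
```

An empty result would mean every operator reports Available; operators that have not reported a status yet show up in the list because their status field is empty.
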
jesusdevop commented 2 years ago

It did not complete; the process ran for more than 9 hours without finishing. My internet connection is good.

kxr commented 2 years ago

If possible, can you share:

jesusdevop commented 2 years ago

[root@localhost home]# lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 24
On-line CPU(s) list: 0-23
Thread(s) per core: 2
Core(s) per socket: 6
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
BIOS Vendor ID: Intel(R) Corporation
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Gold 6128 CPU @ 3.40GHz
BIOS Model name: Intel(R) Xeon(R) Gold 6128 CPU @ 3.40GHz
Stepping: 4
CPU MHz: 1202.231
CPU max MHz: 3700.0000
CPU min MHz: 1200.0000

64 GB RAM, 400 GB disk, physical host, OS: CentOS 8
./ocp4_setup_upi_kvm.sh --cluster-name ocp4 --ocp-version 4.5.latest --pull-secret /home/pull-secret --vm-dir /mnt/vms2/images

jesusdevop commented 2 years ago

[Attached image: ocp_install]

jesusdevop commented 2 years ago

It does not finish installing and does not recognize the oc command

jesusdevop commented 2 years ago

The installation still continues

--> Container: kube-apiserver 9 Running
--> Container: kube-apiserver 8 Exited
--> Container: kube-controller-manager 5 Running
--> (bootkube.service is active, Kube API is Up)
--> (bootkube.service is active, Kube API is Up)
--> (bootkube.service is active, Kube API is Up)
--> (bootkube.service is active, Kube API is Up)
--> (bootkube.service is active, Kube API is Up)
--> Container: kube-apiserver 11 Running
--> Container: kube-apiserver 10 Exited
--> Container: kube-controller-manager 4 Running
--> (bootkube.service is active, Kube API is Up)
--> (bootkube.service is active, Kube API is Up)
--> (bootkube.service is active, Kube API is Up)
--> (bootkube.service is active, Kube API is Up)
--> Container: cloud-credential-operator 5 Exited
--> Container: kube-apiserver 12 Running
--> Container: kube-apiserver 11 Exited
--> (bootkube.service is active, Kube API is Up)
--> (bootkube.service is active, Kube API is Up)
--> (bootkube.service is active, Kube API is Up)
--> (bootkube.service is active, Kube API is Up)
--> Container: kube-apiserver 12 Exited
--> Container: kube-apiserver 13 Running
--> (bootkube.service is active, Kube API is Up)
--> (bootkube.service is active, Kube API is Up)
--> (bootkube.service is active, Kube API is Up)
--> Container: kube-controller-manager 6 Exited
--> Container: kube-controller-manager 7 Running

kxr commented 2 years ago

Yeah, it looks like something is wrong.

"does not recognize the oc command"

See my previous comment to make oc work.

jesusdevop commented 2 years ago

Hi, it seems that there is an error in the script, but I have not modified anything. [Attached image: ocp_install]

jesusdevop commented 2 years ago

I have tried different versions of OpenShift (4.2, 4.3, 4.5) and I always get the same result.

jesusdevop commented 2 years ago

[root@localhost auth]# cat kubeconfig clusters:

jesusdevop commented 2 years ago

I observe that it does not create .kube/config. I have followed issue #15; now the output is:

[root@localhost auth]# oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version             False       True          13m     Unable to apply 4.2.36: an unknown error has occurred
[root@localhost auth]# oc get pods -A
NAMESPACE                                    NAME                                                     READY   STATUS      RESTARTS   AGE
openshift-apiserver-operator                 openshift-apiserver-operator-855d65f5dd-tm2gs            0/1     Pending     0          8m16s
openshift-cloud-credential-operator          cloud-credential-operator-7c857c95c-trl9c                0/1     Pending     0          8m16s
openshift-cluster-machine-approver           machine-approver-d5dd4b565-sj2ft                         0/1     Pending     0          8m16s
openshift-cluster-version                    cluster-version-operator-55f9dffd86-7hr5j                1/1     Running     0          8m16s
openshift-controller-manager-operator        openshift-controller-manager-operator-69479fc684-qlnpt   0/1     Pending     0          8m16s
openshift-controller-manager                 controller-manager-tqppx                                 1/1     Running     1          8m16s
openshift-dns-operator                       dns-operator-647bf4d6c8-r9jdd                            0/1     Pending     0          8m16s
openshift-dns                                dns-default-8z8rb                                        2/2     Running     0          10m
openshift-insights                           insights-operator-5f69749bfc-c5tmf                       0/1     Pending     0          8m16s
openshift-kube-apiserver-operator            kube-apiserver-operator-565598c94d-pg52m                 0/1     Pending     0          8m16s
openshift-kube-controller-manager-operator   kube-controller-manager-operator-7bfbf8c6b-t9w5k         0/1     Pending     0          8m11s
openshift-kube-controller-manager            installer-3-localhost                                    0/1     Completed   0          8m15s
openshift-kube-controller-manager            kube-controller-manager-localhost                        2/2     Running     0          7m51s
openshift-kube-scheduler-operator            openshift-kube-scheduler-operator-bcc5fdb-rsww8          0/1     Pending     0          8m16s
openshift-machine-api                        machine-api-operator-5d64f65dd-l69jm                     0/1     Pending     0          8m14s
openshift-machine-config-operator            etcd-quorum-guard-79bcfb468b-8fdbw                       1/1     Running     0          9m39s
openshift-machine-config-operator            etcd-quorum-guard-79bcfb468b-cdpb8                       0/1     Pending     0          9m39s
openshift-machine-config-operator            etcd-quorum-guard-79bcfb468b-qmnn9                       0/1     Pending     0          9m39s
openshift-machine-config-operator            machine-config-controller-848c565968-sbgj4               0/1     Pending     0          8m12s
openshift-machine-config-operator            machine-config-daemon-t8dmc                              1/1     Running     0          10m
openshift-machine-config-operator            machine-config-operator-6977d6df5f-hrfnr                 0/1     Pending     0          8m13s
openshift-machine-config-operator            machine-config-server-96fsw                              1/1     Running     0          9m51s
openshift-multus                             multus-admission-controller-vtmz8                        1/1     Running     0          11m
openshift-multus                             multus-jdfsm                                             1/1     Running     0          13m
openshift-network-operator                   network-operator-7fd8bbd68c-qpnnr                        0/1     Pending     0          8m16s
openshift-sdn                                ovs-wx45q                                                1/1     Running     0          12m
openshift-sdn                                sdn-controller-kb6l9                                     1/1     Running     0          12m
openshift-sdn                                sdn-xqm8g                                                1/1     Running     0          12m
openshift-service-ca-operator                service-ca-operator-6c77dc59c8-72d9g                     0/1     Pending     0          8m11s
openshift-service-ca                         apiservice-cabundle-injector-78f4797b95-7fxrd            0/1     Pending     0          8m11s
openshift-service-ca                         configmap-cabundle-injector-78667ddcf-c944q              0/1     Pending     0          8m11s
openshift-service-ca                         service-serving-cert-signer-56f4599fcc-lwngp             0/1     Pending     0          8m16s

jesusdevop commented 2 years ago

This script is not for me :) Thanks for your work, I will find another script to create my cluster

[root@localhost ~]# oc get clusterversion
Unable to connect to the server: x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "kube-apiserver-lb-signer")

kxr commented 2 years ago

Sure. From the info you have shared, it seems the DNS is not set up as expected. The following pod names include the hostname at the end:

openshift-kube-controller-manager            installer-3-localhost                                    0/1     Completed   0          8m15s
openshift-kube-controller-manager            kube-controller-manager-localhost                        2/2     Running     0          7m51s

The fact that these pod names end with localhost suggests that the CoreOS VMs are not able to pick up their hostnames. This is most likely because of an incorrect DNS setup.
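
To spot this symptom quickly, something like the following could be used. It is only a sketch: `localhost_pods` is a hypothetical helper, not part of the script, and it assumes the default `oc get pods -A` column order (namespace first, pod name second).

```shell
# Hypothetical helper: print namespace/name for every pod whose name ends in
# "-localhost", i.e. pods created on a node that still calls itself localhost.
# Reads the output of `oc get pods -A` on stdin.
localhost_pods() {
    awk '$2 ~ /-localhost$/ { print $1 "/" $2 }'
}

# Usage against a live cluster:
# oc get pods -A | localhost_pods
```

Any output here would point at nodes that never received a proper hostname.
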

Anyway, if by any chance you try this script again, I would advise you to check the DNS setup. You can read more here.
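
As a quick sanity check from the KVM host, a loop like this could report which names fail to resolve. It is a sketch under assumptions: `unresolved_names` is a hypothetical helper, and the names in the usage comment are placeholders to be replaced with your own cluster name and base domain.

```shell
# Hypothetical helper: print every name given as an argument that the host
# cannot resolve. getent follows the host's normal lookup order
# (/etc/nsswitch.conf), so it exercises the same path most programs use.
unresolved_names() {
    for name in "$@"; do
        getent hosts "$name" >/dev/null 2>&1 || echo "$name"
    done
}

# Usage (placeholders, substitute your own values):
# unresolved_names api.<cluster-name>.<base-domain> test.apps.<cluster-name>.<base-domain>
```

No output would mean every name resolved; any name printed is worth checking against the dnsmasq configuration.
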

jesusdevop commented 2 years ago

Good morning, I am trying the script again and the error continues. I have followed the DNS configuration in the README. Can you help me, please?

jesusdevop commented 2 years ago

[root@localhost ~]# sudo dnf install qemu-kvm qemu-img bridge-utils libvirt libvirt-client virt-install libguestfs-tools-c
Updating Subscription Management repositories.
Last metadata expiration check: 3:17:43 ago on Mon 18 Oct 2021 05:12:26 AM CEST.
Package qemu-kvm-15:4.2.0-48.module+el8.4.0+11909+3300d70f.3.x86_64 is already installed.
Package qemu-img-15:4.2.0-48.module+el8.4.0+11909+3300d70f.3.x86_64 is already installed.
Package bridge-utils-1.7.1-2.el8.x86_64 is already installed.
Package libvirt-6.0.0-35.1.module+el8.4.0+11273+64eb94ef.x86_64 is already installed.
Package libvirt-client-6.0.0-35.1.module+el8.4.0+11273+64eb94ef.x86_64 is already installed.
Package virt-install-2.2.1-4.el8.noarch is already installed.
Package libguestfs-tools-c-1:1.40.2-27.module+el8.4.0+9282+0bdec052.x86_64 is already installed.
Dependencies resolved.
Nothing to do.
Complete!
[root@localhost ~]# nmcli con show
NAME    UUID                                  TYPE      DEVICE
eno1    b63287ec-b1c1-4166-9a91-339e38973e5a  ethernet  eno1
virbr0  fbb7c85a-ac12-4b72-aa5c-f695d7986374  bridge    virbr0
eno2    6d81026c-cbab-48a8-9112-13104053c58c  ethernet  --
eno3    7a93de1c-8929-4b70-bf29-b944b09e6f1d  ethernet  --
eno4    ba017829-7567-4285-bca7-7d87d2202855  ethernet  --
eno5    dea0695b-cbd2-4555-917c-1c5acff5ac6c  ethernet  --
eno6    f3101a3c-414c-4757-8b9c-ffd5ed03d2a6  ethernet  --
eno7    78f44cc6-d897-4658-89c4-b9eb2a1591e1  ethernet  --
eno8    d500200d-e427-4dad-b24f-970f55fc3e72  ethernet  --
[root@localhost ~]# echo -e "[main]\ndns=dnsmasq" > /etc/NetworkManager/conf.d/nm-dns.conf
[root@localhost ~]# systemctl restart NetworkManager
[root@localhost ~]# cat /etc/resolv.conf
# Generated by NetworkManager
nameserver 127.0.0.1
options edns0 trust-ad
[root@localhost ~]# cat /etc/systemd/resolved.conf

#  This file is part of systemd.
#
#  systemd is free software; you can redistribute it and/or modify it
#  under the terms of the GNU Lesser General Public License as published by
#  the Free Software Foundation; either version 2.1 of the License, or
#  (at your option) any later version.
#
# Entries in this file show the compile time defaults.
# You can change settings by editing this file.
# Defaults can be restored by simply deleting this file.
#
# See resolved.conf(5) for details

[Resolve]
#DNS=
#FallbackDNS=
#Domains=
#LLMNR=yes
#MulticastDNS=yes
#DNSSEC=allow-downgrade
#DNSOverTLS=no
#Cache=yes
#DNSStubListener=udp
DNS=127.0.0.1
Domains="~."

[root@localhost ~]# cd ocp4_setup_upi_kvm/
[root@localhost ocp4_setup_upi_kvm]# ./ocp4_setup_upi_kvm.sh --cluster-name ocp4 --ocp-version 4.5.latest --pull-secret /home/pull-secret --vm-dir /var/lib/libvirt/images/