Closed: DanyC97 closed this issue 5 years ago
/cc @cgwalters in case you have some thoughts from the RHCOS/MCO side.
are you sure you approved the CSR for your machine https://docs.openshift.com/container-platform/4.1/installing/installing_vsphere/installing-vsphere.html#installation-approve-csrs_installing-vsphere
was passed which then redirected to the custom dani-k8s-node-2.ign file which had the above snippets injected. That triggered a reboot between the ignition apply steps
What do you mean by rebooted between the ignition apply steps? Ignition doesn't require a reboot.
@abhinavdahiya thank you for taking the time to respond, much appreciated !
are you sure you approved the CSR for your machine https://docs.openshift.com/container-platform/4.1/installing/installing_vsphere/installing-vsphere.html#installation-approve-csrs_installing-vsphere
I never needed to approve any CSR; what I saw on previous deployments (for control and compute nodes) was that the CSRs were auto-approved.
I'm not sure if there are two paths here in v4 w.r.t. approving the CSRs (if there are, please help me understand how they work). All I can say is that, looking at the cluster I have running (where the above node didn't join the cluster), I see
oc get -n openshift-cluster-machine-approver all
NAME READY STATUS RESTARTS AGE
pod/machine-approver-7cd7f97455-g9x5q 1/1 Running 0 11d
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/machine-approver 1/1 1 1 11d
NAME DESIRED CURRENT READY AGE
replicaset.apps/machine-approver-7cd7f97455 1 1 1 11d
which comes from cluster-machine-approver, and in the pod's log I can see traces like
I0624 20:31:13.706332 1 main.go:107] CSR csr-5tpq5 added
I0624 20:31:13.708184 1 main.go:132] CSR csr-5tpq5 not authorized: Invalid request
I0624 20:31:13.708206 1 main.go:164] Error syncing csr csr-5tpq5: Invalid request
I0624 20:31:13.748375 1 main.go:107] CSR csr-5tpq5 added
I0624 20:31:13.750478 1 main.go:132] CSR csr-5tpq5 not authorized: Invalid request
I0624 20:31:13.750496 1 main.go:164] Error syncing csr csr-5tpq5: Invalid request
I0624 20:31:13.830649 1 main.go:107] CSR csr-5tpq5 added
I0624 20:31:13.833242 1 main.go:132] CSR csr-5tpq5 not authorized: Invalid request
E0624 20:31:13.833321 1 main.go:174] Invalid request
I0624 20:31:13.833366 1 main.go:175] Dropping CSR "csr-5tpq5" out of the queue: Invalid request
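For reference, the rejected CSR names can be pulled out of a saved copy of that pod log with a quick grep/awk pipeline. This is just a sketch; the log path is hypothetical and the sample lines are copied from the excerpt above:

```shell
# Hypothetical saved copy of the machine-approver pod log
# (sample lines taken from the excerpt above)
cat > /tmp/machine-approver.log <<'EOF'
I0624 20:31:13.706332 1 main.go:107] CSR csr-5tpq5 added
I0624 20:31:13.708184 1 main.go:132] CSR csr-5tpq5 not authorized: Invalid request
I0624 20:31:13.708206 1 main.go:164] Error syncing csr csr-5tpq5: Invalid request
I0624 20:31:13.750478 1 main.go:132] CSR csr-5tpq5 not authorized: Invalid request
EOF

# Each rejected CSR name with how many times the approver refused it;
# in this log format the CSR name is the 6th whitespace-separated field
grep 'not authorized' /tmp/machine-approver.log | awk '{print $6}' | sort | uniq -c
```

The same pipeline works against live output from `oc logs` if you redirect it to a file first.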
That said, I am not sure if this is part of the machine-api-operator, as I haven't worked out how to trace backwards from a deployment to its operator (any hints much appreciated ;)), but I guess it is.
Update: see attached the whole machine-approver-7cd7f97455-g9x5q pod's log, in case you might find something.
was passed which then redirected to the custom dani-k8s-node-2.ign file which had the above snippets injected. That triggered a reboot between the ignition apply steps
What do you mean by rebooted between the ignition apply steps? Ignition doesn't require a reboot.
Please see the dani-k8s-node-2_journalctl.log attached. I've added some bookmarks (### dani ###) so you can see why I assumed the above sequence triggered the reboot.
@abhinavdahiya any thoughts? Or maybe @staebler might be able to chime in? If you need any additional info please let me know; I'm keeping the lab env up in case more info is needed, hopefully it will help you out.
@cgwalters @abhinavdahiya can either of you please help me understand the design/behavior of the kubelet service not being started on worker nodes?
@DanyC97 the controller-manager within the openshift-controller-manager namespace actually does the issuing of the certificate. The logs there might help.
@DanyC97 The cluster-machine-approver will only approve machines that are added via a machine resource. The vSphere platform does not use machine resources yet. So, as a user, you must manually approve CSR requests for new nodes. As a convenience, the bootstrap machine will auto-approve CSR requests for nodes while it is running. However, that should not be relied upon.
@DanyC97 The cluster-machine-approver will only approve machines that are added via a machine resource. The vSphere platform does not use machine resources yet. So, as a user, you must manually approve CSR requests for new nodes.
Oh, so there are 2 paths; many many thanks for sharing this info @staebler!
As a convenience, the bootstrap machine will auto-approve CSR requests for nodes while it is running. However, that should not be relied upon.
Right, so I'll try a test of switching off the bootstrap node and see if the CSR requests are left in a pending state; that should confirm it.
@DanyC97 the controller-manager within the openshift-controller-manager namespace actually does the issuing of the certificate. The logs there might help.
I'll check @rphillips, thanks for the info! I'll be curious to understand the whole flow, as things don't add up (yet) in my head:
* if _vSphere platform does not use machine resources yet_, then no _cluster-machine-approver_ - OK
* if no _cluster-machine-approver_, then let's say we fall back to the bootstrap node - TBC
* so then where does the _controller-manager within the openshift-controller-manager_ fit in the whole picture? Because it is not part of the bootstrap node, is it? -> will find out
Sadly @rphillips the only output I see in the pods running in the openshift-controller-manager ns is
W0629 08:33:30.961010 1 reflector.go:256] k8s.io/client-go/informers/factory.go:132: watch of *v1.ConfigMap ended with: too old resource version: 4962577 (4963900)
W0629 08:33:49.357787 1 reflector.go:256] github.com/openshift/client-go/template/informers/externalversions/factory.go:101: watch of *v1.TemplateInstance ended with: The resourceVersion for the provided watch is too old.
W0629 08:35:23.082164 1 reflector.go:256] github.com/openshift/client-go/route/informers/externalversions/factory.go:101: watch of *v1.Route ended with: The resourceVersion for the provided watch is too old.
W0629 08:36:01.201731 1 reflector.go:256] github.com/openshift/client-go/build/informers/externalversions/factory.go:101: watch of *v1.Build ended with: The resourceVersion for the provided watch is too old.
W0629 08:36:21.794776 1 reflector.go:256] github.com/openshift/client-go/apps/informers/externalversions/factory.go:101: watch of *v1.DeploymentConfig ended with: The resourceVersion for the provided watch is too old.
W0629 08:37:07.652800 1 reflector.go:256] github.com/openshift/client-go/image/informers/externalversions/factory.go:101: watch of *v1.ImageStream ended with: The resourceVersion for the provided watch is too old.
W0629 08:38:49.543850 1 reflector.go:256] github.com/openshift/origin/pkg/unidling/controller/unidling_controller.go:199: watch of *v1.Event ended with: The resourceVersion for the provided watch is too old.
W0629 08:41:06.451980 1 reflector.go:256] github.com/openshift/client-go/template/informers/externalversions/factory.go:101: watch of *v1.TemplateInstance ended with: The resourceVersion for the provided watch is too old.
W0629 08:41:17.966140 1 reflector.go:256] k8s.io/client-go/informers/factory.go:132: watch of *v1.ConfigMap ended with: too old resource version: 4964042 (4965587)
so not much related to CSRs
@staebler you were spot on, Sir!
I've turned off the bootstrap node and it behaves as per the docs:
[root@dani-dev ~]# oc get csr
NAME AGE REQUESTOR CONDITION
csr-28p87 3m4s system:serviceaccount:openshift-machine-config-operator:node-bootstrapper Pending
which imo is a docs bug.
And for folks who need to know if the bootstrap node approved anything,
journalctl | grep approve-csr | grep -v "No resources found"
will do the job.
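To script the check, the Pending CSRs can be filtered out of saved `oc get csr` output; on a live cluster you would then feed those names to `oc adm certificate approve`. This is a sketch; the file path is hypothetical and the sample row is copied from above:

```shell
# Hypothetical saved 'oc get csr' output (row copied from above)
cat > /tmp/csr.txt <<'EOF'
NAME        AGE    REQUESTOR                                                                   CONDITION
csr-28p87   3m4s   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
EOF

# Names of CSRs still waiting for approval; on a live cluster these would
# be piped into 'oc adm certificate approve' (not run here)
awk '$NF == "Pending" {print $1}' /tmp/csr.txt
```

The `$NF == "Pending"` condition skips the header line, since its last field is `CONDITION`.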
was passed which then redirected to the custom dani-k8s-node-2.ign file which had injected the above snippets. That triggered a reboot between the ignition apply steps
what do you mean by rebooted between ignition apply steps.. ignition doesn't require reboot.
please see the dani-k8s-node-2_journalctl.log attached. I've added some bookmarks
### dani ###
so you can see why i assumed the above sequence triggered the reboot.
@abhinavdahiya if you can point me in the right direction, i'd be happy to keep digging.
which imo is a docs bug.
@DanyC97 Can you be more specific about how you feel that the docs are deficient?
The following is an excerpt from the docs. It implies to me that the CSRs may be approved automatically or may need to be approved manually. It is the responsibility of the user to verify that the CSRs are approved--whether automatically or manually.
When you add machines to a cluster, two pending certificates signing request (CSRs) are generated for each machine that you added. You must confirm that these CSRs are approved or, if necessary, approve them yourself.
Sure, sorry I wasn't clear enough.
Indeed it does imply that. However, re You must confirm that these CSRs are approved: if you try to double-check whether they were approved, you won't be able to do so without a bit more info, because
oc get csr
will show no output (e.g. Approved, Issued); they've already been dealt with by the bootstrap node. Maybe the assumption here is that if you see no Pending certs then everything worked okay. Imo having a section - a small note and/or paragraph - to mention what you taught me here is miles better, hence me saying it's a bug.
Also, in the docs it says in step 1:
Confirm that the cluster recognizes the machines:
but you can't confirm that, since there are no nodes in a NotReady state; they will only appear if the kubelet service is running. However, if you are unlucky like me, then the docs won't help much.
Maybe it's a section for troubleshooting; either way, I think a clue can be added to help folks.
Update: sorry if I was too strong claiming the docs issue is a bug; it's maybe an enhancement.
Ignition should have enabled kubelet.service; it's part of the Ignition generated by the MCO.
Right, after spinning up new nodes again, I found out that the dependent service
cat /etc/systemd/system/kubelet.service
[Unit]
Description=Kubernetes Kubelet
Wants=rpc-statd.service
hasn't started ...hmmm
systemctl status rpc-statd
● rpc-statd.service - NFS status monitor for NFSv2/3 locking.
Loaded: loaded (/usr/lib/systemd/system/rpc-statd.service; static; vendor preset: disabled)
Active: inactive (dead)
[root@localhost ~]# cat /usr/lib/systemd/system/rpc-statd.service
[Unit]
Description=NFS status monitor for NFSv2/3 locking.
DefaultDependencies=no
Conflicts=umount.target
Requires=nss-lookup.target rpcbind.socket
Wants=network-online.target
After=network-online.target nss-lookup.target rpcbind.socket
PartOf=nfs-utils.service
[Service]
Environment=RPC_STATD_NO_NOTIFY=1
Type=forking
PIDFile=/var/run/rpc.statd.pid
ExecStart=/usr/sbin/rpc.statd
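Worth noting: `Wants=` is systemd's weak dependency, so an inactive rpc-statd.service should not by itself block kubelet.service from starting (a hard `Requires=` or `BindsTo=` would). That can be checked by grepping the unit's dependency directives; a sketch over a local copy of the stanza shown above:

```shell
# Local copy of the kubelet unit's [Unit] stanza, as shown above
cat > /tmp/kubelet.service <<'EOF'
[Unit]
Description=Kubernetes Kubelet
Wants=rpc-statd.service
EOF

# Wants= pulls rpc-statd.service in but does not fail kubelet if it stays
# inactive; Requires=/BindsTo= would be the hard dependencies to worry about
grep -E '^(Wants|Requires|BindsTo|After)=' /tmp/kubelet.service
```

Since only `Wants=` shows up, the dead rpc-statd is likely a red herring for kubelet not starting.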
and the dependencies for rpc-statd are
systemctl list-dependencies rpc-statd
rpc-statd.service
● ├─rpcbind.socket
● ├─system.slice
● ├─network-online.target
● │ └─NetworkManager-wait-online.service
● └─nss-lookup.target
where rpcbind.service is not running.
And network-online.target is up and happy:
systemctl status network-online.target
● network-online.target - Network is Online
Loaded: loaded (/usr/lib/systemd/system/network-online.target; static; vendor preset: disabled)
Active: active since Tue 2019-07-02 12:29:55 UTC; 18min ago
Docs: man:systemd.special(7)
https://www.freedesktop.org/wiki/Software/systemd/NetworkTarget
Jul 02 12:29:55 dani-k8s-node-2.dani.local systemd[1]: Reached target Network is Online.
Updating in case someone else bumps into this issue.
A) the initial issue was around the fact that nodes were failing to join the cluster because kubelet.service was not running.
After a lot of digging (a few dead ends and lots of bunny hops) it turned out the problem was caused by my DNS setup; in particular, my nodes' FQDN was not within the subdomain where the api and api-int endpoints were.
E.g
NOK
dani-k8s-node-1.dani.local
endpoints => api-int.dev-okd4.dani.local
OK
dani-k8s-node-1.dev-okd4.dani.local
api-int.dev-okd4.dani.local
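The NOK/OK distinction above boils down to a suffix check: the node FQDN must sit under the same base domain as api/api-int. A minimal sketch, assuming the domains from the example above (the helper name is just an illustration):

```shell
# Hypothetical helper: is this node FQDN under the cluster's base domain?
cluster_domain="dev-okd4.dani.local"
in_cluster_domain() {
  case "$1" in
    *."$cluster_domain") echo "OK:  $1" ;;
    *)                   echo "NOK: $1 is outside $cluster_domain" ;;
  esac
}

in_cluster_domain "dani-k8s-node-1.dani.local"            # the failing setup
in_cluster_domain "dani-k8s-node-1.dev-okd4.dani.local"   # the working setup
```

Running a check like this against planned hostnames before provisioning would have caught the mismatch early.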
B) the second issue (which was more like a question for my own knowledge) was around how the CSRs were approved. With @staebler's help I've understood there are 2 paths:
* if the bootstrap node is still up, the CSRs are auto-approved
* if the bootstrap node is down, the CSRs must be approved manually
Probably what we should have done is have the bootstrap approve CSRs for masters only, or so... that's all we actually need for installs, and having it do workers too adds confusion.
@cgwalters thinking out loud, I think it's okay to have both use-cases; especially when you build a 100-node cluster, you don't need a human to accept the CSRs, nor to keep running/watching your favorite tool/script for pending CSRs.
A note in the docs imo will be exactly what folks need.
I do not think that we want to advise leaving the bootstrap node running any longer than necessary.
@DanyC97 could you elaborate on your findings regarding DNS a bit more? I'm bumping into the same issue as you describe when adding a node to an existing cluster. The initially provisioned cluster doesn't have FQDN hostnames within the api/api-int endpoint URL either, so I'm curious how this is related?
What comes to my attention is the fact there are no symbolic links for kubelet or machine-config-daemon in /etc/systemd/system/multi-user.target.wants/ on the extra worker nodes, but they are present on the initial installation worker nodes. Any thoughts?
Version
RHCOS
Platform (aws|libvirt|openstack):
vmware
What happened?
Deployed a UPI VMware cluster and then started to add a new node. The VM booted, the ignition kicked in and laid down RHCOS, including setting the static IP and the hostname; however, the node never joined the K8s cluster:
oc get nodes
didn't show the new node.
What you expected to happen?
I expected the new node to show up in the oc get nodes output.
How to reproduce it (as minimally and precisely as possible)?
1) create a cluster
2) scale up a node using a custom dani-k8s-node-2.ign file where the below files were added such that we can set a static IP and hostname
Note that a dummy ignition file was passed, which then redirected to the custom dani-k8s-node-2.ign file which had the above snippets injected. That triggered a reboot between the ignition apply steps.
3) observe if the node joins the cluster; if not, then check if kubelet.service is up and running.
Note - I guess your terraform UPI vSphere example should allow you to reproduce the issue; I haven't tried using your code.
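For context, a "dummy ignition file that redirects to another one" is typically done with Ignition's config.replace stanza (spec 2.x, as used in the OpenShift 4.1 era). A sketch of what such a pointer config could look like; the URL is a hypothetical placeholder:

```json
{
  "ignition": {
    "version": "2.2.0",
    "config": {
      "replace": {
        "source": "http://example.local/dani-k8s-node-2.ign"
      }
    }
  }
}
```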
Anything else we need to know?
On closer inspection while ssh'ed onto the node I found the following: kubelet.service not active and disabled
[root@dani-k8s-node-2 ~]# systemctl status kubelet
● kubelet.service - Kubernetes Kubelet
Loaded: loaded (/etc/systemd/system/kubelet.service; disabled; vendor preset: enabled)
Active: inactive (dead)
[root@dani-k8s-node-2 ~]# cat /etc/systemd/system/kubelet.service
[Unit]
Description=Kubernetes Kubelet
Wants=rpc-statd.service

[Service]
Type=notify
ExecStartPre=/bin/mkdir --parents /etc/kubernetes/manifests
ExecStartPre=/bin/rm -f /var/lib/kubelet/cpu_manager_state
EnvironmentFile=/etc/os-release
EnvironmentFile=-/etc/kubernetes/kubelet-workaround
EnvironmentFile=-/etc/kubernetes/kubelet-env

ExecStart=/usr/bin/hyperkube \
  kubelet \
    --config=/etc/kubernetes/kubelet.conf \
    --bootstrap-kubeconfig=/etc/kubernetes/kubeconfig \
    --kubeconfig=/var/lib/kubelet/kubeconfig \
    --container-runtime=remote \
    --container-runtime-endpoint=/var/run/crio/crio.sock \
    --allow-privileged \
    --node-labels=node-role.kubernetes.io/worker,node.openshift.io/os_version=${VERSION_ID},node.openshift.io/os_id=${ID} \
    --minimum-container-ttl-duration=6m0s \
    --volume-plugin-dir=/etc/kubernetes/kubelet-plugins/volume/exec \
    --client-ca-file=/etc/kubernetes/ca.crt \
    --cloud-provider=vsphere \
    --anonymous-auth=false \
    --v=3

Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target
[root@dani-k8s-node-2 ~]# ll /etc/systemd/system/multi-user.target.wants/
total 0
lrwxrwxrwx. 1 root root 39 May 20 23:06 chronyd.service -> /usr/lib/systemd/system/chronyd.service
lrwxrwxrwx. 1 root root 70 May 20 23:06 console-login-helper-messages-issuegen.service -> /usr/lib/systemd/system/console-login-helper-messages-issuegen.service
lrwxrwxrwx. 1 root root 47 May 20 23:06 coreos-growpart.service -> /usr/lib/systemd/system/coreos-growpart.service
lrwxrwxrwx. 1 root root 69 May 20 23:06 coreos-regenerate-iscsi-initiatorname.service -> /usr/lib/systemd/system/coreos-regenerate-iscsi-initiatorname.service
lrwxrwxrwx. 1 root root 67 May 20 23:06 coreos-root-bash-profile-workaround.service -> /usr/lib/systemd/system/coreos-root-bash-profile-workaround.service
lrwxrwxrwx. 1 root root 51 May 20 23:06 coreos-useradd-core.service -> /usr/lib/systemd/system/coreos-useradd-core.service
lrwxrwxrwx. 1 root root 36 May 20 23:06 crio.service -> /usr/lib/systemd/system/crio.service
lrwxrwxrwx. 1 root root 59 May 20 23:06 ignition-firstboot-complete.service -> /usr/lib/systemd/system/ignition-firstboot-complete.service
lrwxrwxrwx. 1 root root 42 May 20 23:06 irqbalance.service -> /usr/lib/systemd/system/irqbalance.service
lrwxrwxrwx. 1 root root 41 May 20 23:06 mdmonitor.service -> /usr/lib/systemd/system/mdmonitor.service
lrwxrwxrwx. 1 root root 46 May 20 23:06 NetworkManager.service -> /usr/lib/systemd/system/NetworkManager.service
lrwxrwxrwx. 1 root root 51 Jun 17 17:24 ostree-finalize-staged.path -> /usr/lib/systemd/system/ostree-finalize-staged.path
lrwxrwxrwx. 1 root root 37 May 20 23:06 pivot.service -> /usr/lib/systemd/system/pivot.service
lrwxrwxrwx. 1 root root 48 May 20 23:06 remote-cryptsetup.target -> /usr/lib/systemd/system/remote-cryptsetup.target
lrwxrwxrwx. 1 root root 40 May 20 23:06 remote-fs.target -> /usr/lib/systemd/system/remote-fs.target
lrwxrwxrwx. 1 root root 53 May 20 23:06 rpm-ostree-bootstatus.service -> /usr/lib/systemd/system/rpm-ostree-bootstatus.service
lrwxrwxrwx. 1 root root 36 May 20 23:06 sshd.service -> /usr/lib/systemd/system/sshd.service
lrwxrwxrwx. 1 root root 40 May 20 23:06 vmtoolsd.service -> /usr/lib/systemd/system/vmtoolsd.service
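The missing-symlink observation is exactly what `systemctl enable kubelet.service` would address: given the `[Install] WantedBy=multi-user.target` stanza above, enabling the unit just creates a symlink in the multi-user.target.wants directory. A sketch of the effect, demonstrated in a scratch directory rather than the real /etc:

```shell
# What 'systemctl enable kubelet.service' effectively does, given
# [Install] WantedBy=multi-user.target: create this symlink
# (shown here in a scratch directory, not the real /etc)
wants_dir=/tmp/demo-multi-user.target.wants
mkdir -p "$wants_dir"
ln -sf /etc/systemd/system/kubelet.service "$wants_dir/kubelet.service"

ls -l "$wants_dir"
```

On the broken nodes, Ignition (via the MCO-generated config) was expected to create that symlink; its absence is why kubelet showed up as disabled.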