To be clear, it is actively deleting nodes from the cluster when the instance is stopped or restarted. This seems like a bad thing, especially when the nodes in question are part of the control plane and may require manual intervention to rejoin the cluster.
This is not CPO OCCM behavior; it's the upstream cloud-provider node lifecycle controller doing this: https://github.com/kubernetes/cloud-provider/blob/master/controllers/nodelifecycle/node_lifecycle_controller.go#L164

From the code, the function that failed is `ensureNodeExistsByProviderID`, and this aligns with your log:

```
E1220 04:41:48.918351 1 node_controller.go:244] Error getting instance metadata for node addresses: error fetching node by provider ID: ProviderID "" didn't match expected format "openstack:///InstanceID", and error by node name: failed to find object
```
Usually the node has a ProviderID like `openstack:///xxxxx-xxxx`, where `xxxx-xxx` is the UUID of the instance. The ID actually comes from https://github.com/kubernetes/cloud-provider-openstack/blob/master/pkg/openstack/instances.go#L515
So could you please check whether you are using the metadata service or the config drive? I am wondering whether this function obtains no data and returns "" instead. And what about the other instances (nodes)? `kubectl describe node` might give us more info about the other masters you have.
We should add some logging to indicate the error occurring here (a follow-up), but the root-cause analysis should focus on whether you can get the instance ID here and, if not, why...
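For reference, a rough way to check which metadata source is available on a node (a sketch assuming the default metadata endpoint and the usual `config-2` label for the config drive; adjust for your environment):

```sh
# Is the metadata service reachable from the instance?
curl -s http://169.254.169.254/openstack/latest/meta_data.json

# Is a config drive attached? (it is normally labelled config-2)
ls -l /dev/disk/by-label/config-2
```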
I was told that we use the metadata service. I attached the output of `kubectl describe nodes` and the OCCM log. The cluster we are testing with has only the CPO OCCM deployed on it. nodes_occm.zip
Thanks for the info.
I saw `I1221 04:38:22.221093 1 flags.go:64] FLAG: --v="2"` in the log. Can you help make it `--v=5`?

We have the following code, so with the metadata service I would guess some error will be reported:
```go
func getFromMetadataService(metadataVersion string) (*Metadata, error) {
	// Try to get JSON from metadata server.
	metadataURL := getMetadataURL(metadataVersion)
	klog.V(4).Infof("Attempting to fetch metadata from %s", metadataURL)
	resp, err := http.Get(metadataURL)
	if err != nil {
		return nil, fmt.Errorf("error fetching %s: %v", metadataURL, err)
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		err = fmt.Errorf("unexpected status code when reading metadata from %s: %s", metadataURL, resp.Status)
		return nil, err
	}

	return parseMetadata(resp.Body)
}
```
> I was told that we use the metadata service.
Note the comment at https://github.com/kubernetes/cloud-provider-openstack/blob/297a3fcb55c3ddd3f9cb4ac1f1bb3208f7f5e3bf/pkg/openstack/instances.go#L372-L375
This would seem to indicate that this is the expected behavior if only using metadata service. You should be using something that has access to info for ALL nodes, not just the node that the CCM is running on.
Thanks. I don't have much experience working with OpenStack. How do I configure the CCM to use a different service instead of the metadata service? @jichenjc Is that something we have to configure on the OpenStack side?
Check https://docs.openstack.org/nova/latest/user/metadata.html and select config drive instead; the metadata will then be created on your disk at a pre-defined location.
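As a rough illustration only (image, flavor, and network names here are placeholders, not values from this issue), an instance can be booted with a config drive attached via the standard Nova option:

```sh
# Boot a server with a config drive; metadata is then available on the attached drive
openstack server create --image <image> --flavor <flavor> \
  --network <network> --config-drive True <server-name>
```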
As far as I can tell, neither the config drive nor the instance metadata service would allow a node to retrieve information about other nodes. It would need to call out to the OpenStack compute service to do that. It seems like it is using `github.com/gophercloud/gophercloud/openstack/compute/v2/servers` to do this. Can you confirm that your nodes have access to list servers?
Config drive will only be read from local files: https://github.com/kubernetes/cloud-provider-openstack/blob/master/pkg/openstack/instances.go#L524
Same for instance metadata though; it only returns data for the calling instance. So I don't see how either of these could work here.
> As far as I can tell, neither the config drive nor the instance metadata service would allow a node to retrieve information about other nodes. It would need to call out to the OpenStack compute service to do that. It seems like it is using `github.com/gophercloud/gophercloud/openstack/compute/v2/servers` to do this. Can you confirm that your nodes have access to list servers?
I am using OpenStack Keystone admin credentials for this, so it should be able to list servers.
Do you all have a suggestion for any workaround for this?
> Can you help make it `--v=5`?

1) Set `--v=5` to get more logs, to confirm whether or not it's a metadata issue. 2) Switch to config drive per the above comments and give it another try.
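For example (assuming the OCCM runs as the standard `openstack-cloud-controller-manager` DaemonSet in `kube-system`; adjust if it was deployed differently, e.g. through Helm values):

```sh
# Edit the DaemonSet and change the container argument --v=2 to --v=5
kubectl -n kube-system edit daemonset openstack-cloud-controller-manager
```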
Here is the log with the verbosity parameter set to `--v=5`: occm.log
I can't have config drive configured until next week. I will post an update then.
The logs suggest that none of your nodes have a ProviderID set. Did you bring this cluster up without the CCM deployed and then add it later? I'm curious how it ended up in this state. Do you see a ProviderID set in the node yaml?
OK, you are using 1.23, but it should work.

From the log:

```
I1222 08:31:51.824250 1 instances.go:131] NodeAddressesByProviderID () called
I1222 08:31:51.824313 1 instances.go:116] NodeAddresses(master3) called
```

This indicates there is no provider ID set at all. I don't know what happened in your environment, as `Node.Spec.ProviderID` is "" and that leads to the subsequent error; I'm not sure whether it's because the metadata service is not reachable, leading to the empty value.

As a workaround, set `openstack:///` as the `Node.Spec.ProviderID` and see whether it works.
> The logs suggest that none of your nodes have a ProviderID set. Did you bring this cluster up without the CCM deployed and then add it later? I'm curious how it ended up in this state. Do you see a ProviderID set in the node yaml?
The way I am testing is that
I don't find a ProviderID in the nodes' YAML. How do I set the ProviderID for my nodes?
> OK, you are using 1.23, but it should work.
>
> From the log:
>
> ```
> I1222 08:31:51.824250 1 instances.go:131] NodeAddressesByProviderID () called
> I1222 08:31:51.824313 1 instances.go:116] NodeAddresses(master3) called
> ```
>
> This indicates there is no provider ID set at all. I don't know what happened in your environment, as `Node.Spec.ProviderID` is "" and that leads to the subsequent error; I'm not sure whether it's because the metadata service is not reachable, leading to the empty value.
>
> As a workaround, set `openstack:///` as the `Node.Spec.ProviderID` and see whether it works.
I had been testing with 1.24 until today; the issue is still the same with 1.23. Also, when the OCCM is running on the cluster, I can't join new master nodes. A new master can join only after I uninstall the OCCM.
How do I set `openstack:///` as the `Node.Spec.ProviderID`? I did not find any environment variable related to the OCCM or the provider in the OCCM pod.
The ProviderID should be set when you deploy the cluster with the cloud-provider set to "external". The nodes will block with an "uninitialized" taint until the cloud provider chart is deployed, at which point the openstack cloud-controller manager will come up and add the correct labels and providerIds to the nodes. If you are skipping this step - not setting the cloud provider to external and deploying the OpenStack cloud controller manager during initial cluster load - then that would explain why things are not working right.
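A quick way to check both of these on a node (the node name is a placeholder):

```sh
# The uninitialized taint should disappear and a ProviderID should appear once the CCM initializes the node
kubectl describe node <node-name> | grep -E 'Taints|ProviderID'
```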
I set up another RKE2 cluster with the "cloud-provider-name: external" option. All the masters had the `node.cloudprovider.kubernetes.io/uninitialized=true:NoSchedule` taint, but when I turn off one of the master nodes, the issue still occurs. The OCCM still could not get the ID of any of the masters, as you can see in the log.
occm.log
I will post an update here after I configure OpenStack with a config drive.
> All the masters had node.cloudprovider.kubernetes.io/uninitialized=true:NoSchedule taints
That indicates that the cloud controller manager isn't coming up. When it does, that taint will be removed.
Configuring the config drive did not help either. I attached the latest OCCM log. There are some lines in the log with errors; I am not sure whether my problem is related to them.

```
I1227 09:06:55.911937 1 leaderelection.go:352] lock is held by worker1_44ce5b9c-2868-4b70-8760-b17c245d0af5 and has not yet expired
I1227 09:06:55.911958 1 leaderelection.go:253] failed to acquire lease kube-system/cloud-controller-manager
```

I don't understand these lines because I don't have any cloud-controller-manager resources in the kube-system namespace.

```
W1227 09:06:54.433015 1 openstack.go:325] Failed to create an OpenStack Secret client: unable to initialize keymanager client for region RegionOne: No suitable endpoint could be found in the service catalog.
```
Apparently we don't have the key manager (Barbican) enabled on our OpenStack; is it required? From this link, it seems that it is not needed.

```
W1227 09:06:54.494328 1 openstack.go:445] Error initialising Routes support: router-id not set in cloud provider config
```

I did not find anywhere to set this router-id in the Helm chart.
Also, the metadata service seems to be working fine. I can run this command on my nodes in the Kubernetes cluster and it returns an appropriate response:

```
curl http://169.254.169.254/openstack/2012-08-10/meta_data.json
```
> All the masters had node.cloudprovider.kubernetes.io/uninitialized=true:NoSchedule taints
>
> That indicates that the cloud controller manager isn't coming up. When it does, that taint will be removed.

The taints never get removed in my cluster.
> I1227 09:06:55.911937 1 leaderelection.go:352] lock is held by worker1_44ce5b9c-2868-4b70-8760-b17c245d0af5 and has not yet expired
> I1227 09:06:55.911958 1 leaderelection.go:253] failed to acquire lease kube-system/cloud-controller-manager
>
> I don't understand these lines because I don't have any cloud-controller-manager resources in the kube-system namespace.
Are you sure? Did you run `kubectl get lease -n kube-system`?
This message indicates that you're running multiple replicas of the CCM. Do you perhaps have another CCM (possibly not even the openstack CCM) running? This would include the RKE2 default CCM, if you haven't disabled it.
> This message indicates that you're running multiple replicas of the CCM. Do you perhaps have another CCM (possibly not even the openstack CCM) running? This would include the RKE2 default CCM, if you haven't disabled it.
I was installing and deleting the OCCM with different configurations, and this was a leftover lease from a previous installation of the OCCM. Uninstalling the OCCM Helm chart apparently does not automatically delete the lease. I disabled the RKE2 default CCM.
I ran a DevStack and tested the OCCM on it. The problem was exactly the same.
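For the record, the leftover lease (named `cloud-controller-manager`, per the log above) can be listed and removed with:

```sh
kubectl -n kube-system get lease
kubectl -n kube-system delete lease cloud-controller-manager
```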
> As a workaround, set `openstack:///` as the `Node.Spec.ProviderID` and see whether it works.

Can you show me how and where I can set it? Thank you.
I tried DevStack as well, and it works fine (I used Cluster API, and it's the same as other k8s solutions since it's based on kubeadm), so other DevStack-related k8s creation processes should behave the same.
I got 2 nodes:
```
+--------------------------------------+-------------------------------------+--------+------------+-------------+------------------------------------------------------------------------+
| ID                                   | Name                                | Status | Task State | Power State | Networks                                                               |
+--------------------------------------+-------------------------------------+--------+------------+-------------+------------------------------------------------------------------------+
| c1631c0b-019e-47d6-8d60-aaea3c09a6cb | capi-quickstart-control-plane-qx5tx | ACTIVE | -          | Running     | k8s-clusterapi-cluster-default-capi-quickstart=10.6.0.162, 172.24.4.65 |
| 55128e47-7494-4a8b-a3bb-b717c95666f3 | capi-quickstart-md-0-bvdm2          | ACTIVE | -          | Running     | k8s-clusterapi-cluster-default-capi-quickstart=10.6.0.149
```
They are not Ready yet:
```
# kubectl --kubeconfig=./${CLUSTER_NAME}.kubeconfig get nodes
NAME                                  STATUS     ROLES                  AGE   VERSION
capi-quickstart-control-plane-qx5tx   NotReady   control-plane,master   12d   v1.23.10
capi-quickstart-md-0-bvdm2            NotReady   <none>                 12d   v1.23.10
```
Apply the latest manifest files from cloud-provider-openstack (OCCM):
```
root@jitest43:~# kubectl --kubeconfig=./${CLUSTER_NAME}.kubeconfig apply -f https://raw.githubusercontent.com/kubernetes/cloud-provider-openstack/master/manifests/controller-manager/cloud-controller-manager-roles.yaml
clusterrole.rbac.authorization.k8s.io/system:cloud-controller-manager created
clusterrole.rbac.authorization.k8s.io/system:cloud-node-controller created
root@jitest43:~# kubectl --kubeconfig=./${CLUSTER_NAME}.kubeconfig apply -f https://raw.githubusercontent.com/kubernetes/cloud-provider-openstack/master/manifests/controller-manager/cloud-controller-manager-role-bindings.yaml
clusterrolebinding.rbac.authorization.k8s.io/system:cloud-node-controller created
clusterrolebinding.rbac.authorization.k8s.io/system:cloud-controller-manager created
root@jitest43:~# kubectl --kubeconfig=./${CLUSTER_NAME}.kubeconfig apply -f https://raw.githubusercontent.com/kubernetes/cloud-provider-openstack/master/manifests/controller-manager/openstack-cloud-controller-manager-ds.yaml
serviceaccount/cloud-controller-manager created
daemonset.apps/openstack-cloud-controller-manager created
```
We can see the OCCM pod being created:
```
root@jitest43:~# kubectl --kubeconfig=./${CLUSTER_NAME}.kubeconfig get pod -n kube-system
NAME                                                          READY   STATUS              RESTARTS   AGE
calico-kube-controllers-d569cccf-xvphn                        0/1     Pending             0          115s
calico-node-9j2j5                                             1/1     Running             0          115s
calico-node-lzq59                                             1/1     Running             0          115s
coredns-6d4b75cb6d-fd2sg                                      0/1     Pending             0          12d
coredns-6d4b75cb6d-jrvzx                                      0/1     Pending             0          12d
etcd-capi-quickstart-control-plane-qx5tx                      1/1     Running             0          12d
kube-apiserver-capi-quickstart-control-plane-qx5tx            1/1     Running             0          12d
kube-controller-manager-capi-quickstart-control-plane-qx5tx   1/1     Running             0          12d
kube-proxy-2tb88                                              1/1     Running             0          12d
kube-proxy-b4sx2                                              1/1     Running             0          12d
kube-scheduler-capi-quickstart-control-plane-qx5tx            1/1     Running             0          12d
openstack-cloud-controller-manager-8vlbv                      0/1     ContainerCreating   0          7s
```
With the OCCM running, other pods switch from Pending to Running:
```
root@jitest43:~# kubectl --kubeconfig=./${CLUSTER_NAME}.kubeconfig get pod -n kube-system
NAME                                                          READY   STATUS              RESTARTS   AGE
calico-kube-controllers-d569cccf-xvphn                        0/1     ContainerCreating   0          2m5s
calico-node-9j2j5                                             1/1     Running             0          2m5s
calico-node-lzq59                                             1/1     Running             0          2m5s
coredns-6d4b75cb6d-fd2sg                                      1/1     Running             0          12d
coredns-6d4b75cb6d-jrvzx                                      1/1     Running             0          12d
etcd-capi-quickstart-control-plane-qx5tx                      1/1     Running             0          12d
kube-apiserver-capi-quickstart-control-plane-qx5tx            1/1     Running             0          12d
kube-controller-manager-capi-quickstart-control-plane-qx5tx   1/1     Running             0          12d
kube-proxy-2tb88                                              1/1     Running             0          12d
kube-proxy-b4sx2                                              1/1     Running             0          12d
kube-scheduler-capi-quickstart-control-plane-qx5tx            1/1     Running             0          12d
openstack-cloud-controller-manager-8vlbv                      1/1     Running             0          17s
```
And the nodes become Ready:
```
root@jitest43:~# kubectl --kubeconfig=./${CLUSTER_NAME}.kubeconfig get nodes
NAME                                  STATUS   ROLES                  AGE   VERSION
capi-quickstart-control-plane-qx5tx   Ready    control-plane,master   12d   v1.23.10
capi-quickstart-md-0-bvdm2            Ready    <none>                 12d   v1.23.10
```
The ProviderID is set:
```
root@jitest43:~# kubectl --kubeconfig=./${CLUSTER_NAME}.kubeconfig describe node capi-quickstart-control-plane-qx5tx | grep ProviderID
ProviderID:    openstack:///c1631c0b-019e-47d6-8d60-aaea3c09a6cb
```
So the workaround is to edit the node with `kubectl edit node` and add the `providerID` line under `spec`:

```yaml
spec:
  podCIDR: 192.168.1.0/24
  podCIDRs:
  - 192.168.1.0/24
  providerID: openstack:///55128e47-7494-4a8b-a3bb-b717c95666f3
```

But I think you still need to check why the OCCM didn't set the node's provider ID; something must be wrong.
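Equivalently, if you prefer a one-liner (node name and instance UUID are placeholders for your own values; note that the field can only be set while it is still empty):

```sh
kubectl patch node <node-name> -p '{"spec":{"providerID":"openstack:///<instance-uuid>"}}'
```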
All 3 errors/warnings you listed in https://github.com/kubernetes/cloud-provider-openstack/issues/2062#issuecomment-1365752096 are not related to the issue; they are common warnings for various setups (e.g. Barbican not set up, etc.) and you can safely ignore them.
Finally, I figured it out. The problem was the hostnames of the nodes in my cluster. I had changed every single node's hostname before creating the cluster, and that was the problem. Apparently, the OCCM was using the node's hostname to check whether it exists on the OpenStack side. Since my servers' hostnames and the VM names in OpenStack are different, the OCCM couldn't find the servers by their newly changed hostnames, so it was kicking them out of the cluster. Thank you for helping me out. @jichenjc @brandond
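For anyone who runs into the same thing, a rough way to spot the mismatch (assuming the standard metadata endpoint and access to OpenStack CLI credentials; the lookup by name only works when the hostname matches the Nova server name):

```sh
# On the node: what the cluster will call this node
hostname

# What OpenStack calls this instance (see the "name" field in the metadata JSON)
curl -s http://169.254.169.254/openstack/2012-08-10/meta_data.json

# From a machine with OpenStack CLI access: this must find the VM for lookup-by-name to succeed
openstack server list --name "$(hostname)"
```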
Glad you figured it out :) With the above info, I will close this issue.
Is this a BUG REPORT or FEATURE REQUEST?:
What happened: I have a Kubernetes cluster with 3 masters and 3 workers. After installing the OpenStack cloud controller manager, whenever a node reboot takes a bit longer or a node is shut down, the node is removed completely, and I can't join another node to the cluster. The problem goes away when I uninstall the OpenStack cloud controller manager.
What you expected to happen: With the OpenStack cloud controller manager installed, shutting a node down for a bit and restarting it should let it automatically rejoin the cluster. New nodes should be able to join the cluster without a problem.
How to reproduce it:
Anything else we need to know?: Installed the OpenStack cloud controller manager using the following command:
Here is the log of the CCM after a node is kicked:
Environment: