To be clear, it is actively deleting nodes from the cluster when the instance is stopped or restarted. This seems like a bad thing, especially when the nodes in question are part of the control plane and may require manual intervention to rejoin the cluster.
This is not CPO OCCM behavior; it's the upstream cloud-provider node lifecycle controller doing this: https://github.com/kubernetes/cloud-provider/blob/master/controllers/nodelifecycle/node_lifecycle_controller.go#L164

From the code, the function that failed is `ensureNodeExistsByProviderID`, and this aligns with your log:

```
E1220 04:41:48.918351 1 node_controller.go:244] Error getting instance metadata for node addresses: error fetching node by provider ID: ProviderID "" didn't match expected format "openstack:///InstanceID", and error by node name: failed to find object
```
Usually the node has a ProviderID like `openstack:///xxxxx-xxxx`, where `xxxx-xxx` is the UUID of the instance. The ID actually comes from https://github.com/kubernetes/cloud-provider-openstack/blob/master/pkg/openstack/instances.go#L515
So could you please check whether you are using the metadata service or the config drive? I am wondering whether this function obtains no data and returns "" instead. And what about the other instances (nodes)? `kubectl describe node` might give us more info about the other masters you have.
We should add some logging to indicate the error occurring here (a follow-up), but the root-cause analysis should focus on whether you can get the instance ID here and, if not, why...
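For reference, a rough way to check which metadata source is available on a node (a sketch assuming the default metadata endpoint and the usual `config-2` label for the config drive; adjust for your environment):

```sh
# Is the metadata service reachable from the instance?
curl -s http://169.254.169.254/openstack/latest/meta_data.json

# Is a config drive attached? (it is normally labelled config-2)
ls -l /dev/disk/by-label/config-2
```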
I was told that we use the metadata service. I attached the output of `kubectl describe nodes` and the OCCM log. The cluster we are testing with has only the CPO OCCM deployed on it. nodes_occm.zip
Thanks for the info.
I saw `I1221 04:38:22.221093 1 flags.go:64] FLAG: --v="2"` in the log. Can you help make it `--v=5`?

We have the following code, so with the metadata service I would guess some error will be reported:
```go
func getFromMetadataService(metadataVersion string) (*Metadata, error) {
	// Try to get JSON from metadata server.
	metadataURL := getMetadataURL(metadataVersion)
	klog.V(4).Infof("Attempting to fetch metadata from %s", metadataURL)
	resp, err := http.Get(metadataURL)
	if err != nil {
		return nil, fmt.Errorf("error fetching %s: %v", metadataURL, err)
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		err = fmt.Errorf("unexpected status code when reading metadata from %s: %s", metadataURL, resp.Status)
		return nil, err
	}

	return parseMetadata(resp.Body)
}
```
> I was told that we use the metadata service.
Note the comment at https://github.com/kubernetes/cloud-provider-openstack/blob/297a3fcb55c3ddd3f9cb4ac1f1bb3208f7f5e3bf/pkg/openstack/instances.go#L372-L375
This would seem to indicate that this is the expected behavior if only using metadata service. You should be using something that has access to info for ALL nodes, not just the node that the CCM is running on.
Thanks. I don't have much experience working with OpenStack. How do I configure the CCM to use a different service instead of the metadata service? @jichenjc Is that something we have to configure on the OpenStack side?
Check https://docs.openstack.org/nova/latest/user/metadata.html and select config drive instead; the metadata will then be created on your disk at a pre-defined location.
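As a rough illustration only (image, flavor, and network names here are placeholders, not values from this issue), an instance can be booted with a config drive attached via the standard Nova option:

```sh
# Boot a server with a config drive; metadata is then available on the attached drive
openstack server create --image <image> --flavor <flavor> \
  --network <network> --config-drive True <server-name>
```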
As far as I can tell, neither the config drive nor the instance metadata service would allow a node to retrieve information about other nodes. It would need to call out to the OpenStack compute service to do that. It seems like it is using `github.com/gophercloud/gophercloud/openstack/compute/v2/servers` to do this. Can you confirm that your nodes have access to list servers?
Config drive will only be read from local files: https://github.com/kubernetes/cloud-provider-openstack/blob/master/pkg/openstack/instances.go#L524
Same for instance metadata though; it only returns data for the calling instance. So I don't see how either of these could work here.
> As far as I can tell, neither the config drive nor the instance metadata service would allow a node to retrieve information about other nodes. It would need to call out to the OpenStack compute service to do that. It seems like it is using `github.com/gophercloud/gophercloud/openstack/compute/v2/servers` to do this. Can you confirm that your nodes have access to list servers?
I am using OpenStack Keystone admin credentials for this, so it should be able to list servers.
Do you all have a suggestion for any workaround for this?
> Can you help make it `--v=5`?

1) Set `--v=5` to get more logs, to confirm whether or not it's a metadata issue. 2) Switch to config drive per the above comments and give it another try.
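For example (assuming the OCCM runs as the standard `openstack-cloud-controller-manager` DaemonSet in `kube-system`; adjust if it was deployed differently, e.g. through Helm values):

```sh
# Edit the DaemonSet and change the container argument --v=2 to --v=5
kubectl -n kube-system edit daemonset openstack-cloud-controller-manager
```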
Here is the log with the verbosity parameter set to `--v=5`: occm.log
I can't have config drive configured until next week. I will post an update then.
The logs suggest that none of your nodes have a ProviderID set. Did you bring this cluster up without the CCM deployed and then add it later? I'm curious how it ended up in this state. Do you see a ProviderID set in the node yaml?
OK, you are using 1.23, but it should work.

From the log:

```
I1222 08:31:51.824250 1 instances.go:131] NodeAddressesByProviderID () called
I1222 08:31:51.824313 1 instances.go:116] NodeAddresses(master3) called
```

This indicates there is no provider ID set at all. I don't know what happened in your environment, as `Node.Spec.ProviderID` is "" and that leads to the subsequent error; I'm not sure whether it's because the metadata service is not reachable, leading to the empty value.

As a workaround, set `openstack:///` as the `Node.Spec.ProviderID` and see whether it works.
> The logs suggest that none of your nodes have a ProviderID set. Did you bring this cluster up without the CCM deployed and then add it later? I'm curious how it ended up in this state. Do you see a ProviderID set in the node yaml?
The way I am testing is that
I don't find a ProviderID in the nodes' YAML. How do I set the ProviderID for my nodes?
> OK, you are using 1.23, but it should work.
>
> From the log:
>
> ```
> I1222 08:31:51.824250 1 instances.go:131] NodeAddressesByProviderID () called
> I1222 08:31:51.824313 1 instances.go:116] NodeAddresses(master3) called
> ```
>
> This indicates there is no provider ID set at all. I don't know what happened in your environment, as `Node.Spec.ProviderID` is "" and that leads to the subsequent error; I'm not sure whether it's because the metadata service is not reachable, leading to the empty value.
>
> As a workaround, set `openstack:///` as the `Node.Spec.ProviderID` and see whether it works.
I had been testing with 1.24 until today; the issue is still the same with 1.23. Also, when the OCCM is running on the cluster, I can't join new master nodes. A new master can join only after I uninstall the OCCM.
How do I set `openstack:///` as the `Node.Spec.ProviderID`? I did not find any environment variable related to the OCCM or the provider in the OCCM pod.
The ProviderID should be set when you deploy the cluster with the cloud-provider set to "external". The nodes will block with an "uninitialized" taint until the cloud provider chart is deployed, at which point the openstack cloud-controller manager will come up and add the correct labels and providerIds to the nodes. If you are skipping this step - not setting the cloud provider to external and deploying the OpenStack cloud controller manager during initial cluster load - then that would explain why things are not working right.
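A quick way to check both of these on a node (the node name is a placeholder):

```sh
# The uninitialized taint should disappear and a ProviderID should appear once the CCM initializes the node
kubectl describe node <node-name> | grep -E 'Taints|ProviderID'
```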
I set up another RKE2 cluster with the "cloud-provider-name: external" option. All the masters had the `node.cloudprovider.kubernetes.io/uninitialized=true:NoSchedule` taint, but when I turn off one of the master nodes, the issue still occurs. The OCCM still could not get the ID of any of the masters, as you can see in the log.
occm.log
I will post an update here after I configure OpenStack with a config drive.
> All the masters had node.cloudprovider.kubernetes.io/uninitialized=true:NoSchedule taints
That indicates that the cloud controller manager isn't coming up. When it does, that taint will be removed.
Configuring the config drive did not help either. I attached the latest OCCM log. There are some lines in the log with errors; I am not sure whether my problem is related to them.

```
I1227 09:06:55.911937 1 leaderelection.go:352] lock is held by worker1_44ce5b9c-2868-4b70-8760-b17c245d0af5 and has not yet expired
I1227 09:06:55.911958 1 leaderelection.go:253] failed to acquire lease kube-system/cloud-controller-manager
```

I don't understand these lines because I don't have any cloud-controller-manager resources in the kube-system namespace.

```
W1227 09:06:54.433015 1 openstack.go:325] Failed to create an OpenStack Secret client: unable to initialize keymanager client for region RegionOne: No suitable endpoint could be found in the service catalog.
```
Apparently we don't have the key manager (Barbican) enabled on our OpenStack; is it required? From this link, it seems that it is not needed.

```
W1227 09:06:54.494328 1 openstack.go:445] Error initialising Routes support: router-id not set in cloud provider config
```

I did not find anywhere to set this router-id in the Helm chart.
Also, the metadata service seems to be working fine. I can run this command on my nodes in the Kubernetes cluster and it returns an appropriate response:

```
curl http://169.254.169.254/openstack/2012-08-10/meta_data.json
```
> All the masters had node.cloudprovider.kubernetes.io/uninitialized=true:NoSchedule taints
>
> That indicates that the cloud controller manager isn't coming up. When it does, that taint will be removed.

The taints never get removed in my cluster.
> I1227 09:06:55.911937 1 leaderelection.go:352] lock is held by worker1_44ce5b9c-2868-4b70-8760-b17c245d0af5 and has not yet expired
> I1227 09:06:55.911958 1 leaderelection.go:253] failed to acquire lease kube-system/cloud-controller-manager
>
> I don't understand these lines because I don't have any cloud-controller-manager resources in the kube-system namespace.
Are you sure? Did you run `kubectl get lease -n kube-system`?
This message indicates that you're running multiple replicas of the CCM. Do you perhaps have another CCM (possibly not even the openstack CCM) running? This would include the RKE2 default CCM, if you haven't disabled it.
> This message indicates that you're running multiple replicas of the CCM. Do you perhaps have another CCM (possibly not even the openstack CCM) running? This would include the RKE2 default CCM, if you haven't disabled it.
I was installing and deleting the OCCM with different configurations, and this was a leftover lease from a previous installation of the OCCM. Uninstalling the OCCM Helm chart apparently does not automatically delete the lease. I disabled the RKE2 default CCM.
I ran a DevStack and tested the OCCM on it. The problem was exactly the same.
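For the record, the leftover lease (named `cloud-controller-manager`, per the log above) can be listed and removed with:

```sh
kubectl -n kube-system get lease
kubectl -n kube-system delete lease cloud-controller-manager
```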
> As a workaround, set `openstack:///` as the `Node.Spec.ProviderID` and see whether it works.

Can you show me how and where I can set it? Thank you.
I tried DevStack as well, and it works fine (I used Cluster API, and it's the same as other k8s solutions since it's based on kubeadm), so other DevStack-related k8s creation processes should behave the same.
I got 2 nodes:
```
+--------------------------------------+-------------------------------------+--------+------------+-------------+------------------------------------------------------------------------+
| ID                                   | Name                                | Status | Task State | Power State | Networks                                                               |
+--------------------------------------+-------------------------------------+--------+------------+-------------+------------------------------------------------------------------------+
| c1631c0b-019e-47d6-8d60-aaea3c09a6cb | capi-quickstart-control-plane-qx5tx | ACTIVE | -          | Running     | k8s-clusterapi-cluster-default-capi-quickstart=10.6.0.162, 172.24.4.65 |
| 55128e47-7494-4a8b-a3bb-b717c95666f3 | capi-quickstart-md-0-bvdm2          | ACTIVE | -          | Running     | k8s-clusterapi-cluster-default-capi-quickstart=10.6.0.149
```
They are not Ready yet:
```
# kubectl --kubeconfig=./${CLUSTER_NAME}.kubeconfig get nodes
NAME                                  STATUS     ROLES                  AGE   VERSION
capi-quickstart-control-plane-qx5tx   NotReady   control-plane,master   12d   v1.23.10
capi-quickstart-md-0-bvdm2            NotReady   <none>                 12d   v1.23.10
```
Apply the latest manifest files from cloud-provider-openstack (OCCM):
```
root@jitest43:~# kubectl --kubeconfig=./${CLUSTER_NAME}.kubeconfig apply -f https://raw.githubusercontent.com/kubernetes/cloud-provider-openstack/master/manifests/controller-manager/cloud-controller-manager-roles.yaml
clusterrole.rbac.authorization.k8s.io/system:cloud-controller-manager created
clusterrole.rbac.authorization.k8s.io/system:cloud-node-controller created
root@jitest43:~# kubectl --kubeconfig=./${CLUSTER_NAME}.kubeconfig apply -f https://raw.githubusercontent.com/kubernetes/cloud-provider-openstack/master/manifests/controller-manager/cloud-controller-manager-role-bindings.yaml
clusterrolebinding.rbac.authorization.k8s.io/system:cloud-node-controller created
clusterrolebinding.rbac.authorization.k8s.io/system:cloud-controller-manager created
root@jitest43:~# kubectl --kubeconfig=./${CLUSTER_NAME}.kubeconfig apply -f https://raw.githubusercontent.com/kubernetes/cloud-provider-openstack/master/manifests/controller-manager/openstack-cloud-controller-manager-ds.yaml
serviceaccount/cloud-controller-manager created
daemonset.apps/openstack-cloud-controller-manager created
```
We can see the OCCM pod being created:
```
root@jitest43:~# kubectl --kubeconfig=./${CLUSTER_NAME}.kubeconfig get pod -n kube-system
NAME                                                          READY   STATUS              RESTARTS   AGE
calico-kube-controllers-d569cccf-xvphn                        0/1     Pending             0          115s
calico-node-9j2j5                                             1/1     Running             0          115s
calico-node-lzq59                                             1/1     Running             0          115s
coredns-6d4b75cb6d-fd2sg                                      0/1     Pending             0          12d
coredns-6d4b75cb6d-jrvzx                                      0/1     Pending             0          12d
etcd-capi-quickstart-control-plane-qx5tx                      1/1     Running             0          12d
kube-apiserver-capi-quickstart-control-plane-qx5tx            1/1     Running             0          12d
kube-controller-manager-capi-quickstart-control-plane-qx5tx   1/1     Running             0          12d
kube-proxy-2tb88                                              1/1     Running             0          12d
kube-proxy-b4sx2                                              1/1     Running             0          12d
kube-scheduler-capi-quickstart-control-plane-qx5tx            1/1     Running             0          12d
openstack-cloud-controller-manager-8vlbv                      0/1     ContainerCreating   0          7s
```
With the OCCM running, other pods switch from Pending to Running:
```
root@jitest43:~# kubectl --kubeconfig=./${CLUSTER_NAME}.kubeconfig get pod -n kube-system
NAME                                                          READY   STATUS              RESTARTS   AGE
calico-kube-controllers-d569cccf-xvphn                        0/1     ContainerCreating   0          2m5s
calico-node-9j2j5                                             1/1     Running             0          2m5s
calico-node-lzq59                                             1/1     Running             0          2m5s
coredns-6d4b75cb6d-fd2sg                                      1/1     Running             0          12d
coredns-6d4b75cb6d-jrvzx                                      1/1     Running             0          12d
etcd-capi-quickstart-control-plane-qx5tx                      1/1     Running             0          12d
kube-apiserver-capi-quickstart-control-plane-qx5tx            1/1     Running             0          12d
kube-controller-manager-capi-quickstart-control-plane-qx5tx   1/1     Running             0          12d
kube-proxy-2tb88                                              1/1     Running             0          12d
kube-proxy-b4sx2                                              1/1     Running             0          12d
kube-scheduler-capi-quickstart-control-plane-qx5tx            1/1     Running             0          12d
openstack-cloud-controller-manager-8vlbv                      1/1     Running             0          17s
```
And the nodes become Ready:
```
root@jitest43:~# kubectl --kubeconfig=./${CLUSTER_NAME}.kubeconfig get nodes
NAME                                  STATUS   ROLES                  AGE   VERSION
capi-quickstart-control-plane-qx5tx   Ready    control-plane,master   12d   v1.23.10
capi-quickstart-md-0-bvdm2            Ready    <none>                 12d   v1.23.10
```
The ProviderID is set:
```
root@jitest43:~# kubectl --kubeconfig=./${CLUSTER_NAME}.kubeconfig describe node capi-quickstart-control-plane-qx5tx | grep ProviderID
ProviderID:    openstack:///c1631c0b-019e-47d6-8d60-aaea3c09a6cb
```
So the workaround is to edit the node with `kubectl edit node` and add the `providerID` line under `spec`:

```yaml
spec:
  podCIDR: 192.168.1.0/24
  podCIDRs:
  - 192.168.1.0/24
  providerID: openstack:///55128e47-7494-4a8b-a3bb-b717c95666f3
```

But I think you still need to check why the OCCM didn't set the node's provider ID; something must be wrong.
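Equivalently, if you prefer a one-liner (node name and instance UUID are placeholders for your own values; note that the field can only be set while it is still empty):

```sh
kubectl patch node <node-name> -p '{"spec":{"providerID":"openstack:///<instance-uuid>"}}'
```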
All 3 errors/warnings you listed in https://github.com/kubernetes/cloud-provider-openstack/issues/2062#issuecomment-1365752096 are not related to the issue; they are common warnings for various setups (e.g. Barbican not set up, etc.) and you can safely ignore them.
Finally, I figured it out. The problem was the hostnames of the nodes in my cluster. I had changed every single node's hostname before creating the cluster, and that was the problem. Apparently, the OCCM was using the node's hostname to check whether it exists on the OpenStack side. Since my servers' hostnames and the VM names in OpenStack are different, the OCCM couldn't find the servers by their newly changed hostnames, so it was kicking them out of the cluster. Thank you for helping me out. @jichenjc @brandond
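For anyone who runs into the same thing, a rough way to spot the mismatch (assuming the standard metadata endpoint and access to OpenStack CLI credentials; the lookup by name only works when the hostname matches the Nova server name):

```sh
# On the node: what the cluster will call this node
hostname

# What OpenStack calls this instance (see the "name" field in the metadata JSON)
curl -s http://169.254.169.254/openstack/2012-08-10/meta_data.json

# From a machine with OpenStack CLI access: this must find the VM for lookup-by-name to succeed
openstack server list --name "$(hostname)"
```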
Glad you figured it out :) With the above info, I will close this issue.
Is this a BUG REPORT or FEATURE REQUEST?:
What happened: I have a Kubernetes cluster with 3 masters and 3 workers. After installing the OpenStack cloud controller manager, whenever a node reboot takes a bit longer or a node is shut down, the node is removed completely, and I can't join another node to the cluster. The problem goes away when I uninstall the OpenStack cloud controller manager.
What you expected to happen: With the OpenStack cloud controller manager installed, shutting a node down for a bit and restarting it should let it automatically rejoin the cluster. New nodes should be able to join the cluster without a problem.
How to reproduce it:
Anything else we need to know?: Installed the OpenStack cloud controller manager using the following command:
Here is the log of the CCM after a node is kicked:
Environment: