kubernetes / cloud-provider-openstack


[occm] node not labeled #1790

Closed jfpucheu closed 2 years ago

jfpucheu commented 2 years ago

/kind bug

/kind feature

What happened:

Before, with the in-tree OpenStack cloud provider in kubelet, my nodes were labeled with region and zone like this:

    topology.kubernetes.io/region=eu-west-0
    topology.kubernetes.io/zone=eu-west-0a

After migrating to OCCM I only get this label:

topology.cinder.csi.openstack.org/zone=eu-west-0a

The metadata is not empty:

curl http://169.254.169.254/openstack/latest/meta_data.json
{ ..."availability_zone": "eu-west-0a", .....}

What you expected to happen:

Nodes labeled with the topology from the OpenStack APIs, like:

    topology.kubernetes.io/region=eu-west-0
    topology.kubernetes.io/zone=eu-west-0a
    topology.cinder.csi.openstack.org/zone=eu-west-0a

How to reproduce it:

Deploy OCCM with the basic config (no changes) and Cinder CSI, then check the node labels using: kubectl describe node mynode

Anything else we need to know?: I don't know if this is really an issue or a feature request, because this part is not well documented, but the feature was very convenient with the in-tree kubelet cloud provider: all node topology was labeled automatically. My OCCM has no issue contacting the OpenStack API, I can see responses in the logs:

I0211 21:22:53.591819       1 instances.go:432] X-Xss-Protection: 1; mode=block;
I0211 21:22:53.592045       1 instances.go:432] OpenStack Response Body: {
I0211 21:22:53.592068       1 instances.go:432]   "servers": [
I0211 21:22:53.592079       1 instances.go:432]     {
I0211 21:22:53.592090       1 instances.go:432]       "OS-DCF:diskConfig": "MANUAL",
I0211 21:22:53.592101       1 instances.go:432]       "OS-EXT-AZ:availability_zone": "eu-west-0a",
I0211 21:22:53.592111       1 instances.go:432]       "OS-EXT-SRV-ATTR:host": "pod3.eu-west-0a",
I0211 21:22:53.592122       1 instances.go:432]       "OS-EXT-SRV-ATTR:hypervisor_hostname": "nova005@3",
I0211 21:22:53.592130       1 instances.go:432]       "OS-EXT-SRV-ATTR:instance_name": "instance-00649f3f",
I0211 21:22:53.592138       1 instances.go:432]       "OS-EXT-STS:power_state": 1,
I0211 21:22:53.592149       1 instances.go:432]       "OS-EXT-STS:task_state": null,
I0211 21:22:53.592155       1 instances.go:432]       "OS-EXT-STS:vm_state": "active",
I0211 21:22:53.592163       1 instances.go:432]       "OS-SRV-USG:launched_at": "2022-02-11T20:28:41.000000",
I0211 21:22:53.592172       1 instances.go:432]       "OS-SRV-USG:terminated_at": null,
I0211 21:22:53.592179       1 instances.go:432]       "accessIPv4": "",
I0211 21:22:53.592187       1 instances.go:432]       "accessIPv6": "",
I0211 21:22:53.592197       1 instances.go:432]       "addresses": {
I0211 21:22:53.592204       1 instances.go:432]         "ba495a5a-6a0d-4164-bdf6-ecdc410e62ba": [
I0211 21:22:53.592210       1 instances.go:432]           {
I0211 21:22:53.592216       1 instances.go:432]             "OS-EXT-IPS-MAC:mac_addr": "fa:16:3e:12:a4:ed",
I0211 21:22:53.592221       1 instances.go:432]             "OS-EXT-IPS:type": "fixed",
I0211 21:22:53.592227       1 instances.go:432]             "addr": "10.235.76.53",
I0211 21:22:53.592233       1 instances.go:432]             "version": 4
I0211 21:22:53.592243       1 instances.go:432]           }
I0211 21:22:53.592249       1 instances.go:432]         ]
I0211 21:22:53.592267       1 instances.go:432]       },

Environment:

Thanks for the help, Jeff

jichenjc commented 2 years ago

topology.cinder.csi.openstack.org/zone=eu-west-0a is provided by Cinder CSI, so deploying OCCM won't affect it: https://github.com/kubernetes/cloud-provider-openstack/blob/master/docs/cinder-csi-plugin/features.md#topology

I assume you also want the topology.kubernetes.io/region=eu-west-0 label; it should come from https://github.com/kubernetes/cloud-provider-openstack/blob/master/pkg/openstack/openstack.go#L366. I am not sure whether your configuration file contains such a region definition? Or maybe you can check whether you can see the log and the info in it (increase the log level to 4 or higher).
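
For reference, a minimal runnable sketch of how those two values are assumed to fit together (buildZone and the Zone struct here are illustrative stand-ins, not the actual OCCM source): the zone comes from the server's availability zone reported by Nova, while the region is read from the region option in the [Global] section of the cloud config.

package main

import "fmt"

// Zone mirrors the shape of cloudprovider.Zone for this sketch only.
type Zone struct {
	FailureDomain string // surfaces as topology.kubernetes.io/zone
	Region        string // surfaces as topology.kubernetes.io/region
}

// buildZone is a hypothetical helper for illustration: the availability zone
// comes from the Nova server (OS-EXT-AZ:availability_zone), while the region
// comes from the region option in the [Global] section of the cloud config.
func buildZone(serverAZ, configuredRegion string) Zone {
	return Zone{FailureDomain: serverAZ, Region: configuredRegion}
}

func main() {
	z := buildZone("eu-west-0a", "eu-west-0")
	fmt.Printf("zone=%s region=%s\n", z.FailureDomain, z.Region)
}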

jfpucheu commented 2 years ago

Hello,

I have the region setup:

[Global]
username=********
password=********
auth-url=********
tenant-id=******
domain-id=********
region=eu-west-0

[LoadBalancer]
enabled=false
create-monitor = no

[Metadata]
search-order = metadataService,configDrive

No logs with --v=6: kubectl logs openstack-cloud-controller-manager-tm6vg -n kube-system | grep "Current zone" returns nothing.

it seems the function is never called.

jichenjc commented 2 years ago

it seems the function is never called.

Weird, I need to check my env soon, thanks~

jichenjc commented 2 years ago

I have this in my dev env

topology.cinder.csi.openstack.org/zone=nova
topology.kubernetes.io/region=RegionOne
topology.kubernetes.io/zone=nova

then I unlabeled

kubectl label node n1 topology.kubernetes.io/region-
kubectl label node n1 topology.kubernetes.io/zone-

and after a while I can see the labels added again:

$ kubectl describe node
Name:               n1
Roles:              control-plane,master
......
                    topology.cinder.csi.openstack.org/zone=nova
                    topology.kubernetes.io/region=RegionOne
                    topology.kubernetes.io/zone=nova

and I find this:

ubuntu@n1:~$ kubectl logs openstack-cloud-controller-manager-7sknr -n kube-system  | grep label
I0216 03:33:06.619486       1 labels.go:56] Updated labels map[topology.kubernetes.io/region:RegionOne topology.kubernetes.io/zone:nova] to Node n1

can you check whether you have this log in your env? thanks

jfpucheu commented 2 years ago

Hello,

I still don't have any log about that.

Is it possible that I don't get the metadata because external load balancers are not supported?


I0216 07:39:44.056331       1 openstack.go:310] openstack.LoadBalancer() called
E0216 07:39:44.056364       1 openstack.go:326] Failed to create an OpenStack LoadBalancer client: failed to find load-balancer v2 endpoint for region eu-west-0: No suitable endpoint could be found in the service catalog.
E0216 07:39:44.056381       1 core.go:93] Failed to start service controller: the cloud provider does not support external load balancers
W0216 07:39:44.056390       1 controllermanager.go:286] Skipping "service"

Thanks jeff

jichenjc commented 2 years ago

The above error should have no impact: it only says that no LB service is defined and that you won't be able to create LoadBalancer Services, but it should not affect OCCM running.

https://github.com/kubernetes/cloud-provider/blob/master/controllers/node/node_controller.go#L270 is the code that sets the labels, from https://github.com/kubernetes/cloud-provider/blob/master/controllers/node/node_controller.go#L53

You said you are using 1.23, so you should be up to date already. I have no idea why the reconcile is not working; maybe adding some logs and doing some debugging would be helpful here, or @lingxiankong @ramineni might know more.

lingxiankong commented 2 years ago

Restart the openstack-cloud-controller-manager with --v=6, search the log for the lines starting with Initializing node and Successfully initialized node, and please paste all the logs in between.

Maellooou commented 2 years ago

Hello,

After further investigation, I found the issue is linked to this code in cloud-provider/node_controller.go (https://github.com/kubernetes/cloud-provider/blob/master/controllers/node/node_controller.go -> l. 384):

cloudTaint := getCloudTaint(curNode.Spec.Taints)
if cloudTaint == nil {
	klog.Info("LOG MORE - err syncNode cloudTaint")
	// Node object received from event had the cloud taint but was outdated,
	// the node has actually already been initialized, so this sync event can be ignored.
	return nil
}

I added more logging and I can see this log:

I0217 15:12:35.047625       1 instances.go:156] NodeAddressesByProviderID(openstack:///004f593d-7291-40f4-9075-eedcfa25f2c1) => [{InternalIP xx.xx.xx.xx}]
I0217 15:12:35.219042       1 node_controller.go:412] LOG MORE - err syncNode cloudTaint
I0217 15:12:36.317768       1 round_trippers.go:553] GET https://10.254.0.1:443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/cloud-controller-manager?timeout=5s 200 OK in 5 milliseconds
I0217 15:12:36.325470       1 round_trippers.go:553] PUT https://10.254.0.1:443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/cloud-controller-manager?timeout=5s 200 OK in 7 milliseconds
I0217 15:12:36.325568       1 leaderelection.go:278] successfully renewed lease kube-system/cloud-controller-manager

If I comment out those lines, I no longer have the issue and the initialization works:

I0217 15:06:23.189333       1 node_controller.go:419] Initializing node kdevnodeaz0a01 with cloud provider
I0217 15:06:24.051665       1 node_controller.go:522] Adding node label from cloud provider: beta.kubernetes.io/instance-type=m2.xlarge.8
I0217 15:06:24.051675       1 node_controller.go:523] Adding node label from cloud provider: node.kubernetes.io/instance-type=m2.xlarge.8
I0217 15:06:24.051683       1 node_controller.go:534] Adding node label from cloud provider: failure-domain.beta.kubernetes.io/zone=eu-west-0a
I0217 15:06:24.051693       1 node_controller.go:535] Adding node label from cloud provider: topology.kubernetes.io/zone=eu-west-0a
I0217 15:06:24.051700       1 node_controller.go:545] Adding node label from cloud provider: failure-domain.beta.kubernetes.io/region=eu-west-0
I0217 15:06:24.051705       1 node_controller.go:546] Adding node label from cloud provider: topology.kubernetes.io/region=eu-west-0
I0217 15:06:24.065453       1 node_controller.go:484] Successfully initialized node kdevnodeaz0a01 with cloud provider

But I don't understand why we end up in this failure case...

Do you have any idea?

Maellooou commented 2 years ago

To complete my previous comment: once the labels have been added, if I delete them, they are automatically re-added, as @jichenjc mentioned before.

jichenjc commented 2 years ago

The code you pointed out actually uses https://github.com/kubernetes/cloud-provider/blob/0429a85a45b2424c1508ea289fea6d1e8f15d30f/api/well_known_taints.go#L24

which means that if the node is not initialized (it still carries that taint), it will be initialized; since the node is actually already initialized, the taint is gone (that's why the node taint is <none> in my env):

$ kubectl get nodes -A
.....
Taints:             <none>

So I doubt whether it's the root cause: in my test env I delete the label and it is recreated again without removing the code you mentioned. For now, can you remove the label and see whether OCCM recreates it for you?
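
For reference, a minimal runnable sketch of that taint check (Taint here is a simplified stand-in for corev1.Taint, and the surrounding code is illustrative rather than the exact upstream source): the node controller only initializes, and therefore only labels, nodes that still carry the node.cloudprovider.kubernetes.io/uninitialized taint.

package main

import "fmt"

// Taint is a simplified stand-in for corev1.Taint, for this sketch only.
type Taint struct {
	Key    string
	Effect string
}

// TaintExternalCloudProvider is the well-known taint the node controller looks for.
const TaintExternalCloudProvider = "node.cloudprovider.kubernetes.io/uninitialized"

// getCloudTaint sketches the check from node_controller.go: it returns the
// uninitialized taint if the node still carries it, nil otherwise.
func getCloudTaint(taints []Taint) *Taint {
	for i := range taints {
		if taints[i].Key == TaintExternalCloudProvider {
			return &taints[i]
		}
	}
	return nil
}

func main() {
	// A node created before OCCM was deployed typically has no uninitialized
	// taint, so syncNode returns early and the topology labels are never added.
	existing := []Taint{}
	fmt.Println("existing node initialized:", getCloudTaint(existing) != nil) // false

	// A node registered by kubelet with --cloud-provider=external carries the
	// taint and is picked up (initialized, labeled, then untainted).
	fresh := []Taint{{Key: TaintExternalCloudProvider, Effect: "NoSchedule"}}
	fmt.Println("fresh node initialized:", getCloudTaint(fresh) != nil) // true
}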

Maellooou commented 2 years ago

To be clear, does that mean you can't deploy OCCM on an already existing cluster? My nodes are already created; this is why the init is not done...

jichenjc commented 2 years ago

Not sure, I have never migrated from the in-tree to the external cloud provider (I always used external directly).

There is a video created by @lingxiankong; maybe he has more info:

https://github.com/kubernetes/cloud-provider-openstack/blob/master/docs/openstack-cloud-controller-manager/using-openstack-cloud-controller-manager.md#migrating-from-in-tree-openstack-cloud-provider-to-external-openstack-cloud-controller-manager
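
A hedged note on the behaviour above, assuming the standard external cloud-provider flow: kubelet only adds the node.cloudprovider.kubernetes.io/uninitialized=true:NoSchedule taint when it registers a node while running with --cloud-provider=external, so nodes created before the migration never carry it and the node controller skips them. Re-applying that taint to an existing node (or re-registering the node) should make OCCM initialize and label it, which would match the logs Maellooou got after bypassing the taint check.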

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue or PR as fresh with /remove-lifecycle stale
- Close this issue or PR with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue or PR as fresh with /remove-lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Reopen this issue or PR with /reopen
- Mark this issue or PR as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

k8s-ci-robot commented 2 years ago

@k8s-triage-robot: Closing this issue.

In response to [this](https://github.com/kubernetes/cloud-provider-openstack/issues/1790#issuecomment-1187019870):

> The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
>
> This bot triages issues and PRs according to the following rules:
> - After 90d of inactivity, `lifecycle/stale` is applied
> - After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
> - After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
>
> You can:
> - Reopen this issue or PR with `/reopen`
> - Mark this issue or PR as fresh with `/remove-lifecycle rotten`
> - Offer to help out with [Issue Triage][1]
>
> Please send feedback to sig-contributor-experience at [kubernetes/community](https://github.com/kubernetes/community).
>
> /close
>
> [1]: https://www.kubernetes.dev/docs/guide/issue-triage/

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.