Closed parogui closed 1 year ago
Thanks for opening this. About this issue: when initializing a node, cloud-provider-vsphere will first try to find the providerID of each node by searching by node name (related code). Then it will try to find the VM by searching by DNS name, and it only returns the first result from the list.
I noticed the node names you use in the cluster are of the form master0, master1. I wonder whether they are within the same vSphere; if so, it could be that CPI finds a different node with the same node name or DNS name in the same vSphere.
In CAPV we use a distinct node name for each cluster node, and each node also has a distinct DNS name, so clusters won't collide on node names. I wonder how different VMs could have the same name in an OpenShift cluster in vSphere? Or are they using different VM names but the same DNS name? Could you share those VMs' information? Do nodes in different clusters have the same DNS name? You are using a single VC, right?
The log is missing some of the initialization parts, so I can't verify that the duplicate DNS name is the root cause. Could you create a new cluster, install CPI with the log level set to 5, and then send the log to me? I would expect to see logs from the code below:
https://github.com/kubernetes/cloud-provider-vsphere/blob/master/pkg/cloudprovider/vsphere/instances.go#L98 https://github.com/kubernetes/cloud-provider/blob/master/controllers/node/node_controller.go#L415
That way we can see whether, during initialization, CPI is returning the wrong provider ID for the corresponding node name.
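The first-match behaviour described above can be sketched with a simplified model. This is only an illustration of the failure mode, not the actual cloud-provider-vsphere lookup code; the `vmRecord` type and the sample names (taken from the folder layout discussed later in this thread) are made up:

```go
package main

import (
	"fmt"
	"strings"
)

// vmRecord is an illustrative stand-in for a VM as CPI might see it
// during name-based lookup.
type vmRecord struct {
	Folder  string // vSphere folder, e.g. "int-ocp"
	Name    string // VM name, e.g. "master0_int-ocp"
	DNSName string // guest FQDN, e.g. "master0.int-ocp.my.org.name"
}

// findByNodeName mimics the first-match behaviour: it compares the
// kubelet-registered node name against the guest's short hostname (the
// unqualified part of the DNS name) and returns the first hit, even when
// several VMs match.
func findByNodeName(vms []vmRecord, nodeName string) *vmRecord {
	for i := range vms {
		short := vms[i].DNSName
		if dot := strings.IndexByte(short, '.'); dot >= 0 {
			short = short[:dot]
		}
		if short == nodeName {
			return &vms[i] // first match wins; other matches are ignored
		}
	}
	return nil
}

func main() {
	vms := []vmRecord{
		{"int-ocp", "master0_int-ocp", "master0.int-ocp.my.org.name"},
		{"tst-ocp", "master0_tst-ocp", "master0.tst-ocp.my.org.name"},
	}
	// Both clusters register a node named "master0"; the lookup can only
	// return one of them, so one cluster may get the other cluster's VM.
	got := findByNodeName(vms, "master0")
	fmt.Println(got.Folder, got.Name)
}
```

With the two sample records above, the lookup always returns the int-ocp VM regardless of which cluster asked, which is the collision being described.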
Hi, @lubronzhan thanks for your update.
Our folder structure in VC looks like this (e.g. for the clusters int-ocp and tst-ocp; we have 11 clusters in total):
/dc-openshift_tst/vm/int-ocp/master0_int-ocp
/dc-openshift_tst/vm/int-ocp/master1_int-ocp
/dc-openshift_tst/vm/int-ocp/master2_int-ocp
/dc-openshift_tst/vm/int-ocp/worker0_int-ocp
/dc-openshift_tst/vm/int-ocp/worker1_int-ocp
...
/dc-openshift_tst/vm/int-ocp/worker(n)_int-ocp
/dc-openshift_tst/vm/tst-ocp/master0_tst-ocp
/dc-openshift_tst/vm/tst-ocp/master1_tst-ocp
/dc-openshift_tst/vm/tst-ocp/master2_tst-ocp
/dc-openshift_tst/vm/tst-ocp/worker0_tst-ocp
/dc-openshift_tst/vm/tst-ocp/worker1_tst-ocp
...
/dc-openshift_tst/vm/tst-ocp-acpr/worker(n)_tst-ocp
The nodes of each cluster have the following hostnames:
master0
master1
master2
worker0
worker1
...
worker(n)
The DNS records look like this:
master0.int-ocp.my.org.name
worker0.int-ocp.my.org.name
...
master0.tst-ocp.my.org.name
worker0.tst-ocp.my.org.name
All the nodes are on the same vSphere but in different folders. They may share hostname (master(n), worker(n)...) but not FQDN.
Does that sound like something that may be causing the reported issue?
Hi @parogui. Unfortunately, it looks like CPI uses the hostname as the node identifier at the very beginning. Is it possible to give the VMs distinct hostnames in OpenShift? If you check the kubelet log, it probably says it tried to register the node with this name, and that's why CPI is reconciling using this name.
Hi Lubron, thanks for your answer.
Doesn't it look up the full FQDN? It makes sense to keep hostnames simple across clusters; the infra is easier to maintain when the names are recognizable.
Hi @parogui. At the beginning, kubelet registers the node, and in your setup the node name is the hostname. CPI can only use this name to find the corresponding VM the first time. CPI can't get the FQDN from the node that kubelet registered; unless kubelet uses the full FQDN as the node name, CPI can't find the correct VM. So it's up to kubelet to provide a distinct identifier.
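The ambiguity being described can be shown with a small sketch: a short hostname matches nodes in several clusters, while an FQDN matches exactly one. The FQDNs are the ones from the DNS records shown earlier; the `matches` helper is made up for illustration and is not CPI's code:

```go
package main

import (
	"fmt"
	"strings"
)

// matches returns every candidate FQDN that a given node name could refer
// to, treating both the full FQDN and its short (unqualified) hostname as
// acceptable matches.
func matches(nodeName string, fqdns []string) []string {
	var out []string
	for _, f := range fqdns {
		short := f
		if i := strings.IndexByte(f, '.'); i >= 0 {
			short = f[:i]
		}
		if f == nodeName || short == nodeName {
			out = append(out, f)
		}
	}
	return out
}

func main() {
	fqdns := []string{
		"master0.int-ocp.my.org.name",
		"master0.tst-ocp.my.org.name",
	}
	// Registering with the short hostname is ambiguous across clusters.
	fmt.Println(len(matches("master0", fqdns))) // 2 candidates
	// Registering with the FQDN identifies exactly one VM.
	fmt.Println(len(matches("master0.tst-ocp.my.org.name", fqdns))) // 1 candidate
}
```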
Hi @lubronzhan. Thanks for the update.
I'm checking the Kubernetes docs, and it looks like it is the CPI that determines what is used for the hostname field (--cloud-provider flag):
https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet/
Is there any configuration applied from the CPI side to determine the hostname?
Thanks!
Hi @parogui. CPI sets the hostname field on the Node, and that hostname is fetched from the VM; it doesn't mean CPI sets the hostname of the VM, it just fetches the hostName from the VM once it has located the exact VM. You can see that here: https://github.com/kubernetes/cloud-provider-vsphere/blob/master/pkg/cloudprovider/vsphere/nodemanager.go#L257-L262.
But before it sets the hostname on the Node, it needs to locate the VM. As you can see, CPI uses the node name to locate the VM at initialization, before it has the ProviderUUID. We didn't assume the node name is the hostname, because we need to support multi-VC deployments: the hostname could be the same on different IaaS with different domain names, but the DNS name will be distinct.
status:
  addresses:
  - address: tkg-mgmt-vc-2m4tr-9gfp6
    type: Hostname
  - address: 10.180.205.188
    type: InternalIP
  - address: 10.180.205.188
    type: ExternalIP
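The flow above can be sketched in a few lines: once the exact VM is located, its guest hostname and IP are copied into the node's address list, as in the node status shown. The `guestInfo` type here is an illustrative stand-in for what CPI reads from vSphere, not the project's actual types:

```go
package main

import "fmt"

// nodeAddress mirrors the address entries in the node status above.
type nodeAddress struct {
	Type    string
	Address string
}

// guestInfo is a made-up stand-in for the guest data CPI reads from the
// located VM (hostName plus the guest IP).
type guestInfo struct {
	HostName string
	IP       string
}

// addressesFromVM copies the guest's hostName and IP into the node's
// address list, matching the shape of the status snippet above.
func addressesFromVM(g guestInfo) []nodeAddress {
	return []nodeAddress{
		{Type: "Hostname", Address: g.HostName},
		{Type: "InternalIP", Address: g.IP},
		{Type: "ExternalIP", Address: g.IP},
	}
}

func main() {
	g := guestInfo{HostName: "tkg-mgmt-vc-2m4tr-9gfp6", IP: "10.180.205.188"}
	for _, a := range addressesFromVM(g) {
		fmt.Printf("- address: %s\n  type: %s\n", a.Address, a.Type)
	}
}
```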
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
Mark this issue as fresh with /remove-lifecycle stale
Close this issue with /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
Mark this issue as fresh with /remove-lifecycle rotten
Close this issue with /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues according to the following rules:
After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
Reopen this issue with /reopen
Mark this issue as fresh with /remove-lifecycle rotten
Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
What happened?
After installing the CSI driver, we noticed CPI discovering nodes outside the cluster's (OpenShift in this case) VM folder.
If the nodes are manually deleted, they are re-discovered.
The only way we were able to work around the issue was by reducing the permissions of the vSphere user so it only has access to its own cluster, which forces us to maintain a per-cluster vSphere user with permissions restricted to its own folder.
What did you expect to happen?
We'd expect CPI discovery to honor the folder configuration in the vSphere CSI driver ConfigMap, preventing it from discovering nodes that are not supposed to be present.
How can we reproduce it (as minimally and precisely as possible)?
Install the vSphere CSI driver on OpenShift 4.x in a vSphere environment where more than one cluster co-exists in different folders.
Anything else we need to know (please consider providing level 4 or above logs of CPI)?
Node discovery runs every 5 minutes and re-adds the nodes that were removed because they do not belong to the cluster.
Kubernetes version
Cloud provider or hardware configuration
OS version
Kernel (e.g. uname -a)
Install tools
Container runtime (CRI) and version (if applicable)
Related plugins (CNI, CSI, ...) and versions (if applicable)
Others