crowne closed this issue 1 year ago
accepts the default configs (kubernetes 1.21.9+rke2r1, calico, cloud-provider=none), click create, wait for cluster to be provisioned, navigate to Apps Marketplace/Charts, install vSphere CPI, install vSphere CSI
This is not the expected way to install these charts. You should instead select "rancher-vsphere" as the cloud provider when creating the downstream cluster. This will install the version of the vsphere charts that is bundled with RKE2, and inject the appropriate cluster configuration.
Thanks Brandon, I think I avoided that option because it wasn't clear where to apply the config referred to in the message: "Configure the vSphere Cloud Provider and Storage Provider options in the Add-On Config tab."
I've seen this problem referred to here: 35777.
@SoarinFerret says that it works when the CPI and CSI YAML is added manually. I don't know how to add the YAML manually; should it be pasted into the Additional Manifest field? What is the YAML meant to look like? Is there a working sample somewhere that I could take a look at?
The Add-On Config tab appears as follows:
Yes, in the Additional Manifests section you should provide a HelmChartConfig manifest with information on your vSphere instance:
apiVersion: helm.cattle.io/v1
kind: HelmChartConfig
metadata:
  name: rancher-vsphere-csi
  namespace: kube-system
spec:
  valuesContent: |-
    vCenter:
      host: "vcsa-xxxxx"
      datacenters: "datacenter"
      username: "xxxxx"
      password: "xxxx"
      clusterId: "rke2clutest"
      configSecret:
        generate: true
    storageClass:
      datastoreURL: "ds:///vmfs/volumes/xxxxxxxxxxxxxxxxxxxxxxxx/"
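If it helps, one way to confirm the override actually reached the cluster (a generic check; the name and namespace below simply match the example above) is to look for the HelmChartConfig resource once the node is up:

kubectl -n kube-system get helmchartconfig rancher-vsphere-csi -o yaml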
@brandond Thanks for the help. I added the YAML above, modified it, and added the config for rancher-vsphere-cpi followed by a separator and then the config for rancher-vsphere-csi. However, the installation of the first controller node in the cluster doesn't complete; the logs keep scrolling with the following messages:
[INFO ] provisioning bootstrap node(s) vs2-controller-78d5fb5c46-xlmmz: waiting for cluster agent to be available
[INFO ] non-ready bootstrap machine(s) vs2-controller-78d5fb5c46-xlmmz: waiting for cluster agent to be available and join url to be available on bootstrap node
[INFO ] provisioning bootstrap node(s) vs2-controller-78d5fb5c46-xlmmz: waiting for cluster agent to be available
[INFO ] non-ready bootstrap machine(s) vs2-controller-78d5fb5c46-xlmmz: waiting for cluster agent to be available and join url to be available on bootstrap node
Do you have any hints on how I should try to solve this?
That looks like logs from rancher-system-agent, is that correct? Can you look at the logs for rke2-server and the output of head -n -1 /var/log/pods/kube-system_*/*/*.log?
Yes, the logs above are from the Rancher / Cluster / Provisioning Log screen. I've attached the logs you mentioned; apologies, they are quite verbose. I had left the server running, but I've deleted a lot of the repeated lines. head_logs.txt
I'm not seeing any logs from the vsphere CPI, although the helm chart appears to have been installed successfully. The CSI however is complaining about the vsphere host not being set, which makes me suspect that the config file you set via the HelmChartConfig is perhaps not formatted properly:
==> /var/log/pods/kube-system_vsphere-csi-node-fprtw_12d3fad4-e5df-4c06-b16e-353b09c25a6b/vsphere-csi-node/0.log <==
2022-02-28T15:52:33.169771347Z stderr F {"level":"info","time":"2022-02-28T15:52:33.169404564Z","caller":"config/config.go:373","msg":"Could not stat /etc/cloud/csi-vsphere.conf, reading config params from env","TraceId":"b2ef7e54-9e14-4a3f-ad30-1ae49c6fbc78"}
2022-02-28T15:52:33.169778911Z stderr F {"level":"error","time":"2022-02-28T15:52:33.169429535Z","caller":"config/config.go:263","msg":"no Virtual Center hosts defined","TraceId":"b2ef7e54-9e14-4a3f-ad30-1ae49c6fbc78"
2022-02-28T15:52:33.16980536Z stderr F {"level":"error","time":"2022-02-28T15:52:33.169487748Z","caller":"config/config.go:377","msg":"Failed to get config params from env. Err: no Virtual Center hosts defined","TraceId":"b2ef7e54-9e14-4a3f-ad30-1ae49c6fbc78"
Can you share the output of:
I had to install kubectl manually, the cluster isn't provisioned yet so all kubectl commands respond with
The connection to the server localhost:8080 was refused - did you specify the right host or port?
The root cause of the problem seems to be the first error that you highlighted above, "Could not stat /etc/cloud/csi-vsphere.conf"; it then falls back to reading the config from environment vars, which obviously doesn't work.
I'm not sure why the file is not available, my Additional Manifest looks ok to me:
apiVersion: helm.cattle.io/v1
kind: HelmChartConfig
metadata:
  name: rancher-vsphere-cpi
  namespace: kube-system
spec:
  valuesContent: |-
    vCenter:
      host: "10.99.99.4"
      port: 443
      insecureFlag: "1"
      datacenters: "DC2-Loc"
      username: "vsuser@vsphere.local"
      password: "******"
      credentialsSecret:
        generate: true
---
apiVersion: helm.cattle.io/v1
kind: HelmChartConfig
metadata:
  name: rancher-vsphere-csi
  namespace: kube-system
spec:
  valuesContent: |-
    vCenter:
      host: "10.99.99.4"
      port: 443
      insecureFlag: "1"
      datacenters: "DC2-Loc"
      username: "vsuser@vsphere.local"
      password: "******"
      clusterId: "vs3"
      configSecret:
        generate: true
    storageClass:
      datastoreURL: ds:///vmfs/volumes/5e4bac8b-c0242362-d279-44a8427e1bb5/
@rancher-max does this look correct to you? Any tips to offer?
I had to install kubectl manually, the cluster isn't provisioned yet so all kubectl commands respond with
The connection to the server localhost:8080 was refused - did you specify the right host or port?
You should run:
export CRI_CONFIG_FILE=/var/lib/rancher/rke2/agent/etc/crictl.yaml KUBECONFIG=/etc/rancher/rke2/rke2.yaml PATH=$PATH:/var/lib/rancher/rke2/bin
Once you've done that you should be able to use kubectl on the server.
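With those variables exported, a quick sanity check (generic kubectl, nothing vSphere-specific) should now work from the server node:

kubectl get nodes -o wide
kubectl get pods -n kube-system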
Hmmm one thing to try is possibly adding this additional part of the valuesContent to the csi config (posting larger snippet so it's clear where it goes):
apiVersion: helm.cattle.io/v1
kind: HelmChartConfig
metadata:
  name: rancher-vsphere-csi
  namespace: kube-system
spec:
  valuesContent: |-
    csiController:
      nodeSelector:
        node-role.kubernetes.io/control-plane: "true"
@brandond thanks for the kubectl path settings. @rancher-max I tried with the additional csiController config, but I still get the same problem.
It still looks to me like the root cause is "Could not stat /etc/cloud/csi-vsphere.conf, reading config params from env"
I also tried creating the vsphere.conf and csi-vsphere.conf files in /etc/cloud/ using the cloud-init settings instead of the Additional Manifest, and I could see them when I ssh into the VM, but the logs still give the same message as above. It subsequently occurred to me that the process writing to vsphere-csi-node/0.log is probably running in a container on the VM, so it doesn't see the files which I created with cloud-init.
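One way to confirm that theory (a sketch; the pod name will differ, but the container name vsphere-csi-node matches the log path above) is to list the CSI node pods and check for the file inside one of them, since the container only sees what is mounted into it:

kubectl -n kube-system get pods | grep vsphere-csi-node
kubectl -n kube-system exec <vsphere-csi-node-pod> -c vsphere-csi-node -- ls -l /etc/cloud/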
It looks like the vsphere-config-secret is not being created correctly. It appears to have blank default values.
Could this be because of the Additional Manifest containing a separator?
root@vs6-ctl-a15936a0-bfx78:/home/docker# kubectl get secrets -n kube-system
...
sh.helm.release.v1.rke2-coredns.v1 helm.sh/release.v1 1 153m
statefulset-controller-token-txhpd kubernetes.io/service-account-token 3 154m
ttl-after-finished-controller-token-mhn2w kubernetes.io/service-account-token 3 154m
ttl-controller-token-mgqg4 kubernetes.io/service-account-token 3 154m
vs6-ctl-a15936a0-bfx78.node-password.rke2 Opaque 1 154m
vsphere-config-secret Opaque 1 153m
vsphere-cpi-creds Opaque 2 153m
vsphere-csi-controller-token-lwtj5 kubernetes.io/service-account-token 3 153m
vsphere-csi-node-token-kvfh6 kubernetes.io/service-account-token 3 153m
root@vs6-ctl-a15936a0-bfx78:/home/docker#
root@vs6-ctl-a15936a0-bfx78:/home/docker# kubectl get secret vsphere-config-secret -n kube-system -o jsonpath="{$.data.csi-vsphere\.conf}" | base64 --decode
[Global]
cluster-id = "c-m-kkgmr27x"
user = ""
password = ""
port = "443"
insecure-flag = "1"
[VirtualCenter ""]
datacenters = ""
root@vs6-ctl-a15936a0-bfx78:/home/docker#
root@vs6-ctl-a15936a0-bfx78:/home/docker#
Additional Manifest
apiVersion: helm.cattle.io/v1
kind: HelmChartConfig
metadata:
  name: rancher-vsphere-cpi
  namespace: kube-system
spec:
  valuesContent: |-
    vCenter:
      host: "10.99.99.4"
      port: 443
      insecureFlag: "1"
      datacenters: "DC2-Loc"
      username: "vsuser@vsphere.local"
      password: "******"
      credentialsSecret:
        generate: true
---
apiVersion: helm.cattle.io/v1
kind: HelmChartConfig
metadata:
  name: rancher-vsphere-csi
  namespace: kube-system
spec:
  valuesContent: |-
    vCenter:
      host: "10.99.99.4"
      port: 443
      insecureFlag: "1"
      datacenters: "DC2-Loc"
      username: "vsuser@vsphere.local"
      password: "******"
      clusterId: "vs6"
      configSecret:
        generate: true
    csiController:
      nodeSelector:
        node-role.kubernetes.io/control-plane: "true"
    storageClass:
      datastoreURL: ds:///vmfs/volumes/5e4bac8b-c0242362-d279-44a8427e1bb5/
Can you check the contents of the vsphere-cpi-creds and vsphere-config-secret secrets? It looks like the former is what the CPI chart should create.
I can confirm that vsphere-cpi-creds is OK; I decoded the values and they are correct. vsphere-config-secret looks like it has default and empty values:
vsphere-config-secret Opaque 1 9h
vsphere-cpi-creds Opaque 2 9h
vsphere-csi-controller-token-lwtj5 kubernetes.io/service-account-token 3 9h
vsphere-csi-node-token-kvfh6 kubernetes.io/service-account-token 3 9h
root@vs6-ctl-a15936a0-bfx78:~# kubectl get secret vsphere-cpi-creds -n kube-system -o jsonpath="{$}"
{"apiVersion":"v1","data":{"10.99.99.4.password":"******","10.99.99.4.username":"******"},"kind":"Secret","metadata":{"annotations":{"meta.helm.sh/release-name":"rancher-vsphere-cpi","meta.helm.sh/release-namespace":"kube-system"},"creationTimestamp":"2022-03-03T10:24:26Z","labels":{"app.kubernetes.io/managed-by":"Helm","component":"rancher-vsphere-cpi-cloud-controller-manager","vsphere-cpi-infra":"secret"},"managedFields":[{"apiVersion":"v1","fieldsType":"FieldsV1","fieldsV1":{"f:data":{".":{},"f:10.99.99.4.password":{},"f:10.99.99.4.username":{}},"f:metadata":{"f:annotations":{".":{},"f:meta.helm.sh/release-name":{},"f:meta.helm.sh/release-namespace":{}},"f:labels":{".":{},"f:app.kubernetes.io/managed-by":{},"f:component":{},"f:vsphere-cpi-infra":{}}},"f:type":{}},"manager":"helm","operation":"Update","time":"2022-03-03T10:24:26Z"}],"name":"vsphere-cpi-creds","namespace":"kube-system","resourceVersion":"733","uid":"c5e3cdfb-a4fb-4f58-91d9-11fbf4ac63be"},"type":"Opaque"}root@vs6-ctl-a15936a0-bfx78:~#
root@vs6-ctl-a15936a0-bfx78:~#
root@vs6-ctl-a15936a0-bfx78:~# kubectl get secret vsphere-config-secret -n kube-system -o jsonpath="{$}"
{"apiVersion":"v1","data":{"csi-vsphere.conf":"W0dsb2JhbF0KY2x1c3Rlci1pZCA9ICJjLW0ta2tnbXIyN3giCnVzZXIgPSAiIgpwYXNzd29yZCA9ICIiCnBvcnQgPSAiNDQzIgppbnNlY3VyZS1mbGFnID0gIjEiCgpbVmlydHVhbENlbnRlciAiIl0KZGF0YWNlbnRlcnMgPSAiIgo="},"kind":"Secret","metadata":{"annotations":{"meta.helm.sh/release-name":"rancher-vsphere-csi","meta.helm.sh/release-namespace":"kube-system"},"creationTimestamp":"2022-03-03T10:24:26Z","labels":{"app.kubernetes.io/managed-by":"Helm"},"managedFields":[{"apiVersion":"v1","fieldsType":"FieldsV1","fieldsV1":{"f:data":{".":{},"f:csi-vsphere.conf":{}},"f:metadata":{"f:annotations":{".":{},"f:meta.helm.sh/release-name":{},"f:meta.helm.sh/release-namespace":{}},"f:labels":{".":{},"f:app.kubernetes.io/managed-by":{}}},"f:type":{}},"manager":"helm","operation":"Update","time":"2022-03-03T10:24:26Z"}],"name":"vsphere-config-secret","namespace":"kube-system","resourceVersion":"689","uid":"eb7697fb-7b85-400f-9a8a-70ad89d96612"},"type":"Opaque"}root@vs6-ctl-a15936a0-bfx78:~#
root@vs6-ctl-a15936a0-bfx78:~#
root@vs6-ctl-a15936a0-bfx78:~# kubectl get secret vsphere-config-secret -n kube-system -o jsonpath="{$.data.csi-vsphere\.conf}" | base64 --decode
[Global]
cluster-id = "c-m-kkgmr27x"
user = ""
password = ""
port = "443"
insecure-flag = "1"
[VirtualCenter ""]
datacenters = ""
root@vs6-ctl-a15936a0-bfx78:~#
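For comparison, based on the values in the HelmChartConfig above, I would expect the generated csi-vsphere.conf to look roughly like this (a hand-written sketch, not actual output from the cluster):

[Global]
cluster-id = "vs6"
user = "vsuser@vsphere.local"
password = "******"
port = "443"
insecure-flag = "1"

[VirtualCenter "10.99.99.4"]
datacenters = "DC2-Loc"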
Hmm, I'm not sure where that cluster-id is coming from. Do you still have the chart installed from Rancher Apps, or are you at this point only using the version that RKE2 deploys?
It's only from RKE2; basically, with each round of testing I create a new cluster, select the VMware vSphere option, and proceed from there.
I ran another test where I switched the order of the Additional Manifest entries, this time putting rancher-vsphere-csi before rancher-vsphere-cpi, but I still get the same result.
I finally managed to provision a cluster on vSphere with external storage.
However, I was only able to do so by reverting to RKE1 with an in-tree cloud provider.
So it seems like the RKE2 'Tech Preview' is still a bit buggy.
We are deploying rke2 v1.21.6+rke2r1 with a cloud-provider-name of rancher-vsphere, which deploys the CPI and CSI charts. Then we provide the configuration for both using HelmChartConfig files like this:
---
apiVersion: helm.cattle.io/v1
kind: HelmChartConfig
metadata:
  name: rancher-vsphere-cpi
  namespace: kube-system
spec:
  valuesContent: |-
    vCenter:
      host: "{{ vcenter_host }}"
      datacenters: "{{ vcenter_datacenters }}"
      username: "{{ vcenter_username }}"
      password: "{{ vcenter_password }}"
      credentialsSecret:
        generate: true
    cloudControllerManager:
      nodeSelector:
        node-role.kubernetes.io/control-plane: "true"
---
apiVersion: helm.cattle.io/v1
kind: HelmChartConfig
metadata:
  name: rancher-vsphere-csi
  namespace: kube-system
spec:
  valuesContent: |-
    vCenter:
      host: "{{ vcenter_host }}"
      port: 443
      insecureFlag: '1'
      clusterId: "{{ kubernetes_cluster_name }}"
      datacenters: "{{ vcenter_datacenters }}"
      username: "{{ vcenter_username }}"
      password: "{{ vcenter_password }}"
      configSecret:
        generate: true
    csiController:
      nodeSelector:
        node-role.kubernetes.io/control-plane: "true"
    csiResizer:
      enabled: false
    storageClass:
      enabled: true
      name: vsphere
      isDefault: true
Not sure if this is helpful but it has deployed for us without an issue.
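If it's useful, a couple of generic checks we run after the charts deploy (nothing specific to these values):

kubectl -n kube-system get pods | grep vsphere
kubectl get storageclass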
@mitchellmaler thanks, I appreciate that. It helps a lot: the provisioning finishes successfully with this nodeSelector config.
The only problem that I have now is the vsphere-csi-controller pod has status=CrashLoopBackOff and the logs report:
connection.go:172] Still connecting to unix:///csi/csi.sock
I see that the vsphere-config-secret is still not being created correctly; perhaps this is related.
I've made 2 changes and it's working now. The first change was to the storageClass section of the CSI values:
storageClass:
  enabled: true
  name: vsphere
  isDefault: true
  datastoreURL: ds:///vmfs/volumes/5e4bac8b-c0242362-d279-44a8427e1bb5/
The second change was to edit the vsphere-config-secret manually, filling in the blank values and changing the default cluster-id value. I think that there must be an underlying issue with the helm chart which is meant to create the vsphere-config-secret.
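In case it helps anyone doing the same workaround, here is a minimal sketch of replacing the secret by hand (assuming a local csi-vsphere.conf file with the filled-in values; the secret and workload names below are the ones the chart created in my cluster):

kubectl -n kube-system delete secret vsphere-config-secret
kubectl -n kube-system create secret generic vsphere-config-secret --from-file=csi-vsphere.conf
kubectl -n kube-system rollout restart deployment/vsphere-csi-controller daemonset/vsphere-csi-node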
It looks like I spoke too soon. I managed to get it running on a cluster with a single node with all 3 roles (etcd, control-plane & worker).
However, if I create a pool with only etcd + control-plane then the provisioning doesn't finish, and after the final probe for calico the logs keep polling with the message:
waiting for cluster agent to be available and join url to be available on bootstrap node
If I create a cluster with 2 pools, the controller pool (with all roles) provisions correctly, but the pool with only the worker role is stuck at:
Waiting for agent to check in and apply initial plan
Is there any news on this ticket? I am also trying to migrate to RKE2, which has proven to be a very difficult task.
Hi, I'm trying to set up vSphere CPI/CSI with RKE2. Where can I find documentation or a how-to on the values to write in the CSI and CPI HelmChartConfig manifests? Thanks.
@seb-835 asked where to find documentation or a how-to on the values to write in the CSI and CPI HelmChartConfig manifests. Just wanted to comment on this.
This is still an issue when creating RKE2 clusters on vSphere with Rancher v2.7.
The solution for me was to add the two manifests mentioned in an earlier post. I used the .OVA version of the Ubuntu 22.04 cloud image which I found here: https://cloud-images.ubuntu.com/jammy/current/.
---
apiVersion: helm.cattle.io/v1
kind: HelmChartConfig
metadata:
  name: rancher-vsphere-cpi
  namespace: kube-system
spec:
  valuesContent: |-
    vCenter:
      host: "10.0.20.20"
      datacenters: "Datacenter-A"
      username: "administrator@internal.lab"
      password: "MyvCenterPassw0rd!"
      credentialsSecret:
        generate: true
    cloudControllerManager:
      nodeSelector:
        node-role.kubernetes.io/control-plane: "true"
---
apiVersion: helm.cattle.io/v1
kind: HelmChartConfig
metadata:
  name: rancher-vsphere-csi
  namespace: kube-system
spec:
  valuesContent: |-
    vCenter:
      host: "10.0.20.20"
      port: 443
      insecureFlag: '1'
      clusterId: "cluster81"
      datacenters: "Datacenter-A"
      username: "administrator@internal.lab"
      password: "MyvCenterPassw0rd!"
      configSecret:
        generate: true
    csiController:
      nodeSelector:
        node-role.kubernetes.io/control-plane: "true"
    csiResizer:
      enabled: false
    storageClass:
      enabled: true
      name: vsphere
      isDefault: true
      datastoreURL: "ds:///vmfs/volumes/610996f6-af374ab9-7109-1c697aa3362c/"
@havkros You've added a nodeSelector to place the pods on the control-plane nodes. I don't see anything about taints or tolerations?
I didn't do anything with taints and tolerations. My issue might have been something other than the one in the original post, but pasting these two manifests made things work for me.
I struggled for at least a week to get Rancher to create an RKE2 cluster using vSphere. Nothing seemed to work, so I found this post and basically copied these two manifests and used them as is.
I haven't really looked into it, but it looks like Rancher doesn't pass the CSI / CPI values to the RKE2 installation process? I've always used the form fields in the Add-On Config menu to enter the details about vCenter, login creds, datacenter name etc., but the manifests are needed to get rid of the "agent is waiting to connect" error.
It is supposed to, but I believe there's an open issue on the Rancher side about it only passing them to the CPI chart instead of both CPI and CSI. That would be a Dashboard or Rancher bug though, not RKE2.
Closing because there doesn't appear to be an RKE2 bug at this point and this issue has gone stale. Please open a new issue if a bug is identified.
Environmental Info:
RKE2 Version:
rke2 version v1.21.9+rke2r1 (e48f07f7b208c0e43c537fca006cd5b6ce31b13b) go version go1.16.10b7
Node(s) CPU architecture, OS, and Version:
Linux vs-test1-controller-29377752-vp8h2 5.4.0-99-generic #112-Ubuntu SMP Thu Feb 3 13:50:55 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
Cluster Configuration:
2 controllers, 1 worker. I also tried 3 controllers and 3 workers, with the same issue.
Describe the bug:
After successfully installing vSphere CPI (100.1.0+up1.0.100) via the charts on Rancher, I tried to install vSphere CSI (100.1.0+up2.3.0). The installation hangs with the following logs:
Connected
Filter
helm install --namespace=kube-system --timeout=10m0s --values=/home/shell/helm/values-rancher-vsphere-csi-100.1.0-up2.3.0.yaml --version=100.1.0+up2.3.0 --wait=true vsphere-csi /home/shell/helm/rancher-vsphere-csi-100.1.0-up2.3.0.tgz
creating 13 resource(s)
beginning wait for 13 resources with timeout of 10m0s
Deployment is not ready: kube-system/vsphere-csi-controller. 0 out of 1 expected pods are ready
DaemonSet is not ready: kube-system/vsphere-csi-node. 0 out of 1 expected pods are ready
DaemonSet is not ready: kube-system/vsphere-csi-node. 0 out of 1 expected pods are ready
DaemonSet is not ready: kube-system/vsphere-csi-node. 0 out of 1 expected pods are ready
DaemonSet is not ready: kube-system/vsphere-csi-node. 0 out of 1 expected pods are ready
DaemonSet is not ready: kube-system/vsphere-csi-node. 0 out of 1 expected pods are ready
DaemonSet is not ready: kube-system/vsphere-csi-node. 0 out of 1 expected pods are ready
DaemonSet is not ready: kube-system/vsphere-csi-node. 0 out of 1 expected pods are ready
Deployment is not ready: kube-system/vsphere-csi-controller. 0 out of 1 expected pods are ready
Deployment is not ready: kube-system/vsphere-csi-controller. 0 out of 1 expected pods are ready
Deployment is not ready: kube-system/vsphere-csi-controller. 0 out of 1 expected pods are ready
while showing the following error on the deployment of vsphere-csi-controller
0/3 nodes are available: 1 node(s) didn't match Pod's node affinity/selector, 2 node(s) had taint {node-role.kubernetes.io/control-plane: }, that the pod didn't tolerate.
I removed the taint with kubectl taint nodes controller1 node-role.kubernetes.io/control-plane-
The next error is 0/3 nodes are available: 1 node(s) didn't match Pod's node affinity/selector, 2 node(s) had taint {node-role.kubernetes.io/etcd: }, that the pod didn't tolerate.
I similarly removed this taint with kubectl taint nodes controller1 node-role.kubernetes.io/etcd-
Then the application installs; however, I never got the vSphere storage working and subsequently saw strange behaviour, so I deleted the cluster. The strange behaviour included the PersistentVolumes and StorageClasses menu items disappearing from the menu after I had created a Storage Class for testing. I was concerned about having to remove the standard taints,
especially as the chart defines tolerations for them as per below,
I was also surprised to see the error message complaining about the taint by key only (node-role.kubernetes.io/control-plane, node-role.kubernetes.io/etcd) without the effect (NoSchedule, NoExecute).
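For reference (not part of the repro, just a generic way to inspect the scheduling constraints), the taints the scheduler complains about can be listed per node with:

kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints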
Steps To Reproduce:
I started with 3 manually provisioned VMs on a VMware cluster running ubuntu-20.04 (1 controller, 2 workers). I followed this guide and installed the RKE2 server: https://rancher.com/docs/rancher/v2.5/en/installation/resources/k8s-tutorials/ha-rke2/#1-install-kubernetes-and-set-up-the-rke2-server Next, I followed this guide to install Rancher: https://rancher.com/docs/rancher/v2.0-v2.4/en/installation/install-rancher-on-k8s/
Next I'm going to create an auto-provisioned cluster on vSphere:
Log into Rancher as admin
Create Cloud Credentials for VMware vSphere
Create Cluster, select VMware vSphere, add all the details
Create a control pool and a worker pool
Accept the default configs (kubernetes 1.21.9+rke2r1, calico, cloud-provider=none)
Click create, wait for the cluster to be provisioned
Navigate to Apps Marketplace/Charts
Install vSphere CPI
Install vSphere CSI
Expected behavior:
I expect the vSphere CSI installation to complete successfully without being blocked by the standard taints on the control-plane nodes.
Actual behavior:
The installation didn't complete until the standard etcd and control-plane taints were removed.
Additional context / logs: