Hi @hrbasic, thanks for opening the issue.
Could you please also add information about which version you upgraded from and provide some example YAML files? This would help trace the issue down to the related change.
Hi,
the previous version was v1.6.1.

YAML examples:
```yaml
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: VSphereCluster
metadata:
  name: io-hbasic-1
  namespace: io-hbasic-1-iot1-cluster
  labels:
    k8s.domain.com/iks-cluster: "true"
  annotations:
    node.alpha.kubernetes.io/ttl: "0"
spec:
  controlPlaneEndpoint:
    host: 10.38.29.74
    port: 6443
  identityRef:
    kind: Secret
    name: <secret>
  server: vc-io-anc-01.io-domain.local
  thumbprint: ''
---
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: VSphereMachineTemplate
metadata:
  name: io-hbasic-1-control-plane-v1.26.3
  namespace: io-hbasic-1-iot1-cluster
  labels:
    k8s.domain.com/cluster-name: io-hbasic-1
    k8s.domain.com/control-plane: "true"
    k8s.domain.com/iks-cluster: "true"
  annotations:
    node.alpha.kubernetes.io/ttl: "0"
spec:
  template:
    spec:
      cloneMode: linkedClone
      datacenter: IO
      datastore:
      diskGiB: 55
      folder:
      memoryMiB: 8192
      network:
        devices:
        - dhcp4: false
          dhcp6: false
          gateway4: 10.38.28.1
          networkName: DS01-DPG-388
          nameservers:
          - 169.254.53.53
      numCPUs: 2
      os: Linux
      resourcePool:
      server: vc-io-anc-01.io-domain.local
      storagePolicyName: ""
      template: Rocky8-k8s-capi-2023-04-12-kube-v1.26.3
      thumbprint: ''
---
apiVersion: controlplane.cluster.x-k8s.io/v1beta1
kind: KubeadmControlPlane
metadata:
  name: io-hbasic-1
  namespace: io-hbasic-1-iot1-cluster
  labels:
    k8s.domain.com/control-plane: "true"
    k8s.domain.com/cluster-name: io-hbasic-1
    k8s.domain.com/location: iot1
    k8s.domain.com/version: v1.26.3
    k8s.domain.com/iks-cluster: "true"
  annotations:
    node.alpha.kubernetes.io/ttl: "0"
spec:
  kubeadmConfigSpec:
    clusterConfiguration:
      apiServer:
        certSANs:
        - kubernetes.default.svc.io-hbasic-1.iot1.k8s.io-domain.local
        - localhost
        - 127.0.0.1
        extraArgs:
          cloud-provider: external
          oidc-issuer-url: "https://dex.io-domain.local"
          oidc-client-id: "domain-ad"
          oidc-groups-claim: "groups"
          oidc-ca-file: /etc/ssl/certs/domain-ca.pem
          oidc-username-claim: email
      controllerManager:
        extraArgs:
          cloud-provider: external
          allocate-node-cidrs: "false"
          bind-address: "0.0.0.0"
      scheduler:
        extraArgs:
          bind-address: "0.0.0.0"
      etcd:
        local:
          extraArgs:
            listen-metrics-urls: 'http://0.0.0.0:2381'
    files:
    - content: |
        apiVersion: v1
        kind: Pod
        metadata:
          creationTimestamp: null
          name: kube-vip
          namespace: kube-system
        spec:
          containers:
          - args:
            - manager
            env:
            - name: cp_enable
              value: "true"
            - name: vip_interface
              value: ""
            - name: address
              value: 10.38.29.74
            - name: port
              value: "6443"
            - name: vip_arp
              value: "true"
            - name: vip_leaderelection
              value: "true"
            - name: vip_leaseduration
              value: "15"
            - name: vip_renewdeadline
              value: "10"
            - name: vip_retryperiod
              value: "2"
            image: ghcr.io/kube-vip/kube-vip:v0.5.11
            imagePullPolicy: IfNotPresent
            name: kube-vip
            resources: {}
            securityContext:
              capabilities:
                add:
                - NET_ADMIN
                - NET_RAW
            volumeMounts:
            - mountPath: /etc/kubernetes/admin.conf
              name: kubeconfig
          hostAliases:
          - hostnames:
            - kubernetes
            ip: 127.0.0.1
          hostNetwork: true
          volumes:
          - hostPath:
              path: /etc/kubernetes/admin.conf
              type: FileOrCreate
            name: kubeconfig
        status: {}
      owner: root:root
      path: /etc/kubernetes/manifests/kube-vip.yaml
    - content: |
        -----BEGIN CERTIFICATE-----
        <my_cert>
        -----END CERTIFICATE-----
      owner: root:root
      path: /etc/ssl/certs/domain-ca.pem
    - content: |
        apiVersion: kubelet.config.k8s.io/v1beta1
        kind: KubeletConfiguration
        authentication:
          anonymous:
            enabled: false
          webhook:
            cacheTTL: 2m
            enabled: true
          x509:
            clientCAFile: /etc/kubernetes/pki/ca.crt
        authorization:
          mode: Webhook
          webhook:
            cacheAuthorizedTTL: 5m
            cacheUnauthorizedTTL: 30s
        cgroupDriver: systemd
        clusterDomain: cluster.local
        clusterDNS:
        - 169.254.25.10
        cpuManagerReconcilePeriod: 10s
        evictionPressureTransitionPeriod: 2m
        fileCheckFrequency: 20s
        httpCheckFrequency: 20s
        imageMinimumGCAge: 0s
        nodeStatusUpdateFrequency: 10s
        rotateCertificates: true
        runtimeRequestTimeout: 2m
        shutdownGracePeriod: 60s
        shutdownGracePeriodCriticalPods: 20s
        streamingConnectionIdleTimeout: 4h
        staticPodPath: /etc/kubernetes/manifests
        syncFrequency: 1m
        volumeStatsAggPeriod: 0s
        kubeReserved:
          cpu: 200m
          memory: 512Mi
        serverTLSBootstrap: true
        systemReserved:
          cpu: 300m
          memory: 1400Mi
      owner: root:root
      path: /var/lib/kubelet/kubeletconfiguration0+merge.yaml
    - content: |
        apiVersion: v1
        kind: Pod
        metadata:
          labels:
            component: kube-scheduler
            tier: control-plane
          name: kube-scheduler
          namespace: kube-system
        spec:
          containers:
          - command:
            - kube-scheduler
            - --config=/etc/kubernetes/ibscheduler.conf
            - --authorization-kubeconfig=/etc/kubernetes/scheduler.conf
            - --authentication-kubeconfig=/etc/kubernetes/scheduler.conf
            image: registry.k8s.io/kube-scheduler:v1.26.3
            imagePullPolicy: IfNotPresent
            livenessProbe:
              failureThreshold: 8
              httpGet:
                host: 127.0.0.1
                path: /healthz
                port: 10259
                scheme: HTTPS
              initialDelaySeconds: 10
              periodSeconds: 10
              timeoutSeconds: 15
            name: kube-scheduler
            resources:
              requests:
                cpu: 100m
            startupProbe:
              failureThreshold: 24
              httpGet:
                host: 127.0.0.1
                path: /healthz
                port: 10259
                scheme: HTTPS
              initialDelaySeconds: 10
              periodSeconds: 10
              timeoutSeconds: 15
            volumeMounts:
            - mountPath: /etc/kubernetes/scheduler.conf
              name: kubeconfig
              readOnly: true
            - mountPath: /etc/kubernetes/ibscheduler.conf
              name: ibsched
              readOnly: true
          hostNetwork: true
          priorityClassName: system-node-critical
          securityContext:
            seccompProfile:
              type: RuntimeDefault
          volumes:
          - hostPath:
              path: /etc/kubernetes/scheduler.conf
              type: FileOrCreate
            name: kubeconfig
          - hostPath:
              path: /etc/kubernetes/ibscheduler.conf
              type: FileOrCreate
            name: ibsched
      owner: root:root
      path: /var/lib/kubelet/kube-scheduler0+merge.yaml
    - content: |
        apiVersion: kubescheduler.config.k8s.io/v1
        kind: KubeSchedulerConfiguration
        clientConnection:
          kubeconfig: /etc/kubernetes/scheduler.conf
        profiles:
        - schedulerName: default-scheduler
          pluginConfig:
          - name: PodTopologySpread
            args:
              defaultConstraints:
              - maxSkew: 1
                topologyKey: topology.kubernetes.io/zone
                whenUnsatisfiable: DoNotSchedule
              defaultingType: List
      owner: root:root
      path: /etc/kubernetes/ibscheduler.conf
    - content: |
        ---
        apiVersion: kubeproxy.config.k8s.io/v1alpha1
        kind: KubeProxyConfiguration
        metricsBindAddress: "0.0.0.0:10249"
      owner: root:root
      path: /etc/kubernetes/ib-kube-proxy-conf.yaml
    initConfiguration:
      nodeRegistration:
        criSocket: /var/run/containerd/containerd.sock
        name: '{{ local_hostname }}'
        kubeletExtraArgs:
          cloud-provider: external
      patches:
        directory: /var/lib/kubelet
    joinConfiguration:
      nodeRegistration:
        criSocket: /var/run/containerd/containerd.sock
        name: '{{ local_hostname }}'
        kubeletExtraArgs:
          cloud-provider: external
      patches:
        directory: /var/lib/kubelet
    preKubeadmCommands:
    - hostnamectl set-hostname "{{ local_hostname }}"
    - echo "::1 ipv6-localhost ipv6-loopback localhost6 localhost6.localdomain6" >/etc/hosts
    - echo "127.0.0.1 {{ local_hostname }}.io-domain.local {{ local_hostname }} localhost localhost.localdomain localhost4 localhost4.localdomain4" >>/etc/hosts
    - growpart /dev/sda 2 && pvresize /dev/sda2 && lvresize -r -l +100%FREE /dev/os/lv_var_lib_containerd
    - cat /etc/kubernetes/ib-kube-proxy-conf.yaml >> /run/kubeadm/kubeadm.yaml
    postKubeadmCommands:
    - echo '{"run_list":["recipe[ib_iks_vm]"]}' > /etc/chef/first-boot.json
    - chef-client -j /etc/chef/first-boot.json -E IOT1
    useExperimentalRetryJoin: true
    users:
    - name: capv
      sshAuthorizedKeys:
      - ssh-rsa <PUBLIC KEY> <USER>
      sudo: ALL=(ALL) NOPASSWD:ALL
  machineTemplate:
    infrastructureRef:
      apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
      kind: VSphereMachineTemplate
      name: io-hbasic-1-control-plane-v1.26.3
    metadata:
      labels:
        k8s.domain.com/nodepool: control-plane
        k8s.domain.com/iks-cluster: "true"
      annotations:
        node.alpha.kubernetes.io/ttl: "0"
  replicas: 3
  version: v1.26.3
```
If I label the CP zones and add the field `failureDomainSelector` to the `VSphereCluster` object, then everything works as expected:

```yaml
  failureDomainSelector:
    matchLabels:
      topology.k8s.domain.com/group: gp-cp
  thumbprint: ''
```
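For completeness, the zone labeling then looks roughly like this (a minimal sketch; the zone and failure-domain names are placeholders, only the label key/value is taken from the selector above):

```yaml
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: VSphereDeploymentZone
metadata:
  name: iot1-zone-a                # placeholder name
  labels:
    # failureDomainSelector matches labels on the VSphereDeploymentZone
    topology.k8s.domain.com/group: gp-cp
spec:
  server: vc-io-anc-01.io-domain.local
  failureDomain: iot1-zone-a       # name of the referenced VSphereFailureDomain
  controlPlane: true
  placementConstraint: {}          # resourcePool/folder as appropriate for the environment
```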
If you need more info, let me know.
> FailureDomainSelector is the label selector to use for failure domain selection for the control plane nodes of the cluster. An empty value for the selector includes all the related failure domains.

So there are three different modes for the field `.spec.failureDomainSelector`:

- `failureDomainSelector: nil`: do not use any `VSphereDeploymentZone` objects
- `failureDomainSelector: {}`: use all `VSphereDeploymentZone` objects
- `failureDomainSelector: { "matchLabels": { "foo": "bar" } }`: use only the `VSphereDeploymentZone` objects which match the defined selector

Historical context on when that behaviour was changed: https://github.com/kubernetes-sigs/cluster-api-provider-vsphere/pull/1951#issuecomment-1598531482

TL;DR: it was considered a bug that a nil selector resulted in considering all failure domains. It looks like this was not really highlighted in the release notes, though.
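To illustrate the third mode on a `VSphereCluster`, a minimal sketch (metadata and the `foo: bar` label are placeholders reused from above):

```yaml
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: VSphereCluster
metadata:
  name: example
spec:
  # only VSphereDeploymentZone objects labeled foo=bar are considered;
  # omitting the field entirely (nil) now selects no zones at all
  failureDomainSelector:
    matchLabels:
      foo: bar
```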
Great, thanks for the clarification. I'll close this since it's not considered a bug. But maybe it would be good to highlight this, because if someone upgrades the provider and doesn't update the `VSphereCluster` with a `failureDomainSelector` on existing clusters, the rollout of the `KubeadmControlPlane` could fail.
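For existing clusters, setting the empty selector should restore the previous include-all behaviour, per the documentation quoted above (a minimal sketch; only the added field is relevant):

```yaml
spec:
  failureDomainSelector: {}   # {} includes all related failure domains, matching pre-v1.7 behaviour
```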
Thank you @hrbasic for filing the issue, and sorry for that.
I updated the release notes to highlight this PR as a breaking change and added some information about it to the v1.7.0 release notes, so others may find the information 👍
/kind bug
**What steps did you take and what happened:** I've upgraded the CAPV provider to 1.7.1. After the upgrade I couldn't deploy new clusters due to an issue with the new field `failureDomainSelector`. According to the documentation:

> FailureDomainSelector is the label selector to use for failure domain selection for the control plane nodes of the cluster. An empty value for the selector includes all the related failure domains.

https://doc.crds.dev/github.com/kubernetes-sigs/cluster-api-provider-vsphere/infrastructure.cluster.x-k8s.io/VSphereCluster/v1beta1@v1.7.1

This shouldn't be a breaking change, but if you don't specify `failureDomainSelector` on the `VSphereCluster` object, cloning will fail:

```
unable to get resource pool for "infrastructure.cluster.x-k8s.io/v1beta1, Kind=VSphereVM dev-hbasic-2-iot1-cluster/dev-hbasic-2-lwqjl": no default resource pool found
```
I've checked the machines.cluster.x-k8s.io object, and the Failure Domain field is missing in the spec.

Also, the documentation specifies that `FailureDomainSelector` is the label selector to use for failure domain selection. My understanding was that we need to label the `VSphereFailureDomain`. But if you check the code (https://github.com/kubernetes-sigs/cluster-api-provider-vsphere/blob/release-1.7/controllers/vspherecluster_reconciler.go#L381), the label is checked on the zone, so we need to label the `VSphereDeploymentZone` to make this work.

**What did you expect to happen:** After upgrading CAPV to 1.7.1, we should be able to deploy new clusters without specifying the `failureDomainSelector` field. The documentation for `FailureDomainSelector` should be improved, since it's not clear that the `VSphereDeploymentZone` should be labeled instead of the `VSphereFailureDomain` (see the sketch below).
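A minimal sketch of the distinction (names and the label are placeholders):

```yaml
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: VSphereDeploymentZone
metadata:
  name: zone-a          # placeholder
  labels:
    foo: bar            # failureDomainSelector matches labels here
spec:
  failureDomain: fd-a   # references the VSphereFailureDomain below
---
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: VSphereFailureDomain
metadata:
  name: fd-a            # labels here are NOT consulted by the selector
```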
**Environment:**

- Cluster-api-provider-vsphere version: v1.7.1 (upgraded from v1.6.1)
- Kubernetes version (use `kubectl version`): 1.26.3
- OS (e.g. from `/etc/os-release`): Rocky Linux 8.7 (Green Obsidian)