Open smerle33 opened 2 months ago
WIP (infra and release):
- weekly.ci first
- when time is OK and if all went well, redo for infra.ci/release.ci

Update: on hold until after the 15th of May 2024
The aim is to be able to change the disk type without recreating everything next time. We chose to create the PV/PVC/disk from Terraform instead of only the PVC (not following the dynamic provisioning path, where creating only the PVC would be enough: kubernetes.io/docs/concepts/storage/persistent-volumes/#dynamic). The disk size can be changed in both scenarios.
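As a reference for that choice, here is a minimal sketch of what the statically provisioned PV/PVC pair could look like on the Kubernetes side once Terraform has created the Azure disk; the storage class name, sizes and the disk resource ID are illustrative placeholders:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: jenkins-weekly-pv
spec:
  capacity:
    storage: 8Gi # illustrative size
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain # keep the Azure disk if the PV is deleted
  storageClassName: managed-csi # illustrative, see the new storage class discussed below
  csi:
    driver: disk.csi.azure.com
    # placeholder for the resource ID of the Terraform-managed Azure disk
    volumeHandle: /subscriptions/<subscription>/resourceGroups/<rg>/providers/Microsoft.Compute/disks/jenkins-weekly
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: jenkins-weekly-data
  namespace: jenkins-weekly
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: managed-csi
  volumeName: jenkins-weekly-pv # bind explicitly to the pre-created PV
  resources:
    requests:
      storage: 8Gi # illustrative size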
Current state of the temporary migration pod test:
Events:
Type Reason Age From Message
Normal Scheduled 2m11s default-scheduler Successfully assigned jenkins-weekly/migrate-volume to aks-arm64small2-30051376-vmss00001i
Warning FailedAttachVolume 2m11s attachdetach-controller Multi-Attach error for volume "pvc-<redacted>" Volume is already used by pod(s) jenkins-weekly-0
Normal SuccessfulAttachVolume 119s attachdetach-controller AttachVolume.Attach succeeded for volume "jenkins-weekly-pv"
Warning FailedMount 8s kubelet Unable to attach or mount volumes: unmounted volumes=[jenkins-home-source], unattached volumes=[jenkins-home-source], failed to process volumes=[] timed out waiting for the condition
EDIT: we may need to change the current PVC access mode to RWX (ReadWriteMany) to be able to mount it on a second pod for the migration.
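For the record, a minimal sketch of what that change would mean on the claim (illustrative only: whether the Azure disk storage behind it actually supports ReadWriteMany, and whether the access mode can be changed on an existing claim, would still need to be checked):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: jenkins-weekly
  namespace: jenkins-weekly
spec:
  accessModes:
    - ReadWriteMany # instead of ReadWriteOnce
  resources:
    requests:
      storage: 8Gi # illustrative size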
I tried with pod affinity:
affinity:
  podAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions: # app.kubernetes.io/instance: jenkins-weekly
            - key: app.kubernetes.io/instance
              operator: In
              values:
                - jenkins-weekly
        topologyKey: app.kubernetes.io/instance
without any luck
0/8 nodes are available: 1 Insufficient cpu, 2 node(s) didn't match pod affinity rules, 2 node(s) had untolerated taint {CriticalAddonsOnly: true}, 3 node(s) didn't match Pod's node affinity/selector. preemption: 0/8 nodes are available: 1 No preemption victims found for incoming pod, 7 Preemption is not helpful for scheduling..
pod didn't trigger scale-up: 1 node(s) didn't match Pod's node affinity/selector, 1 node(s) didn't match pod affinity rules
will try with node selector directly
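For reference, pinning the temporary pod directly onto the node currently hosting the controller (node name taken from the events above) would look something like this, instead of the podAffinity block:

nodeSelector:
  kubernetes.io/hostname: aks-arm64small2-30051376-vmss00001i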
Don't forget that, in any case, you need the tolerations to schedule on arm64 nodes such as this one (ref. https://github.com/jenkins-infra/kubernetes-management/blob/899229e1620277d3750ed261417703a073a4736d/config/jenkins_weekly.ci.jenkins.io.yaml#L29-L35), which might explain why the pod affinity was necessary but not sufficient.
temporary pod definition:
apiVersion: v1
kind: Pod
metadata:
  name: migrate-volume
  labels:
    name: migrate-volume
  namespace: jenkins-weekly
spec:
  containers:
    - image: debian
      name: migrate-volume-script
      command: ["rsync"]
      args: ["-a", "/var/jenkins_home", "/mnt/"]
      volumeMounts:
        - mountPath: /var/jenkins_home
          name: jenkins-home-source
          readOnly: true
        - mountPath: /mnt
          name: jenkins-home-destination
      resources:
        requests:
          memory: "1Gi"
          cpu: "1000m"
        limits:
          memory: "1Gi"
          cpu: "1000m"
  nodeSelector:
    kubernetes.io/arch: arm64
  tolerations:
    - key: "kubernetes.io/arch"
      operator: "Equal"
      value: "arm64"
      effect: "NoSchedule"
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchExpressions: # app.kubernetes.io/instance: jenkins-weekly
              - key: app.kubernetes.io/instance
                operator: In
                values:
                  - jenkins-weekly
          topologyKey: app.kubernetes.io/instance
  restartPolicy: Never
  volumes:
    - name: jenkins-home-source
      persistentVolumeClaim:
        claimName: jenkins-weekly
    - name: jenkins-home-destination
      persistentVolumeClaim:
        claimName: jenkins-weekly-data
first try:
0/8 nodes are available: 1 Insufficient cpu, 2 node(s) had untolerated taint {CriticalAddonsOnly:
true}, 5 node(s) didn't match Pod's node affinity/selector. preemption: 0/8 nodes are available: 1 No preemption victims found for incoming pod, 7 Preemption is not helpful for scheduling..
so I thought the "1 Insufficient cpu" was the node hosting jenkins-weekly
second try: the jenkins-weekly pod stayed on the same node
third try: the jenkins-weekly pod started on the new node (aks-arm64small2-30051376-vmss00001p/10.245.0.13), but the temporary pod stayed pending with:
0/9 nodes are available: 2 node(s) had untolerated taint {CriticalAddonsOnly:
true}, 3 node(s) didn't match Pod's node affinity/selector, 4 node(s) didn't
match pod affinity rules. preemption: 0/9 nodes are available: 9 Preemption is
not helpful for scheduling..
New try:
- spawn manually a new node in the armsmall nodepool
- delete the jenkins-weekly pod to see if it starts again on the new node
- start the temporary pod once weekly is on the new node
still the same behavior:
0/9 nodes are available: 2 node(s) had untolerated taint {CriticalAddonsOnly:
true}, 3 node(s) didn't match Pod's node affinity/selector, 4 node(s) didn't
match pod affinity rules. preemption: 0/9 nodes are available: 9 Preemption is
not helpful for scheduling..
Found it: the affinity was WRONG. The topologyKey needs to be kubernetes.io/hostname, not app.kubernetes.io/instance, as it refers to the node:
affinity:
  podAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
            - key: app.kubernetes.io/instance
              operator: In
              values:
                - jenkins-weekly
        topologyKey: kubernetes.io/hostname
So the process to follow is:
- delete the jenkins-weekly pod and check that it re-spawns on the new node
- start the temporary pod, which will spawn on the new node because of the affinity

When removing the following resources request, it can be spawned on the existing node of jenkins-weekly, no need to use a brand new node 🎉
resources:
  requests:
    memory: "1Gi"
    cpu: "1000m"
  limits:
    memory: "1Gi"
    cpu: "1000m"
Final version of the migration pod for jenkins-weekly, which will rsync the data from the jenkins-weekly PVC to the jenkins-weekly-data PVC:
apiVersion: v1
kind: Pod
metadata:
  name: migrate-volume
  labels:
    name: migrate-volume
  namespace: jenkins-weekly
spec:
  securityContext:
    runAsUser: 1000
    runAsGroup: 1000
    fsGroup: 1000
  containers:
    - image: jenkinsciinfra/packaging:latest
      name: migrate-volume-script
      command: ["rsync"]
      args: ["-a", "--delete", "/var/jenkins_home", "/mnt/"] # will create the destination folder within /mnt/
      volumeMounts:
        - mountPath: /var/jenkins_home
          name: jenkins-home-source
          readOnly: true
        - mountPath: /mnt
          name: jenkins-home-destination
  nodeSelector:
    kubernetes.io/arch: arm64
  tolerations:
    - key: "kubernetes.io/arch"
      operator: "Equal"
      value: "arm64"
      effect: "NoSchedule"
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchExpressions:
              - key: app.kubernetes.io/instance
                operator: In
                values:
                  - jenkins-weekly
          topologyKey: kubernetes.io/hostname
  restartPolicy: Never
  volumes:
    - name: jenkins-home-source
      persistentVolumeClaim:
        claimName: jenkins-weekly
    - name: jenkins-home-destination
      persistentVolumeClaim:
        claimName: jenkins-weekly-data
Service(s)
infra.ci.jenkins.io, release.ci.jenkins.io, weekly.ci.jenkins.io
Summary
As checked with the metrics, a standard ZRS HDD will be enough to handle the workload for those 3 controllers; let's try to save some money.
This will be the occasion to handle the volumes/disks with Terraform and to remove the dataSource annotation from the Helm chart values for the controllers.
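Assuming the controllers keep using the official jenkins/jenkins Helm chart persistence values, that would roughly mean pointing to the pre-created claim and dropping the dataSource block, along these lines (illustrative):

persistence:
  enabled: true
  existingClaim: jenkins-weekly-data # PVC backed by the Terraform-managed disk
  # dataSource: ...  # removed: the volume is pre-created, no clone/restore via annotation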
We will need to create a new Storage Class (on publick8s and privatek8s).
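A minimal sketch of such a storage class, assuming the Azure Disk CSI driver; the exact name and skuName for the standard/ZRS tier we end up choosing still have to be confirmed:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: standard-zrs-retain # illustrative name
provisioner: disk.csi.azure.com
parameters:
  skuName: StandardSSD_ZRS # assumption, to be confirmed against the metrics/cost target
reclaimPolicy: Retain
allowVolumeExpansion: true # so the disk size can still be changed later
volumeBindingMode: WaitForFirstConsumer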
Sidenote: we will have to handle the bootstrap permissions for the Terraform-managed volumes.
Reproduction steps
No response