jenkins-infra / helpdesk


migrate storage from premium to standard for jenkins-infra, jenkins-weekly and jenkins-release #4044

Open · smerle33 opened 2 months ago

smerle33 commented 2 months ago

Service(s)

infra.ci.jenkins.io, release.ci.jenkins.io, weekly.ci.jenkins.io

Summary

As checked with the metrics, standard ZRS HDD will be enough to handle the workload for those 3 controllers. Let's try to save some money.

This will be the occasion to manage the volumes/disks with Terraform and to remove the Datasource annotation from the Helm chart values for the controllers.

We will need to create a new Storage Class (on publick8s and privatek8s).
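
For reference, a minimal sketch of what such a StorageClass could look like on AKS. The class name is hypothetical and the exact skuName is an assumption: "standard ZRS" would map to StandardSSD_ZRS for Azure managed disks, since the plain standard HDD SKU is LRS-only.

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: standard-zrs-retain        # hypothetical name
provisioner: disk.csi.azure.com    # Azure Disk CSI driver
parameters:
  skuName: StandardSSD_ZRS         # assumption: standard SSD, zone-redundant
reclaimPolicy: Retain              # keep the underlying disk if the PV is deleted
allowVolumeExpansion: true
volumeBindingMode: WaitForFirstConsumer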

Sidenote: we will have to handle the bootstrap permissions for the Terraform-managed volumes.
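
A statically provisioned disk comes up owned by root, so one way to handle this bootstrap step is through the pod securityContext. A sketch only, assuming the controller runs as UID/GID 1000 like the migration pod further down this thread:

securityContext:
  runAsUser: 1000
  runAsGroup: 1000
  fsGroup: 1000   # kubelet chowns the mounted volume to this group when attaching it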

Reproduction steps

No response

smerle33 commented 2 months ago

WIP (infra and release)

(Screenshots attached: captures dated 2024-04-11 and 2024-04-15.)

smerle33 commented 2 months ago

WEEKLY.CI first

when the timing is right:

if all goes well, redo for infra.ci/release.ci

dduportal commented 2 months ago

Update:

dduportal commented 1 month ago

Update: on hold until after the 15th of May 2024

smerle33 commented 1 month ago

The aim is to be able to change the disk type without recreating everything next time. We chose to create the PV/PVC/Disk from Terraform instead of just the PVC (not following the documentation, where creating only the PVC would be enough with dynamic provisioning: kubernetes.io/docs/concepts/storage/persistent-volumes/#dynamic). The disk size can be changed in both scenarios.
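
For illustration, the statically provisioned pair that the Terraform code would manage looks roughly like this. A sketch only: the storage class name, size, and disk resource ID are placeholders, while the PV/PVC names follow the ones used later in this thread.

apiVersion: v1
kind: PersistentVolume
metadata:
  name: jenkins-weekly-pv
spec:
  capacity:
    storage: 8Gi                              # placeholder size
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: standard-zrs-retain       # hypothetical class from the sketch above
  csi:
    driver: disk.csi.azure.com
    fsType: ext4
    volumeHandle: /subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.Compute/disks/jenkins-weekly   # Terraform-managed disk ID (placeholder)
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: jenkins-weekly-data
  namespace: jenkins-weekly
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: standard-zrs-retain
  volumeName: jenkins-weekly-pv               # bind explicitly to the pre-created PV
  resources:
    requests:
      storage: 8Gi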

smerle33 commented 3 weeks ago

Current state of the test of the temporary migration pod:

Events:
    Type     Reason                  Age    From                     Message
    Normal   Scheduled               2m11s  default-scheduler        Successfully assigned jenkins-weekly/migrate-volume to aks-arm64small2-30051376-vmss00001i
    Warning  FailedAttachVolume      2m11s  attachdetach-controller  Multi-Attach error for volume "pvc-<redacted>" Volume is already used by pod(s) jenkins-weekly-0
    Normal   SuccessfulAttachVolume  119s   attachdetach-controller  AttachVolume.Attach succeeded for volume "jenkins-weekly-pv"
    Warning  FailedMount             8s     kubelet                  Unable to attach or mount volumes: unmounted volumes=[jenkins-home-source], unattached volumes=[jenkins-home-source], failed to process volumes=[] timed out waiting for the condition

More info: https://medium.com/@golusstyle/demystifying-the-multi-attach-error-for-volume-causes-and-solutions-595a19316a0c

EDIT: we may need to change the current PVC access mode to RWX (ReadWriteMany) to be able to mount it on a second pod for the migration.

smerle33 commented 3 weeks ago

I tried with pod affinity:

affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions: # app.kubernetes.io/instance: jenkins-weekly
          - key: app.kubernetes.io/instance
            operator: In
            values:
            - jenkins-weekly
        topologyKey: app.kubernetes.io/instance   

without any luck

0/8 nodes are available: 1 Insufficient cpu, 2 node(s) didn't match pod affinity rules, 2 node(s) had untolerated taint {CriticalAddonsOnly: true}, 3 node(s) didn't match Pod's node affinity/selector. preemption: 0/8 nodes are available: 1 No preemption victims found for incoming pod, 7 Preemption is not helpful for scheduling..
pod didn't trigger scale-up: 1 node(s) didn't match Pod's node affinity/selector, 1 node(s) didn't match pod affinity rules

will try with node selector directly

dduportal commented 3 weeks ago

Don't forget that, in any case, you need the tolerations to schedule on arm64 nodes such as this one (ref. https://github.com/jenkins-infra/kubernetes-management/blob/899229e1620277d3750ed261417703a073a4736d/config/jenkins_weekly.ci.jenkins.io.yaml#L29-L35), which might explain why pod affinity was necessary but not sufficient.
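
For reference, the tolerations in question are along these lines (mirroring the block used by the migration pod specs below in this thread):

tolerations:
  - key: "kubernetes.io/arch"
    operator: "Equal"
    value: "arm64"
    effect: "NoSchedule"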

smerle33 commented 5 days ago

temporary pod definition:

apiVersion: v1
kind: Pod
metadata:
  name: migrate-volume
  labels:
    name: migrate-volume
  namespace: jenkins-weekly
spec:
  containers:
  - image: debian
    name: migrate-volume-script
    command: ["rsync"]
    args: ["-a", "/var/jenkins_home", "/mnt/"]
    volumeMounts:
    - mountPath: /var/jenkins_home
      name: jenkins-home-source
      readOnly: true
    - mountPath: /mnt
      name: jenkins-home-destination
    resources:
      requests:
        memory: "1Gi"
        cpu: "1000m"
      limits:
        memory: "1Gi"
        cpu: "1000m"
  nodeSelector:
    kubernetes.io/arch: arm64
  tolerations:
    - key: "kubernetes.io/arch"
      operator: "Equal"
      value: "arm64"
      effect: "NoSchedule"
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions: # app.kubernetes.io/instance: jenkins-weekly
          - key: app.kubernetes.io/instance
            operator: In
            values:
            - jenkins-weekly
        topologyKey: app.kubernetes.io/instance
  restartPolicy: Never
  volumes:
  - name: jenkins-home-source
    persistentVolumeClaim:
      claimName: jenkins-weekly
  - name: jenkins-home-destination
    persistentVolumeClaim:
      claimName: jenkins-weekly-data

first try:

second try:

the jenkins-weekly pod stayed on the same node

third try:

The jenkins-weekly pod started on the new node: aks-arm64small2-30051376-vmss00001p/10.245.0.13, but the temporary pod stayed pending with:

0/9 nodes are available: 2 node(s) had untolerated taint {CriticalAddonsOnly:
  true}, 3 node(s) didn't match Pod's node affinity/selector, 4 node(s) didn't
  match pod affinity rules. preemption: 0/9 nodes are available: 9 Preemption is
  not helpful for scheduling..
smerle33 commented 5 days ago

New try:

- manually spawn a new node in the armsmall nodepool
- delete the jenkins-weekly pod to see if it starts again on the new node
- start the temporary pod once weekly is on the new node.
smerle33 commented 5 days ago

still the same behavior:

  0/9 nodes are available: 2 node(s) had untolerated taint {CriticalAddonsOnly:
  true}, 3 node(s) didn't match Pod's node affinity/selector, 4 node(s) didn't
  match pod affinity rules. preemption: 0/9 nodes are available: 9 Preemption is
  not helpful for scheduling..
smerle33 commented 5 days ago

Found it: the affinity was WRONG. The topologyKey needs to be kubernetes.io/hostname, not app.kubernetes.io/instance, as it refers to the node:

  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app.kubernetes.io/instance
            operator: In
            values:
            - jenkins-weekly
        topologyKey: kubernetes.io/hostname
smerle33 commented 4 days ago

So the process to follow is:

smerle33 commented 4 days ago

When removing the resources request, the pod can be scheduled on the existing node of jenkins-weekly; no need to use a brand new node 🎉

  resources:
    requests:
      memory: "1Gi"
      cpu: "1000m"
    limits:
      memory: "1Gi"
      cpu: "1000m"
smerle33 commented 3 days ago

Final version of the migration pod for jenkins-weekly; it will rsync the data:

apiVersion: v1
kind: Pod
metadata:
  name: migrate-volume
  labels:
    name: migrate-volume
  namespace: jenkins-weekly
spec:
  securityContext:
    runAsUser: 1000
    runAsGroup: 1000
    fsGroup: 1000
  containers:
  - image: jenkinsciinfra/packaging:latest
    name: migrate-volume-script
    command: ["rsync"]
    args: ["-a", "--delete", "/var/jenkins_home", "/mnt/"] #will create the destination folder within /mnt/
    volumeMounts:
    - mountPath: /var/jenkins_home
      name: jenkins-home-source
      readOnly: true
    - mountPath: /mnt
      name: jenkins-home-destination
  nodeSelector:
    kubernetes.io/arch: arm64
  tolerations:
    - key: "kubernetes.io/arch"
      operator: "Equal"
      value: "arm64"
      effect: "NoSchedule"
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app.kubernetes.io/instance
            operator: In
            values:
            - jenkins-weekly
        topologyKey: kubernetes.io/hostname
  restartPolicy: Never
  volumes:
  - name: jenkins-home-source
    persistentVolumeClaim:
      claimName: jenkins-weekly
  - name: jenkins-home-destination
    persistentVolumeClaim:
      claimName: jenkins-weekly-data
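
Once the rsync has completed and the controller is stopped, the chart values would presumably be switched to the new claim, along these lines (a sketch assuming the official Jenkins Helm chart's persistence values):

persistence:
  existingClaim: jenkins-weekly-data   # the PVC populated by the migration pod above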