jenkins-infra / helpdesk


migrate storage from premium to standard for jenkins-infra, jenkins-weekly and jenkins-release #4044

Open · smerle33 opened 2 months ago

smerle33 commented 2 months ago

Service(s)

infra.ci.jenkins.io, release.ci.jenkins.io, weekly.ci.jenkins.io

Summary

As checked with the metrics, standard ZRS HDD will be enough to handle the workload for those 3 controllers. Let's try to save some money.

This will be the occasion to manage the volumes/disks with Terraform and to remove the Datasource annotation from the Helm chart values for the controllers.

We will need to create a new Storage Class (on publick8s and privatek8s).
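
For reference, a minimal sketch of what such a StorageClass could look like on AKS. The class name is hypothetical and the exact skuName is an assumption: "standard ZRS" would map to StandardSSD_ZRS for Azure managed disks, since the plain standard HDD SKU is LRS-only.

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: standard-zrs-retain        # hypothetical name
provisioner: disk.csi.azure.com    # Azure Disk CSI driver
parameters:
  skuName: StandardSSD_ZRS         # assumption: standard SSD, zone-redundant
reclaimPolicy: Retain              # keep the underlying disk if the PV is deleted
allowVolumeExpansion: true
volumeBindingMode: WaitForFirstConsumer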

Sidenote: we will have to handle the bootstrap permissions for the Terraform-managed volumes.
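
A statically provisioned disk comes up owned by root, so one way to handle this bootstrap step is through the pod securityContext. A sketch only, assuming the controller runs as UID/GID 1000 like the migration pod further down this thread:

securityContext:
  runAsUser: 1000
  runAsGroup: 1000
  fsGroup: 1000   # kubelet chowns the mounted volume to this group when attaching it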

Reproduction steps

No response

smerle33 commented 2 months ago

WIP (infra and release)

(Screenshots attached: captures dated 2024-04-11 and 2024-04-15.)

smerle33 commented 2 months ago

WEEKLY.CI first

when the timing is right:

if all goes well, redo for infra.ci/release.ci

dduportal commented 2 months ago

Update:

dduportal commented 1 month ago

Update: on hold until after the 15th of May 2024

smerle33 commented 1 month ago

The aim is to be able to change the disk type without recreating everything next time. We chose to create the PV/PVC/Disk from Terraform instead of just the PVC (not following the documentation, where creating only the PVC would be enough with dynamic provisioning: kubernetes.io/docs/concepts/storage/persistent-volumes/#dynamic). The disk size can be changed in both scenarios.
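
For illustration, the statically provisioned pair that the Terraform code would manage looks roughly like this. A sketch only: the storage class name, size, and disk resource ID are placeholders, while the PV/PVC names follow the ones used later in this thread.

apiVersion: v1
kind: PersistentVolume
metadata:
  name: jenkins-weekly-pv
spec:
  capacity:
    storage: 8Gi                              # placeholder size
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: standard-zrs-retain       # hypothetical class from the sketch above
  csi:
    driver: disk.csi.azure.com
    fsType: ext4
    volumeHandle: /subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.Compute/disks/jenkins-weekly   # Terraform-managed disk ID (placeholder)
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: jenkins-weekly-data
  namespace: jenkins-weekly
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: standard-zrs-retain
  volumeName: jenkins-weekly-pv               # bind explicitly to the pre-created PV
  resources:
    requests:
      storage: 8Gi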

smerle33 commented 3 weeks ago

Current state of the test of the temporary migration pod:

Events:
    Type     Reason                  Age    From                     Message
    Normal   Scheduled               2m11s  default-scheduler        Successfully assigned jenkins-weekly/migrate-volume to aks-arm64small2-30051376-vmss00001i
    Warning  FailedAttachVolume      2m11s  attachdetach-controller  Multi-Attach error for volume "pvc-<redacted>" Volume is already used by pod(s) jenkins-weekly-0
    Normal   SuccessfulAttachVolume  119s   attachdetach-controller  AttachVolume.Attach succeeded for volume "jenkins-weekly-pv"
    Warning  FailedMount             8s     kubelet                  Unable to attach or mount volumes: unmounted volumes=[jenkins-home-source], unattached volumes=[jenkins-home-source], failed to process volumes=[] timed out waiting for the condition

More info: https://medium.com/@golusstyle/demystifying-the-multi-attach-error-for-volume-causes-and-solutions-595a19316a0c

EDIT: we may need to change the current PVC access mode to RWX (ReadWriteMany) to be able to mount it on a second pod for the migration.

smerle33 commented 3 weeks ago

I tried with pod affinity:

affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions: # app.kubernetes.io/instance: jenkins-weekly
          - key: app.kubernetes.io/instance
            operator: In
            values:
            - jenkins-weekly
        topologyKey: app.kubernetes.io/instance   

without any luck

0/8 nodes are available: 1 Insufficient cpu, 2 node(s) didn't match pod affinity rules, 2 node(s) had untolerated taint {CriticalAddonsOnly: true}, 3 node(s) didn't match Pod's node affinity/selector. preemption: 0/8 nodes are available: 1 No preemption victims found for incoming pod, 7 Preemption is not helpful for scheduling..
pod didn't trigger scale-up: 1 node(s) didn't match Pod's node affinity/selector, 1 node(s) didn't match pod affinity rules

will try with node selector directly

dduportal commented 3 weeks ago

Don't forget that, in any case, you need the tolerations to schedule on arm64 nodes such as this one (ref. https://github.com/jenkins-infra/kubernetes-management/blob/899229e1620277d3750ed261417703a073a4736d/config/jenkins_weekly.ci.jenkins.io.yaml#L29-L35), which might explain why pod affinity was necessary but not sufficient.
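
For reference, the tolerations in question are along these lines (mirroring the block used by the migration pod specs below in this thread):

tolerations:
  - key: "kubernetes.io/arch"
    operator: "Equal"
    value: "arm64"
    effect: "NoSchedule"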

smerle33 commented 5 days ago

temporary pod definition:

apiVersion: v1
kind: Pod
metadata:
  name: migrate-volume
  labels:
    name: migrate-volume
  namespace: jenkins-weekly
spec:
  containers:
  - image: debian
    name: migrate-volume-script
    command: ["rsync"]
    args: ["-a", "/var/jenkins_home", "/mnt/"]
    volumeMounts:
    - mountPath: /var/jenkins_home
      name: jenkins-home-source
      readOnly: true
    - mountPath: /mnt
      name: jenkins-home-destination
    resources:
      requests:
        memory: "1Gi"
        cpu: "1000m"
      limits:
        memory: "1Gi"
        cpu: "1000m"
  nodeSelector:
    kubernetes.io/arch: arm64
  tolerations:
    - key: "kubernetes.io/arch"
      operator: "Equal"
      value: "arm64"
      effect: "NoSchedule"
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions: # app.kubernetes.io/instance: jenkins-weekly
          - key: app.kubernetes.io/instance
            operator: In
            values:
            - jenkins-weekly
        topologyKey: app.kubernetes.io/instance
  restartPolicy: Never
  volumes:
  - name: jenkins-home-source
    persistentVolumeClaim:
      claimName: jenkins-weekly
  - name: jenkins-home-destination
    persistentVolumeClaim:
      claimName: jenkins-weekly-data

first try:

second try:

the jenkins-weekly pod stayed on the same node

third try:

The jenkins-weekly pod started on the new node: aks-arm64small2-30051376-vmss00001p/10.245.0.13, but the temporary pod stayed pending with:

0/9 nodes are available: 2 node(s) had untolerated taint {CriticalAddonsOnly:
  true}, 3 node(s) didn't match Pod's node affinity/selector, 4 node(s) didn't
  match pod affinity rules. preemption: 0/9 nodes are available: 9 Preemption is
  not helpful for scheduling..
smerle33 commented 5 days ago

New try:

- manually spawn a new node in the armsmall nodepool
- delete the jenkins-weekly pod to see if it starts again on the new node
- start the temporary pod once weekly is on the new node.
smerle33 commented 5 days ago

still the same behavior:

  0/9 nodes are available: 2 node(s) had untolerated taint {CriticalAddonsOnly:
  true}, 3 node(s) didn't match Pod's node affinity/selector, 4 node(s) didn't
  match pod affinity rules. preemption: 0/9 nodes are available: 9 Preemption is
  not helpful for scheduling..
smerle33 commented 5 days ago

Found it: the affinity was WRONG. The topologyKey needs to be kubernetes.io/hostname, not app.kubernetes.io/instance, as it refers to the node:

  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app.kubernetes.io/instance
            operator: In
            values:
            - jenkins-weekly
        topologyKey: kubernetes.io/hostname
smerle33 commented 4 days ago

So the process to follow is:

smerle33 commented 4 days ago

When removing the resources request, the pod can be scheduled on the existing node of jenkins-weekly; no need to use a brand new node 🎉

  resources:
    requests:
      memory: "1Gi"
      cpu: "1000m"
    limits:
      memory: "1Gi"
      cpu: "1000m"
smerle33 commented 3 days ago

Final version of the migration pod for jenkins-weekly; it will rsync the data:

apiVersion: v1
kind: Pod
metadata:
  name: migrate-volume
  labels:
    name: migrate-volume
  namespace: jenkins-weekly
spec:
  securityContext:
    runAsUser: 1000
    runAsGroup: 1000
    fsGroup: 1000
  containers:
  - image: jenkinsciinfra/packaging:latest
    name: migrate-volume-script
    command: ["rsync"]
    args: ["-a", "--delete", "/var/jenkins_home", "/mnt/"] #will create the destination folder within /mnt/
    volumeMounts:
    - mountPath: /var/jenkins_home
      name: jenkins-home-source
      readOnly: true
    - mountPath: /mnt
      name: jenkins-home-destination
  nodeSelector:
    kubernetes.io/arch: arm64
  tolerations:
    - key: "kubernetes.io/arch"
      operator: "Equal"
      value: "arm64"
      effect: "NoSchedule"
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app.kubernetes.io/instance
            operator: In
            values:
            - jenkins-weekly
        topologyKey: kubernetes.io/hostname
  restartPolicy: Never
  volumes:
  - name: jenkins-home-source
    persistentVolumeClaim:
      claimName: jenkins-weekly
  - name: jenkins-home-destination
    persistentVolumeClaim:
      claimName: jenkins-weekly-data
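
Once the rsync has completed and the controller is stopped, the chart values would presumably be switched to the new claim, along these lines (a sketch assuming the official Jenkins Helm chart's persistence values):

persistence:
  existingClaim: jenkins-weekly-data   # the PVC populated by the migration pod above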