solo-io / gloo

The Cloud-Native API Gateway and AI Gateway
https://docs.solo.io/
Apache License 2.0

Unable to set nodeSelector on job.batch/gloo-resource-rollout #6847

Closed · danfinn closed this issue 2 years ago

danfinn commented 2 years ago

Gloo Edge Version

1.12.x (beta)

Kubernetes Version

1.21.x

Describe the bug

Using the documentation here:

https://docs.solo.io/gloo-edge/master/reference/helm_chart_values/open_source_helm_chart_values/

I've set all of the available nodeSelector values (there are quite a few) to linux. This has gotten all of my pods up and running; however, I have a job that is failing to run because it's getting scheduled onto a Windows node, and I don't know how to set a nodeSelector on this job. I tried to patch it with kubectl, but you cannot patch the template section of a job once it has been created.
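
To illustrate, this is roughly the kind of patch that fails (a sketch; the exact command isn't in the original report). The API server rejects it because a Job's spec.template field is immutable once the job exists:

# Attempt to add a nodeSelector to the already-created job; kubectl
# accepts the syntax, but the API server rejects the update because
# spec.template on a Job is immutable after creation.
kubectl -n gloo-system patch job gloo-resource-rollout --type merge \
  -p '{"spec":{"template":{"spec":{"nodeSelector":{"kubernetes.io/os":"linux"}}}}}'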

Steps to reproduce the bug

I'm using an Ansible playbook to add the Helm repo and install the chart with the following values set:

---
# Add gloo helm repo
- name: Add gloo chart repo
  kubernetes.core.helm_repository:
    binary_path: "{{ helm_binary_path }}"
    name: gloo
    repo_url: "https://storage.googleapis.com/solo-public-helm"

# Install gloo helm chart
- name: Deploy gloo via helm
  kubernetes.core.helm:
    binary_path: "{{ helm_binary_path }}"
    name: gloo
    chart_ref: "gloo/gloo"
    release_namespace: gloo-system
    create_namespace: true
    values:
      discovery:
        deployment:
          nodeSelector:
            kubernetes.io/os: "linux"
          resources:
            limits:
              cpu: 500m
              memory: 256Mi
            requests:
              cpu: 200m
              memory: 128Mi
      gateway:
        certGenJob:
          nodeSelector:
            kubernetes.io/os: "linux"
        deployment:
          resources:
            limits:
              cpu: 500m
              memory: 256Mi
            requests:
              cpu: 200m
              memory: 128Mi
      gatewayProxies:
        gatewayProxy:
          kind:
            deployment:
              nodeSelector:
                kubernetes.io/os: "linux"
          podTemplate:
            nodeSelector:
              kubernetes.io/os: "linux"
            resources:
              limits:
                cpu: 500m
                memory: 256Mi
              requests:
                cpu: 200m
                memory: 128Mi
      gloo:
        deployment:
          nodeSelector:
            kubernetes.io/os: "linux"
          resources:
            limits:
              cpu: 1000m
              memory: 512Mi
            requests:
              cpu: 500m
              memory: 256Mi
      accessLogger:
        nodeSelector:
          kubernetes.io/os: "linux"
      settings:
        integrations:
          knative:
            proxy:
              nodeSelector:
                kubernetes.io/os: "linux"
      ingress:
        deployment:
          nodeSelector:
            kubernetes.io/os: "linux"
      ingressProxy:
        deployment:
          nodeSelector:
            kubernetes.io/os: "linux"

which results in the job never running because its pod gets scheduled onto a Windows node:

kubectl get pods
NAME                            READY   STATUS              RESTARTS   AGE
discovery-54c6688fff-65rrl      1/1     Running             0          34m
gateway-proxy-c76fb6f88-mwg2x   1/1     Running             0          34m
gloo-77897d5cf4-2xr47           1/1     Running             0          34m
gloo-resource-rollout-2w7wh     0/1     ContainerCreating   0          34m
kubectl describe pod gloo-resource-rollout-2w7wh
Name:           gloo-resource-rollout-2w7wh
Namespace:      gloo-system
Priority:       0
Node:           akswp000001/10.14.248.158
Start Time:     Tue, 02 Aug 2022 13:59:51 -0600
Labels:         controller-uid=ee64078b-68a8-4d36-bbd0-321db5e6c702
                gloo=resource-rollout
                job-name=gloo-resource-rollout
Annotations:    <none>
Status:         Pending
IP:
IPs:            <none>
Controlled By:  Job/gloo-resource-rollout
Containers:
  kubectl:
    Container ID:
    Image:         bitnami/kubectl:1.22.9
    Image ID:
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/sh
      -c
      # if validation webhook is enabled, wait for deployment rollout so validation service will be available
      kubectl rollout status deployment -n gloo-system gloo
      # apply Gloo Edge custom resources
      kubectl apply -f - <<EOF || exit $?
      ---

      apiVersion: gateway.solo.io/v1
      kind: Gateway
      metadata:
        name: gateway-proxy
        namespace: gloo-system
        labels:
          app: gloo
      spec:
        bindAddress: "::"
        bindPort: 8080
        httpGateway: {}
        useProxyProto: false
        ssl: false
        proxyNames:
        - gateway-proxy
      ---

      apiVersion: gateway.solo.io/v1
      kind: Gateway
      metadata:
        name: gateway-proxy-ssl
        namespace: gloo-system
        labels:
          app: gloo
      spec:
        bindAddress: "::"
        bindPort: 8443
        httpGateway: {}
        useProxyProto: false
        ssl: true
        proxyNames:
        - gateway-proxy
      EOF

      # remove the resource-policy annotations that were added temporarily by the gloo-resource-migration job during upgrade
      kubectl annotate upstreams.gloo.solo.io -n gloo-system -l app=gloo helm.sh/resource-policy- || exit $?
      kubectl annotate gateways.gateway.solo.io -n gloo-system -l app=gloo helm.sh/resource-policy- || exit $?

    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-cjksw (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  kube-api-access-cjksw:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason       Age                   From               Message
  ----     ------       ----                  ----               -------
  Normal   Scheduled    34m                   default-scheduler  Successfully assigned gloo-system/gloo-resource-rollout-2w7wh to akswp000001
  Warning  FailedMount  34m                   kubelet            MountVolume.SetUp failed for volume "kube-api-access-cjksw" : chown c:\var\lib\kubelet\pods\4e540cbc-1dc0-40db-a74b-9d53c94b4bc0\volumes\kubernetes.io~projected\kube-api-access-cjksw\..2022_08_02_19_59_52.774512012\token: not supported by windows
  Warning  FailedMount  34m                   kubelet            MountVolume.SetUp failed for volume "kube-api-access-cjksw" : chown c:\var\lib\kubelet\pods\4e540cbc-1dc0-40db-a74b-9d53c94b4bc0\volumes\kubernetes.io~projected\kube-api-access-cjksw\..2022_08_02_19_59_52.891742587\token: not supported by windows
  Warning  FailedMount  34m                   kubelet            MountVolume.SetUp failed for volume "kube-api-access-cjksw" : chown c:\var\lib\kubelet\pods\4e540cbc-1dc0-40db-a74b-9d53c94b4bc0\volumes\kubernetes.io~projected\kube-api-access-cjksw\..2022_08_02_19_59_53.404968606\token: not supported by windows
  Warning  FailedMount  34m                   kubelet            MountVolume.SetUp failed for volume "kube-api-access-cjksw" : chown c:\var\lib\kubelet\pods\4e540cbc-1dc0-40db-a74b-9d53c94b4bc0\volumes\kubernetes.io~projected\kube-api-access-cjksw\..2022_08_02_19_59_55.445568357\token: not supported by windows
  Warning  FailedMount  34m                   kubelet            MountVolume.SetUp failed for volume "kube-api-access-cjksw" : chown c:\var\lib\kubelet\pods\4e540cbc-1dc0-40db-a74b-9d53c94b4bc0\volumes\kubernetes.io~projected\kube-api-access-cjksw\..2022_08_02_19_59_59.644233600\token: not supported by windows
  Warning  FailedMount  34m                   kubelet            MountVolume.SetUp failed for volume "kube-api-access-cjksw" : chown c:\var\lib\kubelet\pods\4e540cbc-1dc0-40db-a74b-9d53c94b4bc0\volumes\kubernetes.io~projected\kube-api-access-cjksw\..2022_08_02_20_00_07.656316639\token: not supported by windows
  Warning  FailedMount  34m                   kubelet            MountVolume.SetUp failed for volume "kube-api-access-cjksw" : chown c:\var\lib\kubelet\pods\4e540cbc-1dc0-40db-a74b-9d53c94b4bc0\volumes\kubernetes.io~projected\kube-api-access-cjksw\..2022_08_02_20_00_24.305427634\token: not supported by windows
  Warning  FailedMount  33m                   kubelet            MountVolume.SetUp failed for volume "kube-api-access-cjksw" : chown c:\var\lib\kubelet\pods\4e540cbc-1dc0-40db-a74b-9d53c94b4bc0\volumes\kubernetes.io~projected\kube-api-access-cjksw\..2022_08_02_20_00_56.171062377\token: not supported by windows
  Warning  FailedMount  12m (x8 over 32m)     kubelet            Unable to attach or mount volumes: unmounted volumes=[kube-api-access-cjksw], unattached volumes=[kube-api-access-cjksw]: timed out waiting for the condition
  Warning  FailedMount  4m19s (x17 over 32m)  kubelet            (combined from similar events): MountVolume.SetUp failed for volume "kube-api-access-cjk
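
For reference, a quick way to confirm that akswp000001 is a Windows node (a check added here for illustration, not from the original report):

# Print the operating system reported by the node's kubelet.
kubectl get node akswp000001 -o jsonpath='{.status.nodeInfo.operatingSystem}'
# windows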

Expected Behavior

There should be a way to specify that this job needs to run on a Linux node. Even better would be a single value that makes everything run on Linux; it's tedious to have to set this in so many different places.
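
For example, a single chart-wide value along these lines (hypothetical; not a value the chart exposed at the time) would cover the deployments and the jobs in one place:

# Hypothetical values.yaml snippet: one global nodeSelector applied to
# every workload the chart renders, instead of per-component settings.
global:
  nodeSelector:
    kubernetes.io/os: linux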

Additional Context

No response

danfinn commented 2 years ago

The following jobs don't seem to have a way to set nodeSelector:

- resource-cleanup
- resource-migration
- resource-rollout

Looking at a job that does support this, you can see that the template for the certgen job pulls in the nodeSelector values at line 32: https://github.com/solo-io/gloo/blob/master/install/helm/gloo/templates/6.5-gateway-certgen-job.yaml
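
The relevant pattern is something along these lines (paraphrased for illustration, not the exact contents of the linked file):

# Helm template idiom: render a nodeSelector block on the job's pod
# spec only when the corresponding chart value is set.
spec:
  template:
    spec:
      {{- with .Values.gateway.certGenJob.nodeSelector }}
      nodeSelector:
        {{- toYaml . | nindent 8 }}
      {{- end }}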

The templates for the three jobs above don't do this, so there is currently no way to set a nodeSelector on them. To work around it for now, I'm generating the manifest with helm template and then adding a nodeSelector to those three jobs by hand, roughly as sketched below.
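
A sketch of that workaround (file names are placeholders):

# Render the chart locally instead of installing it directly.
helm template gloo gloo/gloo --namespace gloo-system --values values.yaml > gloo-manifest.yaml

# Edit gloo-manifest.yaml and, under spec.template.spec of the
# resource-cleanup, resource-migration and resource-rollout jobs, add:
#   nodeSelector:
#     kubernetes.io/os: linux

# Then apply the patched manifest.
kubectl apply -f gloo-manifest.yaml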

perrymckenzie commented 2 years ago

Upvote from me!

danfinn commented 2 years ago

Submitted a PR which I think will fix this:

https://github.com/solo-io/gloo/pull/6878

jenshu commented 2 years ago

Available in GlooEE v1.12.3 (OSS v1.12.3) and GlooEE v1.11.33 (OSS v1.11.28).