jenkinsci / helm-charts

Jenkins helm charts
https://artifacthub.io/packages/helm/jenkinsci/jenkins
Apache License 2.0

Jenkins cannot create slave agent in EKS 1.29 #1017

Open KosShutenko opened 6 months ago

KosShutenko commented 6 months ago

Describe the bug

I have an EKS cluster running version 1.29. I installed the latest version of the Jenkins helm chart via FluxCD, changing only the Ingress settings in the values. After the installation I tested the Kubernetes connection from Jenkins and it was OK.
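
For context, installing this chart through FluxCD looks roughly like the sketch below. This is a minimal illustration only: the HelmRepository/HelmRelease names, namespaces, intervals and API versions are assumptions, not the exact manifests used here; only the ingress block is overridden (see "How to reproduce it" below).

apiVersion: source.toolkit.fluxcd.io/v1beta2
kind: HelmRepository
metadata:
  name: jenkinsci            # placeholder name
  namespace: flux-system
spec:
  interval: 1h
  url: https://charts.jenkins.io
---
apiVersion: helm.toolkit.fluxcd.io/v2beta2
kind: HelmRelease
metadata:
  name: jenkins
  namespace: jenkins
spec:
  interval: 10m
  chart:
    spec:
      chart: jenkins
      version: "5.0.13"      # chart version reported below
      sourceRef:
        kind: HelmRepository
        name: jenkinsci
        namespace: flux-system
  values:
    controller:
      ingress:
        enabled: true        # only the ingress values are overridden, see "How to reproduce it"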

However, a test job with the default pipeline proposed by Jenkins cannot be executed; I don't see any agent pods starting in the Jenkins namespace.

In the console output I see:

Started by user Jenkins Admin
[Pipeline] Start of Pipeline
[Pipeline] podTemplate
[Pipeline] {
[Pipeline] node
Still waiting to schedule task
‘test-job-1-fd6tk-tcgf3-w2tts’ is offline
ERROR: Failed to launch test-job-1-fd6tk-tcgf3-w2tts
java.io.IOException: Canceled
    at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.kt:72)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109)
    at okhttp3.internal.connection.RealCall.getResponseWithInterceptorChain$okhttp(RealCall.kt:201)
Caused: java.io.InterruptedIOException: timeout
    at okhttp3.internal.connection.RealCall.timeoutExit(RealCall.kt:398)
    at okhttp3.internal.connection.RealCall.callDone(RealCall.kt:360)
    at okhttp3.internal.connection.RealCall.noMoreExchanges$okhttp(RealCall.kt:325)
    at okhttp3.internal.connection.RealCall.getResponseWithInterceptorChain$okhttp(RealCall.kt:209)
    at okhttp3.internal.connection.RealCall$AsyncCall.run(RealCall.kt:517)
Caused: java.io.IOException: timeout
    at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.waitForResult(OperationSupport.java:504)
    at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.handleResponse(OperationSupport.java:524)
    at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.handleCreate(OperationSupport.java:340)
    at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.handleCreate(BaseOperation.java:753)
    at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.handleCreate(BaseOperation.java:97)
    at io.fabric8.kubernetes.client.dsl.internal.CreateOnlyResourceOperation.create(CreateOnlyResourceOperation.java:42)
Caused: io.fabric8.kubernetes.client.KubernetesClientException: An error has occurred.
    at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:129)
    at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:122)
    at io.fabric8.kubernetes.client.dsl.internal.CreateOnlyResourceOperation.create(CreateOnlyResourceOperation.java:44)
    at org.csanchez.jenkins.plugins.kubernetes.KubernetesLauncher.launch(KubernetesLauncher.java:133)
    at hudson.slaves.SlaveComputer.lambda$_connect$0(SlaveComputer.java:297)
    at jenkins.util.ContextResettingExecutorService$2.call(ContextResettingExecutorService.java:46)
    at jenkins.security.ImpersonatingExecutorService$2.call(ImpersonatingExecutorService.java:80)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at java.base/java.lang.Thread.run(Thread.java:840)

In the jenkins-0 pod (jenkins container) I see the following logs:

2024-02-20 07:23:31.605+0000 [id=104]   INFO    hudson.slaves.NodeProvisioner#update: test-job-1-fd6tk-tcgf3-w2tts provisioning successfully completed. We have now 2 computer(s)
2024-02-20 07:23:31.640+0000 [id=103]   INFO    o.c.j.p.k.pod.retention.Reaper#watchCloud: set up watcher on kubernetes
2024-02-20 07:26:36.021+0000 [id=103]   WARNING o.c.j.p.k.KubernetesLauncher#launch: Kubernetes returned unhandled HTTP code -1 null
2024-02-20 07:26:36.129+0000 [id=103]   WARNING o.c.j.p.k.KubernetesLauncher#launch: Error in provisioning; agent=KubernetesSlave name: test-job-1-fd6tk-tcgf3-w2tts, template=PodTemplate{id='8b8ce8ed-a266-4ae3-8795-2243187ec290', name='test-job_1-fd6tk-tcgf3', namespace='jenkins', label='test-job_1-fd6tk', annotations=[PodAnnotation{key='buildUrl', value='http://jenkins.jenkins.svc.cluster.local:8080/job/test-job/1/'}, PodAnnotation{key='runUrl', value='job/test-job/1/'}]}
java.io.IOException: Canceled
    at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.kt:72)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109)
    at okhttp3.internal.connection.RealCall.getResponseWithInterceptorChain$okhttp(RealCall.kt:201)
Caused: java.io.InterruptedIOException: timeout
    at okhttp3.internal.connection.RealCall.timeoutExit(RealCall.kt:398)
    at okhttp3.internal.connection.RealCall.callDone(RealCall.kt:360)
    at okhttp3.internal.connection.RealCall.noMoreExchanges$okhttp(RealCall.kt:325)
    at okhttp3.internal.connection.RealCall.getResponseWithInterceptorChain$okhttp(RealCall.kt:209)
    at okhttp3.internal.connection.RealCall$AsyncCall.run(RealCall.kt:517)
Caused: java.io.IOException: timeout
    at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.waitForResult(OperationSupport.java:504)
    at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.handleResponse(OperationSupport.java:524)
    at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.handleCreate(OperationSupport.java:340)
    at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.handleCreate(BaseOperation.java:753)
    at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.handleCreate(BaseOperation.java:97)
    at io.fabric8.kubernetes.client.dsl.internal.CreateOnlyResourceOperation.create(CreateOnlyResourceOperation.java:42)
Caused: io.fabric8.kubernetes.client.KubernetesClientException: An error has occurred.
    at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:129)
    at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:122)
    at io.fabric8.kubernetes.client.dsl.internal.CreateOnlyResourceOperation.create(CreateOnlyResourceOperation.java:44)
    at org.csanchez.jenkins.plugins.kubernetes.KubernetesLauncher.launch(KubernetesLauncher.java:133)
    at hudson.slaves.SlaveComputer.lambda$_connect$0(SlaveComputer.java:297)
    at jenkins.util.ContextResettingExecutorService$2.call(ContextResettingExecutorService.java:46)
    at jenkins.security.ImpersonatingExecutorService$2.call(ImpersonatingExecutorService.java:80)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at java.base/java.lang.Thread.run(Thread.java:840)

Version of Helm and Kubernetes

- Helm: v3.14.0
- Kubernetes: v1.27.2

Chart version

jenkins-5.0.13

What happened?

1. Install Jenkins helm chart on EKS 1.29
2. Create test pipeline (Kubernetes) job
3. Check logs

What you expected to happen?

I have other Jenkins installations on GKE clusters running 1.26-1.27 and they work fine: Jenkins creates agent pods and executes the pipelines.

How to reproduce it

controller:
  ingress:
    enabled: true
    apiVersion: "networking.k8s.io/v1"
    annotations:
      nginx.ingress.kubernetes.io/rewrite-target: /
      cert-manager.io/cluster-issuer: letsencrypt-dns-prod
      nginx.ingress.kubernetes.io/server-snippets: |
        location / {
          proxy_set_header Upgrade $http_upgrade;
          proxy_http_version 1.1;
          proxy_set_header X-Forwarded-Host $http_host;
          proxy_set_header X-Forwarded-Proto $scheme;
          proxy_set_header X-Forwarded-For $remote_addr;
          proxy_set_header Host $host;
          proxy_set_header Connection "upgrade";
          proxy_cache_bypass $http_upgrade;
        }
    hostName: jenkins.cloud.company.pro
    tls:
      - secretName: tls-secret-jenkins-cloud-company-pro
        hosts:
          - jenkins.cloud.company.pro

Anything else we need to know?

No response

jpriebe commented 1 month ago

@KosShutenko - have you found a workaround for this? We upgraded EKS to 1.29 yesterday, and we are seeing the exact same errors.

Actually, here's a little more info - we were already running EKS 1.29, and jenkins was working. But we updated our nodes from Amazon Linux 2 to Amazon Linux 2023, and we are getting the same error you documented.

timja commented 1 month ago

I would raise this with the kubernetes-plugin, it doesn't look related to the helm chart

jpriebe commented 1 month ago


Quick update on this -- it turns out the cause was a new cluster component we had added recently: the vertical pod autoscaler (https://github.com/kubernetes/autoscaler/tree/master/vertical-pod-autoscaler).

The VPA installs an admission controller webhook. It was that webhook that was timing out, causing the pod creation API call to time out.

@KosShutenko - your problem may not be the VPA, but you might want to look at all your mutating webhooks:

kubectl get MutatingWebhookConfiguration -A
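
In the output, the fields worth checking on each webhook are failurePolicy, timeoutSeconds, the target service, and any namespaceSelector. As a rough illustration (the values below are examples consistent with the API server log further down, not copied from this cluster), a VPA-style entry looks something like:

apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: vpa-webhook-config        # typical name created by the VPA admission controller
webhooks:
  - name: vpa.k8s.io              # the webhook name seen in the API server log
    clientConfig:
      service:
        name: vpa-webhook
        namespace: vpa
        port: 443
    rules:
      - apiGroups: [""]
        apiVersions: ["v1"]
        operations: ["CREATE"]
        resources: ["pods"]       # pod creation goes through this webhook
    failurePolicy: Ignore         # "fail open": admit the pod anyway if the webhook fails
    timeoutSeconds: 30            # the API server waits this long before failing open
    sideEffects: None
    admissionReviewVersions: ["v1"]

If the webhook's service is unreachable, every pod creation in its scope can stall until the API server gives up, which can surface on the Jenkins side as the okhttp/fabric8 timeouts shown above.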

I would also suggest you look at your Kubernetes API server logs in CloudWatch for more clues. In my case, I found log entries like this:

Failed calling webhook, failing open vpa.k8s.io: failed calling webhook "vpa.k8s.io": failed to call webhook: Post "https://vpa-webhook.vpa.svc:443/?timeout=30s": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

That is what led me to identify the VPA as the culprit.
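
If a webhook such as the VPA's does turn out to be the cause, one possible mitigation (a generic sketch, not something confirmed in this thread; the configuration and webhook names are assumed from the log above, and the component owning the configuration may re-create it and undo a manual edit) is to exclude the jenkins namespace via a namespaceSelector:

apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: vpa-webhook-config                 # assumed name; check `kubectl get MutatingWebhookConfiguration`
webhooks:
  - name: vpa.k8s.io
    namespaceSelector:
      matchExpressions:
        - key: kubernetes.io/metadata.name # label set automatically on every namespace
          operator: NotIn
          values: ["jenkins"]              # skip this webhook for pods in the jenkins namespace
    # ...rest of the webhook definition unchanged...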