
[ci.jenkins.io] Container agents in a degraded state #2893

Closed 2 years ago · opened by basil

basil commented 2 years ago

Service(s)

ci.jenkins.io

Summary

I've now run 3 core PR builds in a row against a Linux JDK 8 Kubernetes container (label: maven) and all three have failed with:

Cannot contact jnlp-maven-8-45549: hudson.remoting.ChannelClosedException: Channel "hudson.remoting.Channel@3ba7fb1d:JNLP4-connect connection from ec2-3-143-84-83.us-east-2.compute.amazonaws.com/3.143.84.83:64651": Remote call on JNLP4-connect connection from ec2-3-143-84-83.us-east-2.compute.amazonaws.com/3.143.84.83:64651 failed. The channel is closing down or has closed down

https://status.jenkins.io/ shows no issues.

Clearly the cluster is in a severely degraded state and monitoring is insufficient.

Reproduction steps

Try to run a core PR build. Observe that the build fails within a minute rather than compiling and testing the code.

dduportal commented 2 years ago

Please find below the notes from the team post-mortem (it had been due for a few weeks):

Post-Mortem of Outages on ci.jenkins.io (April/May 2022)

Attendees

Introduction

Thanks to @basil, a preliminary analysis was done prior to this meeting, providing initial guidance. The sections of these notes mirror the sections of that preliminary analysis, as it set the direction of the meeting.

Initial helpdesk issues:

General Process

Agent disconnections

Observable with an error message like the following:

Cannot contact jnlp-maven-8-45549: hudson.remoting.ChannelClosedException: Channel "hudson.remoting.Channel@3ba7fb1d:JNLP4-connect connection from ec2-3-143-84-83.us-east-2.compute.amazonaws.com/3.143.84.83:64651": Remote call on JNLP4-connect connection from ec2-3-143-84-83.us-east-2.compute.amazonaws.com/3.143.84.83:64651 failed. The channel is closing down or has closed down

@basil has seen this problem in many core and plugin PRs. It is fatal to builds, requiring them to be restarted, which drives up costs.

At a high level, the debugging process goes like this: when the problem is occurring, examine both the controller and the agent sides (a controller-side sketch is shown below).
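
As a starting point for the controller-side half of that check, here is a minimal sketch. It assumes anonymous read access (or an API token) to the controller's REST API and uses ci.jenkins.io as the base URL; both are illustrative assumptions, not values from the post-mortem.

```python
# Minimal sketch: list agents the controller currently considers offline,
# together with the recorded offline cause, via the Jenkins REST API.
# Assumption: the endpoint is reachable anonymously or with an API token.
import requests

JENKINS_URL = "https://ci.jenkins.io"  # assumed base URL; adjust as needed


def list_offline_agents(base_url: str) -> None:
    resp = requests.get(
        f"{base_url}/computer/api/json",
        params={"tree": "computer[displayName,offline,temporarilyOffline,offlineCauseReason]"},
        timeout=30,
    )
    resp.raise_for_status()
    for computer in resp.json().get("computer", []):
        if computer.get("offline"):
            cause = computer.get("offlineCauseReason") or "unknown"
            print(f"{computer['displayName']}: offline (cause: {cause})")


if __name__ == "__main__":
    list_offline_agents(JENKINS_URL)
```

On the agent side, the corresponding check is to look at the pod's own logs (for example with kubectl logs on the agent pod) around the time of the disconnection.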

There seem to be three different patterns of disconnection:

Action Items

Deployment of an untested change on the infra

Issue jenkins-infra/jenkins-infra#2128

Caused errors like this:

io.jenkins.plugins.casc.ConfiguratorException: Invalid configuration elements for type class org.csanchez.jenkins.plugins.kubernetes.ContainerTemplate : env.
Available attributes : alwaysPullImage, args, command, envVars, image, livenessProbe, name, ports, privileged, resourceLimitCpu, resourceLimitEphemeralStorage, resourceLimitMemory, resourceRequestCpu, resourceRequestEphemeralStorage, resourceRequestMemory, runAsGroup, runAsUser, shell, ttyEnabled, workingDir

The cause is clear: the change was not tested and contained a syntax error (the pod template used an "env" element, whereas the error message lists "envVars" as the valid ContainerTemplate attribute). Fixed by collaborating efficiently, thanks y'all!

Action Items

Low disk space

We resized the disk from 300 GiB to 500 GiB, as it was 91% full. The goal was to get this partition back below the dreaded 80% threshold for performance: roughly 273 GiB were in use, which is only about 55% of the new 500 GiB capacity.

Action Items

Suspended Kubernetes Agents

Many "suspended" nodes with errors like the below but no traces of these pods in the controller UI. In the Kubernetes pod list, all pods are in a valid state with no log errors.

903de-aeb0-410d-ad22-8c8eb88071a5', name='jnlp-maven-11', slaveConnectTimeout=100, label='container kubernetes cik8s maven-11 jdk11', containers=[ContainerTemplate{name='jnlp', image='jenkinsciinfra/inbound-agent-maven@sha256:85af372d080c2e4d15d55f8833a6eb5a84f3f206368da002a048ccf49e302c23', workingDir='/home/jenkins', command='/usr/local/bin/jenkins-agent', args='', resourceRequestCpu='4', resourceRequestMemory='8G', resourceRequestEphemeralStorage='', resourceLimitCpu='4', resourceLimitMemory='8G', resourceLimitEphemeralStorage='', livenessProbe=ContainerLivenessProbe{execArgs='', timeoutSeconds=0, initialDelaySeconds=0, failureThreshold=0, periodSeconds=0, successThreshold=0}}], imagePullSecrets=[PodImagePullSecret{name='dockerhub-credential'}]}
2022-04-20 20:04:23.768+0000 [id=718347]    INFO    o.c.j.p.k.KubernetesSlave#_terminate: Terminating Kubernetes instance for agent jnlp-maven-11-kh5b0
2022-04-20 20:04:23.796+0000 [id=718347]    SEVERE    o.c.j.p.k.KubernetesSlave#_terminate: Computer for agent is null: jnlp-maven-11-kh5b0
2022-04-20 20:04:23.796+0000 [id=718347]    INFO    hudson.slaves.AbstractCloudSlave#terminate: FATAL: Computer for agent is null: jnlp-maven-11-kh5b0
2022-04-20 20:04:28.017+0000 [id=734210]    INFO    h.TcpSlaveAgentListener$ConnectionHandler#run: Accepted JNLP4-connect connection #28,715 from /3.143.84.83:51530
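
To make the "no trace in the controller UI" symptom easier to spot, a rough cross-check between the Kubernetes pod list and the controller's agent list can help. The sketch below assumes kubeconfig access to the agent cluster and a namespace named jenkins-agents; both the namespace name and the controller URL are illustrative assumptions, not values taken from this issue.

```python
# Minimal sketch: flag agent pods that exist in Kubernetes but have no matching
# agent (Computer) on the controller, i.e. candidates for the dangling/"suspended"
# state described above. Namespace and URL below are assumptions for illustration.
import requests
from kubernetes import client, config

JENKINS_URL = "https://ci.jenkins.io"  # assumed controller URL
NAMESPACE = "jenkins-agents"           # hypothetical namespace for agent pods


def controller_agent_names(base_url: str) -> set:
    resp = requests.get(
        f"{base_url}/computer/api/json",
        params={"tree": "computer[displayName]"},
        timeout=30,
    )
    resp.raise_for_status()
    return {c["displayName"] for c in resp.json().get("computer", [])}


def report_dangling_pods(namespace: str, known_agents: set) -> None:
    config.load_kube_config()  # or config.load_incluster_config() inside the cluster
    for pod in client.CoreV1Api().list_namespaced_pod(namespace).items:
        name, phase = pod.metadata.name, pod.status.phase
        if name not in known_agents:
            print(f"Pod {name} is {phase} in Kubernetes but unknown to the controller")


if __name__ == "__main__":
    report_dangling_pods(NAMESPACE, controller_agent_names(JENKINS_URL))
```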

Action Items

=> Priority: let's park this issue in the background, as it does not seem to be blocking nor likely to happen often.

Jenkins unresponsive

The VM is unresponsive (all CPUs at 100%, Jenkins waiting on I/O). Triggered a reboot.

Action Items

Proposal from Infra team

Overall: the Infra team is waiting for @basil's feedback or requests (access to systems, installations, etc.), unless of course there is not enough time.

[Post Meeting]

dduportal commented 2 years ago

Closing, as there have been no more outages and no actions are left for now.

Tracking improvements to the kubernetes plugin in https://github.com/jenkins-infra/helpdesk/issues/2964