Closed by @basil 2 years ago
Please find below the notes from the team post-mortem (overdue by a few weeks):
Thanks to @basil, a preliminary analysis was done prior to this meeting, providing initial guidance. The sections of these notes mirror the sections of that preliminary analysis, as it set the direction of the meeting.
Initial helpdesk issues:
Observable with an error message like the following:

```
Cannot contact jnlp-maven-8-45549: hudson.remoting.ChannelClosedException: Channel "hudson.remoting.Channel@3ba7fb1d:JNLP4-connect connection from ec2-3-143-84-83.us-east-2.compute.amazonaws.com/3.143.84.83:64651": Remote call on JNLP4-connect connection from ec2-3-143-84-83.us-east-2.compute.amazonaws.com/3.143.84.83:64651 failed. The channel is closing down or has closed down
```
@basil has seen this problem in many core and plugin PRs. It is fatal to builds, requiring them to be restarted, which drives up costs.
At a high level, the debugging process goes like this: when the problem occurs, examine both the controller and the agent. There seem to be three different patterns of disconnection:
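As a starting point for the controller-plus-agent triage described above, here is a minimal sketch. The log message is taken from the report; the `kubectl` namespace and log path are assumptions, not something the team documented:

```shell
#!/bin/sh
# Example controller error line (taken from the report above, truncated).
msg='Cannot contact jnlp-maven-8-45549: hudson.remoting.ChannelClosedException: Channel "hudson.remoting.Channel@3ba7fb1d:JNLP4-connect ..."'

# Controller side: extract the name of the agent the controller lost.
agent=$(printf '%s\n' "$msg" | sed -n 's/^Cannot contact \([^:]*\):.*/\1/p')
echo "$agent"   # jnlp-maven-8-45549

# Agent side: the matching Kubernetes pod logs would then be inspected,
# e.g. (hypothetical namespace):
#   kubectl -n jenkins-agents logs "$agent" -c jnlp
```

Comparing timestamps on both sides usually shows which end dropped the channel first.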
Action Items
Issue jenkins-infra/jenkins-infra#2128
Caused errors like this:

```
io.jenkins.plugins.casc.ConfiguratorException: Invalid configuration elements for type class org.csanchez.jenkins.plugins.kubernetes.ContainerTemplate : env.
Available attributes : alwaysPullImage, args, command, envVars, image, livenessProbe, name, ports, privileged, resourceLimitCpu, resourceLimitEphemeralStorage, resourceLimitMemory, resourceRequestCpu, resourceRequestEphemeralStorage, resourceRequestMemory, runAsGroup, runAsUser, shell, ttyEnabled, workingDir
```
The cause is clear: the change was not tested and contained an invalid attribute (the error lists `envVars` as the valid attribute, not `env`). Fixed by collaborating efficiently, thanks y'all!
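For illustration, a minimal JCasC fragment of the kind that triggers this error. The container name, image, and variable are hypothetical; the point is that `ContainerTemplate` expects `envVars`, as listed in the error message:

```yaml
# Invalid: `env` is not an attribute of ContainerTemplate,
# so JCasC raises a ConfiguratorException on reload.
# env:
#   - name: "JAVA_OPTS"
#     value: "-Xmx2g"

# Valid: the attribute is `envVars`.
containers:
  - name: "maven"
    image: "maven:3-jdk-11"
    envVars:
      - envVar:
          key: "JAVA_OPTS"
          value: "-Xmx2g"
```

Running such changes through a staging reload first would have caught this before it reached production.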
Action Items
We resized the disk from 300 GiB to 500 GiB, as it was 91% full. The goal was to bring this partition below the dreaded 80% utilization threshold for performance.
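As a quick sanity check on the numbers above (assuming usage stays roughly constant through the resize): 91% of 300 GiB is about 273 GiB, which lands well under the 80% target on a 500 GiB disk:

```shell
#!/bin/sh
# Space used before the resize: 91% of 300 GiB (integer GiB is close enough).
used=$((300 * 91 / 100))
echo "$used"                    # 273

# Utilization on the resized 500 GiB disk, in percent.
echo "$((used * 100 / 500))"    # 54
```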
Action Items
Many "suspended" nodes with errors like the ones below, but no trace of these pods in the controller UI. In the Kubernetes pod list, all pods are in a valid state with no log errors.
```
903de-aeb0-410d-ad22-8c8eb88071a5', name='jnlp-maven-11', slaveConnectTimeout=100, label='container kubernetes cik8s maven-11 jdk11', containers=[ContainerTemplate{name='jnlp', image='jenkinsciinfra/inbound-agent-maven@sha256:85af372d080c2e4d15d55f8833a6eb5a84f3f206368da002a048ccf49e302c23', workingDir='/home/jenkins', command='/usr/local/bin/jenkins-agent', args='', resourceRequestCpu='4', resourceRequestMemory='8G', resourceRequestEphemeralStorage='', resourceLimitCpu='4', resourceLimitMemory='8G', resourceLimitEphemeralStorage='', livenessProbe=ContainerLivenessProbe{execArgs='', timeoutSeconds=0, initialDelaySeconds=0, failureThreshold=0, periodSeconds=0, successThreshold=0}}], imagePullSecrets=[PodImagePullSecret{name='dockerhub-credential'}]}
2022-04-20 20:04:23.768+0000 [id=718347] INFO o.c.j.p.k.KubernetesSlave#_terminate: Terminating Kubernetes instance for agent jnlp-maven-11-kh5b0
2022-04-20 20:04:23.796+0000 [id=718347] SEVERE o.c.j.p.k.KubernetesSlave#_terminate: Computer for agent is null: jnlp-maven-11-kh5b0
2022-04-20 20:04:23.796+0000 [id=718347] INFO hudson.slaves.AbstractCloudSlave#terminate: FATAL: Computer for agent is null: jnlp-maven-11-kh5b0
2022-04-20 20:04:28.017+0000 [id=734210] INFO h.TcpSlaveAgentListener$ConnectionHandler#run: Accepted JNLP4-connect connection #28,715 from /3.143.84.83:51530
```
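One way to spot this pattern in the controller log is to watch for the `SEVERE` "Computer for agent is null" lines at termination time. A minimal sketch, using the log lines from the report above (in practice `$log` would come from the controller's log file, whose path is an assumption):

```shell
#!/bin/sh
# Sample controller log lines (from the report above).
log='2022-04-20 20:04:23.768+0000 [id=718347] INFO o.c.j.p.k.KubernetesSlave#_terminate: Terminating Kubernetes instance for agent jnlp-maven-11-kh5b0
2022-04-20 20:04:23.796+0000 [id=718347] SEVERE o.c.j.p.k.KubernetesSlave#_terminate: Computer for agent is null: jnlp-maven-11-kh5b0'

# List agents whose Computer was already null at termination time --
# the signature of the "suspended node" pattern.
printf '%s\n' "$log" \
  | sed -n 's/.*SEVERE .*Computer for agent is null: \(.*\)/\1/p' \
  | sort -u
# -> jnlp-maven-11-kh5b0
```

Feeding this list into a monitoring alert would surface the degraded state that the pod list alone hides.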
Action Items
Possible cause: https://issues.jenkins.io/browse/JENKINS-68126 (thanks Hervé!)
=> We believe the DockerHub API rate limit increase should make this issue less likely. If it happens again, it could be worth a deeper diagnosis.
(short-term): Try to reproduce in a staging environment; if reproducible, verify that the analysis from JENKINS-68126 matches the thread dump from the Jenkins controller.
(medium-term): If analysis matches, proceed with development and deployment of fix for JENKINS-68126; otherwise, analyze further by connecting a debugger to Jenkins controller
(long-term): Based on analysis
=> Priority: let's park this issue in the background, as it does not seem to be blocking or likely to recur.
The VM was unresponsive (all CPUs at 100%, Jenkins waiting on I/O). We triggered a reboot.
Action Items
If we can pinpoint the issue to the "dependabot" peak on Sundays, then maybe add a lock on BOM.
Builds from other contributors competing with BOM builds if we test during a peak?
Overall: the infra team is waiting for @basil's feedback or requests (access to systems, installations, etc.), unless of course there is not enough time.
[Post Meeting]
Closing, as there have been no more outages and no actions are left for now.
Tracking improvements to the Kubernetes plugin in https://github.com/jenkins-infra/helpdesk/issues/2964
Service(s)
ci.jenkins.io
Summary
I've now run 3 core PR builds in a row against a Linux JDK 8 Kubernetes container (label: `maven`), and all three have failed with the `ChannelClosedException` error shown above. https://status.jenkins.io/ shows no issues.
Clearly the cluster is in a severely degraded state and monitoring is insufficient.
Reproduction steps
Try to run a core PR build. Observe that the build fails within a minute rather than compiling and testing the code.