Closed by @basil 2 years ago
Please find below the notes from the team post-mortem (overdue by a few weeks):
Thanks to @basil, a preliminary analysis was done prior to this meeting, providing initial guidance. The sections of these notes mirror the sections of that preliminary analysis, as it set the direction of the meeting.
Initial helpdesk issues:
Observable with an error message like the following:

```
Cannot contact jnlp-maven-8-45549: hudson.remoting.ChannelClosedException: Channel "hudson.remoting.Channel@3ba7fb1d:JNLP4-connect connection from ec2-3-143-84-83.us-east-2.compute.amazonaws.com/3.143.84.83:64651": Remote call on JNLP4-connect connection from ec2-3-143-84-83.us-east-2.compute.amazonaws.com/3.143.84.83:64651 failed. The channel is closing down or has closed down
```
@basil has seen this problem in many core and plugin PRs. It is fatal to builds, requiring them to be restarted, which drives up costs.
At a high level, the debugging process goes like this: when the problem occurs, examine both the controller and the agent. There seem to be three different patterns of disconnection:
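As a starting point for the controller-plus-agent triage described above, here is a minimal sketch. The log message is taken from the report; the `kubectl` namespace and log path are assumptions, not something the team documented:

```shell
#!/bin/sh
# Example controller error line (taken from the report above, truncated).
msg='Cannot contact jnlp-maven-8-45549: hudson.remoting.ChannelClosedException: Channel "hudson.remoting.Channel@3ba7fb1d:JNLP4-connect ..."'

# Controller side: extract the name of the agent the controller lost.
agent=$(printf '%s\n' "$msg" | sed -n 's/^Cannot contact \([^:]*\):.*/\1/p')
echo "$agent"   # jnlp-maven-8-45549

# Agent side: the matching Kubernetes pod logs would then be inspected,
# e.g. (hypothetical namespace):
#   kubectl -n jenkins-agents logs "$agent" -c jnlp
```

Comparing timestamps on both sides usually shows which end dropped the channel first.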
Action Items
Issue jenkins-infra/jenkins-infra#2128
Caused errors like this:

```
io.jenkins.plugins.casc.ConfiguratorException: Invalid configuration elements for type class org.csanchez.jenkins.plugins.kubernetes.ContainerTemplate : env.
Available attributes : alwaysPullImage, args, command, envVars, image, livenessProbe, name, ports, privileged, resourceLimitCpu, resourceLimitEphemeralStorage, resourceLimitMemory, resourceRequestCpu, resourceRequestEphemeralStorage, resourceRequestMemory, runAsGroup, runAsUser, shell, ttyEnabled, workingDir
```
The cause is clear: the change was not tested and contained an invalid attribute (the error lists `envVars` as the valid attribute, not `env`). Fixed by collaborating efficiently, thanks y'all!
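For illustration, a minimal JCasC fragment of the kind that triggers this error. The container name, image, and variable are hypothetical; the point is that `ContainerTemplate` expects `envVars`, as listed in the error message:

```yaml
# Invalid: `env` is not an attribute of ContainerTemplate,
# so JCasC raises a ConfiguratorException on reload.
# env:
#   - name: "JAVA_OPTS"
#     value: "-Xmx2g"

# Valid: the attribute is `envVars`.
containers:
  - name: "maven"
    image: "maven:3-jdk-11"
    envVars:
      - envVar:
          key: "JAVA_OPTS"
          value: "-Xmx2g"
```

Running such changes through a staging reload first would have caught this before it reached production.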
Action Items
We resized the disk from 300 GiB to 500 GiB, as it was 91% full. The goal was to bring this partition below the dreaded 80% utilization threshold for performance.
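As a quick sanity check on the numbers above (assuming usage stays roughly constant through the resize): 91% of 300 GiB is about 273 GiB, which lands well under the 80% target on a 500 GiB disk:

```shell
#!/bin/sh
# Space used before the resize: 91% of 300 GiB (integer GiB is close enough).
used=$((300 * 91 / 100))
echo "$used"                    # 273

# Utilization on the resized 500 GiB disk, in percent.
echo "$((used * 100 / 500))"    # 54
```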
Action Items
Many "suspended" nodes with errors like the ones below, but no trace of these pods in the controller UI. In the Kubernetes pod list, all pods are in a valid state with no log errors.
```
903de-aeb0-410d-ad22-8c8eb88071a5', name='jnlp-maven-11', slaveConnectTimeout=100, label='container kubernetes cik8s maven-11 jdk11', containers=[ContainerTemplate{name='jnlp', image='jenkinsciinfra/inbound-agent-maven@sha256:85af372d080c2e4d15d55f8833a6eb5a84f3f206368da002a048ccf49e302c23', workingDir='/home/jenkins', command='/usr/local/bin/jenkins-agent', args='', resourceRequestCpu='4', resourceRequestMemory='8G', resourceRequestEphemeralStorage='', resourceLimitCpu='4', resourceLimitMemory='8G', resourceLimitEphemeralStorage='', livenessProbe=ContainerLivenessProbe{execArgs='', timeoutSeconds=0, initialDelaySeconds=0, failureThreshold=0, periodSeconds=0, successThreshold=0}}], imagePullSecrets=[PodImagePullSecret{name='dockerhub-credential'}]}
2022-04-20 20:04:23.768+0000 [id=718347] INFO o.c.j.p.k.KubernetesSlave#_terminate: Terminating Kubernetes instance for agent jnlp-maven-11-kh5b0
2022-04-20 20:04:23.796+0000 [id=718347] SEVERE o.c.j.p.k.KubernetesSlave#_terminate: Computer for agent is null: jnlp-maven-11-kh5b0
2022-04-20 20:04:23.796+0000 [id=718347] INFO hudson.slaves.AbstractCloudSlave#terminate: FATAL: Computer for agent is null: jnlp-maven-11-kh5b0
2022-04-20 20:04:28.017+0000 [id=734210] INFO h.TcpSlaveAgentListener$ConnectionHandler#run: Accepted JNLP4-connect connection #28,715 from /3.143.84.83:51530
```
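One way to spot this pattern in the controller log is to watch for the `SEVERE` "Computer for agent is null" lines at termination time. A minimal sketch, using the log lines from the report above (in practice `$log` would come from the controller's log file, whose path is an assumption):

```shell
#!/bin/sh
# Sample controller log lines (from the report above).
log='2022-04-20 20:04:23.768+0000 [id=718347] INFO o.c.j.p.k.KubernetesSlave#_terminate: Terminating Kubernetes instance for agent jnlp-maven-11-kh5b0
2022-04-20 20:04:23.796+0000 [id=718347] SEVERE o.c.j.p.k.KubernetesSlave#_terminate: Computer for agent is null: jnlp-maven-11-kh5b0'

# List agents whose Computer was already null at termination time --
# the signature of the "suspended node" pattern.
printf '%s\n' "$log" \
  | sed -n 's/.*SEVERE .*Computer for agent is null: \(.*\)/\1/p' \
  | sort -u
# -> jnlp-maven-11-kh5b0
```

Feeding this list into a monitoring alert would surface the degraded state that the pod list alone hides.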
Action Items
Possible cause: https://issues.jenkins.io/browse/JENKINS-68126 (thanks Hervé!)
=> We believe the DockerHub API rate limit increase should make this issue less likely. If it happens again, it could be worth a deeper diagnosis.
(short-term): Try to reproduce in a staging environment; if reproducible, verify that the analysis from JENKINS-68126 matches the thread dump from the Jenkins controller.
(medium-term): If analysis matches, proceed with development and deployment of fix for JENKINS-68126; otherwise, analyze further by connecting a debugger to Jenkins controller
(long-term): Based on analysis
=> Priority: let's park this issue in the background, as it does not seem to be blocking or likely to recur.
The VM was unresponsive (all CPUs at 100%, Jenkins waiting on I/O). We triggered a reboot.
Action Items
If we can pinpoint the issue to the "dependabot" peak on Sundays, then maybe add a lock on BOM.
Builds from other contributors competing with BOM builds if we test during a peak?
Overall: the infra team is waiting for @basil's feedback or requests (access to systems, installations, etc.), unless of course there is not enough time.
[Post Meeting]
Closing, as there have been no more outages and no actions are left for now.
Tracking improvements to the Kubernetes plugin in https://github.com/jenkins-infra/helpdesk/issues/2964
Service(s)
ci.jenkins.io
Summary
I've now run 3 core PR builds in a row against a Linux JDK 8 Kubernetes container (label: `maven`), and all three have failed with the `ChannelClosedException` error shown above. https://status.jenkins.io/ shows no issues.
Clearly the cluster is in a severely degraded state and monitoring is insufficient.
Reproduction steps
Try to run a core PR build. Observe that the build fails within a minute rather than compiling and testing the code.