[ci.jenkins.io] Container agents in a degraded state

basil commented 2 years ago

Service(s)

ci.jenkins.io

Summary

I've now run 3 core PR builds in a row against a Linux JDK 8 Kubernetes container (label: maven) and all three have failed with:

Cannot contact jnlp-maven-8-45549: hudson.remoting.ChannelClosedException: Channel "hudson.remoting.Channel@3ba7fb1d:JNLP4-connect connection from ec2-3-143-84-83.us-east-2.compute.amazonaws.com/3.143.84.83:64651": Remote call on JNLP4-connect connection from ec2-3-143-84-83.us-east-2.compute.amazonaws.com/3.143.84.83:64651 failed. The channel is closing down or has closed down`

https://status.jenkins.io/ shows no issues.

Clearly the cluster is in a severely degraded state and monitoring is insufficient.

Reproduction steps

Try to run a core PR build. Observe that the build fails within a minute rather than compiling and testing the code.

basil commented 2 years ago

The acute problem I was facing seems to have now cleared up:

https://ci.jenkins.io/job/Core/job/jenkins/job/PR-6499/5/console

This ticket can be closed, since the short-term pain is gone. I am still trying to think of the best way to file a ticket to track the long-term pain of agent instability we are seeing.

dduportal commented 2 years ago

Some elements to add:

There is a lot of jnlp agent that are in "idle" state. It's weird, no idea why/what/how.
I have zero idea how to monitor such behaviors. Maybe collecting all logs and have a sytem grepping it for such errors? Monitoring Jenkins have always been a pain, any idea welcome to be honest.
DockerHub suffered issues ~2 hours ago, it was timeout to pull images: might be related.

dduportal commented 2 years ago

https://status.jenkins.io/ shows no issues.

It's basically a static website: there isn't really automation on this. Do not hesitate to open a PR on it if you see such pain in agents allocation: that would help other users a lot.

dduportal commented 2 years ago

For info, I've triggered a "reload from disk". It usually trigger whatever "agent garbage collector" process in Jenkins usually.

dduportal commented 2 years ago

Currently checking the logs to see what is going wrong

dduportal commented 2 years ago

A lot of errors

io.jenkins.plugins.casc.ConfiguratorException: Invalid configuration elements for type class org.csanchez.jenkins.plugins.kubernetes.ContainerTemplate : env.
Available attributes : alwaysPullImage, args, command, envVars, image, livenessProbe, name, ports, privileged, resourceLimitCpu, resourceLimitEphemeralStorage, resourceLimitMemory, resourceRequestCpu, resourceRequestEphemeralStorage, resourceRequestMemory, runAsGroup, runAsUser, shell, ttyEnabled, workingDir

dduportal commented 2 years ago

Stopped puppet on the VM to validate the correct casc syntax.

[async ping @lemeurherve @smerle33 : it looks like that your work on doks is also generating a lot of errors. It will be deleted by the operation I'm doing right now]

basil commented 2 years ago

[…] Invalid configuration elements […] : env. Available attributes : […] envVars […]

=> https://github.com/jenkins-infra/jenkins-infra/pull/2129

dduportal commented 2 years ago

Many thanks @basil for reporting the issue + sending the PR! I've added a suggestion that I just tested in the production.

dduportal commented 2 years ago

Puppet enabled
@basil PR merged ❤️
Puppet currently applying change, monitoring the logs.

dduportal commented 2 years ago

Change applied with success. New pods are triggered (casc reload went well).

However, still a lot of "suspended" pod. Weird state, with a LOT of error messages like

903de-aeb0-410d-ad22-8c8eb88071a5', name='jnlp-maven-11', slaveConnectTimeout=100, label='container kubernetes cik8s maven-11 jdk11', containers=[ContainerTemplate{name='jnlp', image='jenkinsciinfra/inbound-agent-maven@sha256:85af372d080c2e4d15d55f8833a6eb5a84f3f206368da002a048ccf49e302c23', workingDir='/home/jenkins', command='/usr/local/bin/jenkins-agent', args='', resourceRequestCpu='4', resourceRequestMemory='8G', resourceRequestEphemeralStorage='', resourceLimitCpu='4', resourceLimitMemory='8G', resourceLimitEphemeralStorage='', livenessProbe=ContainerLivenessProbe{execArgs='', timeoutSeconds=0, initialDelaySeconds=0, failureThreshold=0, periodSeconds=0, successThreshold=0}}], imagePullSecrets=[PodImagePullSecret{name='dockerhub-credential'}]}
2022-04-20 20:04:23.768+0000 [id=718347]    INFO    o.c.j.p.k.KubernetesSlave#_terminate: Terminating Kubernetes instance for agent jnlp-maven-11-kh5b0
2022-04-20 20:04:23.796+0000 [id=718347]    SEVERE  o.c.j.p.k.KubernetesSlave#_terminate: Computer for agent is null: jnlp-maven-11-kh5b0
2022-04-20 20:04:23.796+0000 [id=718347]    INFO    hudson.slaves.AbstractCloudSlave#terminate: FATAL: Computer for agent is null: jnlp-maven-11-kh5b0
2022-04-20 20:04:28.017+0000 [id=734210]    INFO    h.TcpSlaveAgentListener$ConnectionHandler#run: Accepted JNLP4-connect connection #28,715 from /3.143.84.83:51530

But no traces of these pod in the UI or in the kubernetes pod list.

dduportal commented 2 years ago

VM is unresponsive (all CPUs at 100%, Jenkins is waiting on I/Os). Triggering a reboot.

dduportal commented 2 years ago

VM rebooted, Jenkins started againt. Still a lot of these weird "suspended" pods.

dduportal commented 2 years ago

Applied last puppet changes / plugins upgrades / Jenkins container restart

dduportal commented 2 years ago

Aaaaaand we are DockerHub rate limited of course

dduportal commented 2 years ago

All pods deleted. Jenkins created a set of new ones \o/

But still getting rate limit from DockerHub. Gotta wait for the next quota in a ~2 hours.

basil commented 2 years ago

Sounds like a good time for me to triage UX regressions. =)

dduportal commented 2 years ago

ci.jenkins.io is restarted again, machine was 100% CPU, with a thousands builds in the queue. Still the BOm builds :'(

dduportal commented 2 years ago

OK, it seems that ci.jenkins.io keeps restarting the BOM PR builds. The following error message happens a lot in the logs:

Queue item for node block in Tools » bom » PR-1034 #1 is missing (perhaps JENKINS-34281); rescheduling

dduportal commented 2 years ago

After restarting the VM, had to delete the $JENKINS_HOME/nodes/jnlp-* to have it starting without error. Seems fine now.

dduportal commented 2 years ago

As per #2894, there were other issues (job scannings not being triggered). Another proper restart of the container in the VM seems to solve this issue.

Let's wait for https://github.com/jenkins-infra/helpdesk/issues/2866 to be applied to both Kubernetes cluster before triggering a bom build again.

dduportal commented 2 years ago

Currently operating the data disk of the VM:

Resizing from 300 to 500Gb as it was filled at 91% (could do a garbage collection but Jenkins being stopped, it is hard). The goal is to get this partition below the dreaded 80% threshold (performances)
Changing disk type from "HDD" to "premium SSD" to boost performances

basil commented 2 years ago

Resizing from 300 to 500Gb as it was filled at 91%

Nice! As a former operator, I understand that disks filling up is a long-term pain point and that it can be difficult to get developers to reduce unnecessary consumption. If, once the dust has settled, you run sudo ncdu -o /tmp/output.txt -x ${JENKINS_HOME} and send me output.txt privately, I can take a look at existing consumption and offer suggestions or PRs to reduce consumption going forward.

dduportal commented 2 years ago

Thanks @basil ! That is a subject that I used to delegate to @MarkEWaite and @timja (as they know what can be deleted or not without bothering the contributors). We'll share this information, we (infra team) are interested into some knowledge sharing on that so we could decide what to "cleanup" in the future (instead of infinite increase ;) ).

dduportal commented 2 years ago

Jenkins is starting: currently scanning the whole JENKINS_HOME, gotta take some times

basil commented 2 years ago

("Infinite increase" in consumption is a common pattern, unfortunately encouraged by cloud providers who stand to make a large amount of profit from this. Maintaining existing levels consumption or decreasing consumption is much harder. I think this is because increases in consumption are an effect typically several layers removed from the cause. Tracing an increase in consumption back to the cause (especially in a distributed system) is significantly more difficult than e.g. debugging a Java stack trace, where each element in the causal chain is presented in the error message. That additional difficulty, in my opinion, explains why people often choose to increase their consumption instead.)

dduportal commented 2 years ago

ci.jenkins.io is back
both EKS and Digital ocean clusters are re-configured (and upgraded to Kube 1.21) and ready but stil:
Kubernetes agent pods are created, running but Jenkins keeps not using them. No error obvious error logs. I have no idea what to do now.

basil commented 2 years ago

Agents seem to be slowly coming up in https://ci.jenkins.io/job/Tools/job/bom/job/PR-1032/2/console so ramp-up appears to be in progress. Assuming we don't eventually overload the system and crash, so far so good. (If we do, having captured thread dumps during the ramp-up period would help in debugging.) Let me know if you want to debug together over a screen share.

dduportal commented 2 years ago

Many thanks for the help proposal @basil . It looks like that the ramp up is in progress, let's wait ~30 min to see how it behaves, now that the clusters are at full scale.

I wonder if we shouldn't add a lock on the BOM builds, cross-branches, given the huge impact it has each time there is a build storm. WDYT?

basil commented 2 years ago

Possibly, depending on the cause. I do not yet fully understand the symptoms of the problem here, let alone the cause. It is possible that the robustness or scalability of the software (i.e., OSS Jenkins core and plugins) could be improved, which would be a more difficult task but would have much higher long-term value than implementing deployment-specific workarounds in the form of locks.

dduportal commented 2 years ago

The queue is (slowly) decreasing:

basil commented 2 years ago

But only because jobs like https://ci.jenkins.io/job/Tools/job/bom/job/PR-1037/2/console are hitting https://github.com/jenkinsci/workflow-durable-task-step-plugin/blob/master/src/main/java/org/jenkinsci/plugins/workflow/support/pickles/ExecutorPickle.java#L131-L133= and being aborted. That in and of itself is a issue: the job should have resumed without interruption, but it did not (something I have complained about before). This issue has the perverse side effect of driving down load and possibly hiding other issues.

jetersen commented 2 years ago

I think we broke it again 😓

dduportal commented 2 years ago

But only because jobs like https://ci.jenkins.io/job/Tools/job/bom/job/PR-1037/2/console are hitting https://github.com/jenkinsci/workflow-durable-task-step-plugin/blob/master/src/main/java/org/jenkinsci/plugins/workflow/support/pickles/ExecutorPickle.java#L131-L133= and being aborted. That in and of itself is a issue: the job should have resumed without interruption, but it did not (something I have complained about before). This issue has the perverse side effect of driving down load and possibly hiding other issues.

I'm sorry @basil but I have no idea what is this "pickle" stuff about. I trust you that you identify what does it causes but it make no sense to me 😅

dduportal commented 2 years ago

I think we broke it again 😓

@jetersen Do you have a job not starting or waiting since hours?

jetersen commented 2 years ago

@dduportal https://github.com/jenkinsci/bom/pull/1029 https://ci.jenkins.io/blue/organizations/jenkins/Tools%2Fbom/detail/PR-1029/2/pipeline https://github.com/jenkinsci/bom/pull/1035 https://ci.jenkins.io/blue/organizations/jenkins/Tools%2Fbom/detail/PR-1035/1/pipeline https://github.com/jenkinsci/bom/pull/1038 https://ci.jenkins.io/blue/organizations/jenkins/Tools%2Fbom/detail/PR-1038/1/pipeline

dduportal commented 2 years ago

@dduportal jenkinsci/bom#1029 https://ci.jenkins.io/blue/organizations/jenkins/Tools%2Fbom/detail/PR-1029/2/pipeline jenkinsci/bom#1035 https://ci.jenkins.io/blue/organizations/jenkins/Tools%2Fbom/detail/PR-1035/1/pipeline jenkinsci/bom#1038 https://ci.jenkins.io/blue/organizations/jenkins/Tools%2Fbom/detail/PR-1038/1/pipeline

Currently checking and running these builds one at a time (I understand there are long term fixes but each wave of dependanbot PRs on the BOM is taking a toll on ci.jenkins.io but having a lock to only allow 1 PR build of the BOM at a time for now would help a lot, given the amount of power that this builds requires).

dduportal commented 2 years ago

OK, so even with 1 bom build, still this "weird" behavior of suspended pods, while the pods are running (with their logs telling us it is connected to Jenkins controller). These pods are trashed after a 5 min timeout, with no build handled.

No obvious log so not sure what is happening and how to troublshoot this.

Let's wait for the result.

dduportal commented 2 years ago

Sounds like it's more a matter of patience: the build is currently being processed.

Some numbers though: a single bom builds generate roughly ~170 parallel branches, each one scheduling a pod. While the maxium amount of pods that we can have is ~ 150 (120 in EKS, 30 in Digital Ocean).

We need to allow the BOM builds to be scheduled on something else than only container + add a lock to sequentialize these builds.

basil commented 2 years ago

I'm sorry @basil but I have no idea what is this "pickle" stuff about. I trust you that you identify what does it causes but it make no sense to me :sweat_smile:

Identifying the cause of failures in an application requires being able to read and understand the application code. If you are uncomfortable doing that, no problem; simply decline to do a retrospective. I am happy to take on the task if desired.

We need to […] add a lock to sequentialize these builds.

As stated in https://github.com/jenkins-infra/helpdesk/issues/2893#issuecomment-1105461838, it is premature to come to a conclusion about what action to take without clearly defining the symptoms of the problem and its cause.

dduportal commented 2 years ago

I'm sorry @basil but I have no idea what is this "pickle" stuff about. I trust you that you identify what does it causes but it make no sense to me 😅

Identifying the cause of failures in an application requires being able to read and understand the application code. If you are uncomfortable doing that, no problem; simply decline to do a retrospective. I am happy to take on the task if desired.

That is an destination to go for, I agree. However abstraction is required because no one is able to have, at a given moment in time, the full knowledge of all elements involved in an application.

If you are ok to do a retrospective, we are happy to get some help and learn from that. I feel like that what we qualify as "retrospective" might feel too "superficial" (in the literal sense, e.g. not deep down) for you as we are not knowledgable enough on low level Jenkins core behavior (not mentioning its plugins).

But the retrospective must be with shades: we cannot think and act in absolutes, because despite the frustration, we are a team with different people whom work, think and learn differently: it's important to have everyone onboard on a retrospective and on the panel of task that should follow.

We need to […] add a lock to sequentialize these builds.

As stated in #2893 (comment), it is premature to come to a conclusion about what action to take without clearly defining the symptoms of the problem and its cause.

There is no plan to do this, as In expect any "retrospective" exercise to write down this as proposal to be challenged of course. But the infrastructure is clearly at risk: whatever long (and good) term solution to implement won't be done in short term and this proposal is in the area of "short term" mitigation. I agree it is "reaction" but that also how we have to deal to limit the "context switching" fatigue that we are facing with such outages.

basil commented 2 years ago

Yes abstraction is required, but with a firm grasp of the fundamentals one can zoom in and out from high-level to low-level as needed, learning the new abstractions just-in-time as one adjusts the zoom level. In my experience there is very little room for shade when it comes to analytical reasoning and root cause analysis. Absolute logic has served me well in that domain.

I actually started to get involved in the Jenkins community when finding the cause of a similar operational failure back in 2018. Amusingly, you can read my early (confused) emails about it here:

Eventually I was able to find the cause and submit a detailed bug report and test case, and Jesse Glick fixed the bug a year later in JENKINS-41854. There was much rejoicing at my company when that issue was resolved.

In any case I agree that we need to take some short to medium term action. I will think about what to do over the next week.

dduportal commented 2 years ago

You are totally right (in my own opinion) and the concern is improving the overall quality of Jenkins. If we hit that issue, for sure other users are.

I only want to add multiple ideas, with different time scopes, to the whole thing.

Let's work on a retrospective next week after dust have settled and everyone's mind was garbage collected by the week-end :)

Many thanks for the insights, the help and work!

lemeurherve commented 2 years ago

Noticed this PR, seems related to our problem: (Jenkins-68126) Remove watcher to fix Jenkins agents in suspended state after upgrade to 2.332.1 with kubernetes agents, queued builds not executing

dduportal commented 2 years ago

For info: post-mortem meeting by the team proposed (should happen wednesday the 03rd of May if everyone is available, otherwise in the upcoming days)

sbeaulie commented 2 years ago

@lemeurherve have you been able to try out our PR, it has been working well for us in production. We have also made other startup performance improvements and will open a separate PR.

sbeaulie commented 2 years ago

This also really helped our infra run things quickly https://github.com/jenkinsci/kubernetes-plugin/pull/1171

dduportal commented 2 years ago

Thanks for the feedback @sbeaulie ! Alas we haven't been able to try this in production. The main reason is that the root cause of the problem is now gone for us (Docker API rate limit, leading into a lot of retries).

(edit) I'm mentionning https://github.com/jenkinsci/kubernetes-plugin/pull/1171. But I forgot about https://github.com/jenkinsci/kubernetes-plugin/pull/1167: we'll see what we can.

dduportal commented 2 years ago

Draft of the post-mortem (collaborative notes that would be moved in GitHub once "frozen"): https://hackmd.io/8IGwo3QXSa-kcuPmKLDTQg.

Last call @basil @lemeurherve @smerle33 @MarkEWaite for review and approval: ETA Monday 30th of May.

dduportal commented 2 years ago

Hello @sbeaulie , we are interested in trying your PR https://github.com/jenkinsci/kubernetes-plugin/pull/1167. As soon as you'll be able to fix the tests / feedbacks, an incremental-build will be generated and be installablle on ci.jenkins.io. Many thanks!

jenkins-infra / helpdesk