Add observability for the build agents

jtnord commented 2 years ago

Service

ci.jenkins.io

Summary

As a developer investigating flaky tests or build performance, I have no visibility into the underlying infrastructure, so I cannot correlate failures with what the agent was doing at the time.

Things like: 1) did my test fail because the CPU was starved; 2) was memory an issue (was I consuming 100% of available memory and thrashing the GC) (link to above); 3) are the memory limits applied to builds appropriate or not; 4) could I get CI results faster if I ran more forks (surefire) for the given hardware; 5) is the hardware even moderately sized (hint: it is not), or what should that size be?

e.g. https://github.com/jenkinsci/jenkins/pull/6236#issuecomment-1027107241 and related discussion

Reproduction steps

Run a build in CI (e.g. by creating a PR to a plugin or Jenkins core).
Try to see if there is CPU to spare to make it faster (or to reduce the amount of CPU an agent needs).
Try to see if there is memory to spare to make it faster (or to reduce the amount of memory an agent needs).
Try to correlate a test failure with resource constraints (at the time the test was running, were the CPU resources at max capacity?).
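
The last step is easy to automate once resource samples exist. Below is a minimal sketch, assuming CPU utilization has already been collected as (timestamp, percent) pairs for the agent. The function name, the 95% threshold, the 30-second window, and the sample data are all hypothetical and only illustrate the idea; nothing here is provided by ci.jenkins.io.

```python
from datetime import datetime, timedelta


def failures_under_cpu_pressure(cpu_samples, failure_times,
                                threshold=95.0,
                                window=timedelta(seconds=30)):
    """Return the failure timestamps that had at least one CPU sample
    at or above `threshold` percent within `window` of the failure."""
    flagged = []
    for failed_at in failure_times:
        # Collect the CPU readings taken close to the failure.
        nearby = [pct for ts, pct in cpu_samples
                  if abs(ts - failed_at) <= window]
        if nearby and max(nearby) >= threshold:
            flagged.append(failed_at)
    return flagged


# Hypothetical data: one CPU sample every 10 seconds.
start = datetime(2022, 2, 1, 12, 0, 0)
cpu_samples = [(start + timedelta(seconds=10 * i), pct)
               for i, pct in enumerate([40.0, 55.0, 99.0, 100.0, 60.0, 35.0])]
# One failure during the CPU spike, one long after sampling stopped.
failure_times = [start + timedelta(seconds=25),
                 start + timedelta(seconds=300)]

flagged = failures_under_cpu_pressure(cpu_samples, failure_times)
print(flagged)
```

A failure with no samples nearby is not flagged, since without data there is nothing to correlate it with.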

timja commented 2 years ago

One point to add on the Windows agents: we are currently running on ACI there and, from my memory of it, already using the limit of what is available.

To get more resources we would need to deploy a windows node pool on a Kubernetes cluster, which is probably fine to do 😄

basil commented 2 years ago

+UINT64_MAX. I, too, have spent many hours writing various hacks to try and figure out what environment my build was running in to determine whether it was resource-constrained. It would be great to know, at the beginning of the build, what is not directly under my control:

VM CPU/memory (or container limits, if not a VM)
-Xmx and -Xms settings for the JNLP JVM (or, if none were explicitly specified, what Java ergonomics decided on)

From that point, I can take it from there, because the rest of the build is directly under my control (the Java -Xmx and -Xms settings for Maven and the Jenkins test harness spawned agents).
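
A minimal sketch of what such a start-of-build report could look like, assuming a Linux agent where the container memory limit is exposed through the usual cgroup v2 or v1 files. The function names are hypothetical, and this is not how any Jenkins plugin reports it today.

```python
import os
from pathlib import Path

# cgroup v2 and cgroup v1 locations of the container memory limit (Linux only).
CGROUP_MEMORY_FILES = (
    Path("/sys/fs/cgroup/memory.max"),
    Path("/sys/fs/cgroup/memory/memory.limit_in_bytes"),
)


def parse_cgroup_limit(raw):
    """Parse a cgroup memory limit in bytes.

    cgroup v2 writes the literal string 'max' when no limit is set;
    that case (and any non-numeric content) is reported as None.
    """
    raw = raw.strip()
    if not raw.isdigit():
        return None
    return int(raw)


def container_memory_limit():
    """Return the limit from the first readable cgroup file, or None if
    no file is readable (for example on a non-Linux agent)."""
    for path in CGROUP_MEMORY_FILES:
        try:
            return parse_cgroup_limit(path.read_text())
        except OSError:
            continue
    return None


def describe_environment():
    """Collect the facts a build cannot control on its own."""
    return {
        # Logical CPUs visible to the process; this does not account
        # for a cgroup CPU quota, which may be lower.
        "cpu_count": os.cpu_count(),
        "container_memory_limit_bytes": container_memory_limit(),
    }


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

Note that cgroup v1 reports "no limit" as a very large number rather than a keyword, so a huge value there should be read as unlimited. For the JVM side, `java -XX:+PrintFlagsFinal -version` prints the `MaxHeapSize` that ergonomics picks for a JVM started with the same options.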

basil commented 2 years ago

Since this issue is defined at a very high level, it might be worth splitting this into several sub-tasks. I am more accustomed to Jira, where projects can be organized into epics, issues, sub-tasks, etc. Not sure GitHub can do all of that. But these are some of the concrete action items I can think of:

Displaying the VM CPU/memory (or container limits, if not a VM). We might opt to implement this in the relevant plugins themselves (seems ideal), or otherwise provide a global shared variable in pipeline-library to display this information for our particular setup (seems like a maintenance burden to keep it up-to-date, though). If there is consensus on implementing this in the relevant plugins, I am happy to help on the development side.
Setting appropriate values for the JNLP JVM used to connect to ci.jenkins.io (especially bounding -Xmx to a low value rather than relying on JVM ergonomics, which might waste a lot of RAM). This is specific to the Jenkins infrastructure. CloudBees likely has recommendations for the -Xmx setting here, but I would imagine something like 256 MiB of RAM would suffice. This setting needs to be applied in a variety of configurations (e.g., Kubernetes agents, ACI agents, and one-shot VM agents).
Provide visibility into what happened during a run. Let me use Datadog as an example, since I am familiar with it. With Datadog, you can see a graph of CPU and memory usage over time as well as get the breakdown of what processes were running at a point in time in the past. With such a view, we could quantify the resource utilization of a given job (like core and BOM builds), and with the historical data we can possibly optimize the job (e.g. by setting -Xmx arguments for the various JVMs). Who knows, maybe with each JVM tuned appropriately we could run some builds on 4 GiB of RAM instead of 8 GiB. Without observability, we'll never know!
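
To put rough numbers on the -Xmx point: the arithmetic below is a sketch, assuming HotSpot's default `MaxRAMPercentage` of 25 (the maximum heap defaults to a quarter of the memory the JVM sees) and ignoring the special cases for very small memory sizes. The 8 GiB agent size and the 256 MiB bound are the figures mentioned above; the function names are hypothetical.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def default_max_heap(memory_bytes, max_ram_percentage=25.0):
    """Heap ceiling that JVM ergonomics would pick when no -Xmx is given,
    assuming the default MaxRAMPercentage of 25."""
    return int(memory_bytes * max_ram_percentage / 100)


def ceiling_reduction(memory_bytes, explicit_xmx_bytes):
    """How much lower the heap ceiling becomes with an explicit -Xmx."""
    return default_max_heap(memory_bytes) - explicit_xmx_bytes


agent_memory = 8 * GIB
print(default_max_heap(agent_memory) // MIB)              # 2048
print(ceiling_reduction(agent_memory, 256 * MIB) // MIB)  # 1792
```

The default is a ceiling, not actual usage: the agent JVM may never grow that far. Bounding it only guarantees that it cannot compete with the build's own JVMs for that memory.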

I fear we may be leaving money on the table: without observability, we cannot identify and implement even simple optimizations. BTW, I just used Datadog in the last point as an example; there are a variety of solutions for this type of thing. But whatever solution we choose, we need to provide access to authorized users (which may not be operators).

lemeurherve commented 2 years ago

We're experimenting with milestones as epic tasks.

I haven't dealt with Datadog and observability of the infra yet. I know there is some cleaning @dduportal has planned to do in this area before introducing it to me, but it's certainly something on our list. Looking forward to working on this; your help and knowledge would be more than welcome.

timja commented 2 years ago

We have Datadog sponsorship and also Elastic Cloud, from what I know. Both are probably options.

v1v commented 2 years ago

elastic cloud

Not yet, as far as I know. I'm talking with some people to move this forward.

dduportal commented 2 years ago

elastic cloud

Not yet, as far as I know. I'm talking with some people to move this forward.

Thanks @v1v! @lemeurherve and I were discussing how to (re)start the subject with you, as we (the jenkins-infra team) clearly did not have time in 2021 to do this. Since the OpenTelemetry plugin is installed everywhere, there is not a lot for us to do in order to benefit from this :)

dduportal commented 2 years ago

Recent issues where this could have helped:

Gotta start working on it with https://github.com/jenkins-infra/helpdesk/issues/2980 (scope: VMs only, because of ATH/BOM/Core often failing for highmem agents).

timja commented 2 years ago

We could likely do it with OpenTelemetry ingestion to Datadog: https://www.datadoghq.com/blog/ingest-opentelemetry-traces-metrics-with-datadog-agent/

https://github.com/jenkinsci/opentelemetry-plugin

smerle33 commented 2 years ago

As per https://github.com/jenkins-infra/helpdesk/issues/2980 we now have metrics, but no simple way to display them. A Datadog dashboard would be a solution, but in public view it loses the selector options, so there is no way to specify an agent hostname.

As a temporary solution, we will provide a new helpdesk issue kind so that we can provide information on demand: https://github.com/jenkins-infra/helpdesk/issues/new?assignees=&labels=metrics%2Ctriage&template=5-metrics.yml
As a long-term solution, we can look into Grafana or another metrics dashboard to use as a public source of metrics that could provide at least CPU and memory information for a specific agent corresponding to a specific build.
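
Whatever dashboard ends up serving it, the query behind "CPU and memory for a specific agent corresponding to a specific build" amounts to filtering by hostname and by the build's time window. Here is a minimal sketch of that selection, assuming the metrics can be exported as flat records. The `Sample` layout, the function name, and the hostnames are hypothetical and do not reflect the Datadog or Grafana data model.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Sample:
    hostname: str
    timestamp: datetime
    cpu_percent: float
    memory_percent: float


def summarize_build(samples, hostname, build_start, build_end):
    """Summarize the samples of one agent during one build, or return
    None when no sample falls inside the build window."""
    in_build = [s for s in samples
                if s.hostname == hostname
                and build_start <= s.timestamp <= build_end]
    if not in_build:
        return None
    return {
        "samples": len(in_build),
        "peak_cpu_percent": max(s.cpu_percent for s in in_build),
        "peak_memory_percent": max(s.memory_percent for s in in_build),
    }


# Hypothetical records from two agents.
samples = [
    Sample("agent-a", datetime(2022, 6, 1, 10, 0), 35.0, 50.0),
    Sample("agent-a", datetime(2022, 6, 1, 10, 5), 98.0, 72.0),
    Sample("agent-a", datetime(2022, 6, 1, 11, 0), 10.0, 30.0),
    Sample("agent-b", datetime(2022, 6, 1, 10, 5), 100.0, 99.0),
]
summary = summarize_build(samples, "agent-a",
                          datetime(2022, 6, 1, 9, 55),
                          datetime(2022, 6, 1, 10, 30))
print(summary)
```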

dduportal commented 1 year ago

Closing as we achieved full observability in Datadog, which is useful for sizing the infrastructure and checking errors/events.
