jenkins-infra / helpdesk

Open your Infrastructure related issues here for the Jenkins project
https://github.com/jenkins-infra/helpdesk/issues/new/choose

Add observability for the build agents #2769

Closed jtnord closed 1 year ago

jtnord commented 2 years ago

Service

ci.jenkins.io

Summary

As a developer trying to diagnose flaky tests or poor build performance, I have no visibility into the underlying infrastructure, so I cannot correlate failures with it.

Things like:

  1. did my test fail because the CPU was starved?
  2. was memory an issue (was I consuming 100% of available memory and thrashing the GC)? (link to above)
  3. are the memory limits applied to builds appropriate or not?
  4. could I get CI results faster if I ran more forks (surefire) for the given hardware?
  5. is the hardware even moderately sized (hint: it is not), or what should that size be?

e.g. https://github.com/jenkinsci/jenkins/pull/6236#issuecomment-1027107241 and related discussion
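To illustrate the "more forks" question, here is a rough sketch of the arithmetic involved once the agent's CPU count and memory are known. The function name, the per-fork heap figure and the reserved headroom are all hypothetical values chosen for illustration; only `forkCount` is an actual Surefire parameter, and the right numbers would have to come from real measurements on the agents.

```python
def suggested_fork_count(cpu_count: int, memory_mib: int,
                         per_fork_mib: int, reserved_mib: int = 1024) -> int:
    """Estimate how many test forks an agent could run in parallel.

    All inputs are assumptions supplied by the caller:
      cpu_count    - CPUs visible to the build
      memory_mib   - memory available to the build, in MiB
      per_fork_mib - memory one forked test JVM is expected to need, in MiB
      reserved_mib - headroom kept for Maven itself and the OS, in MiB
    """
    if cpu_count < 1 or per_fork_mib < 1:
        raise ValueError("cpu_count and per_fork_mib must be positive")
    usable_mib = max(memory_mib - reserved_mib, 0)
    forks_by_memory = usable_mib // per_fork_mib
    # Bounded by whichever resource runs out first, but never below one fork.
    return max(1, min(cpu_count, forks_by_memory))


if __name__ == "__main__":
    # Hypothetical agent: 4 CPUs, 8 GiB of memory, 1 GiB per fork.
    print(suggested_fork_count(cpu_count=4, memory_mib=8192, per_fork_mib=1024))
```

The result could then be passed to Surefire as `-DforkCount=<n>`. Without metrics from the agents, though, every input to this calculation is a guess, which is the point of this issue.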

Reproduction steps

  1. run a build in CI (e.g. by creating a PR to a plugin or Jenkins core)
  2. try to see if there is CPU to spare to make it faster (or to reduce the amount of CPU an agent needs)
  3. try to see if there is memory to spare to make it faster (or to reduce the amount of memory an agent needs)
  4. try to correlate a test failure with resource constraints (at the time the test was running, were the CPU resources at max capacity?)

timja commented 2 years ago

One point to add on the Windows agents: we are currently running on ACI there and, as far as I remember, already using the maximum resources available.

To get more resources we would need to deploy a Windows node pool on a Kubernetes cluster, which is probably fine to do 😄

basil commented 2 years ago

+UINT64_MAX. I, too, have spent many hours writing various hacks to try and figure out what environment my build was running in to determine whether it was resource-constrained. It would be great to know, at the beginning of the build, what is not directly under my control:

From that point, I can take it from there, because the rest of the build is directly under my control (the Java -Xmx and -Xms settings for Maven and the Jenkins test harness spawned agents).
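As a minimal sketch of the kind of report that could be printed at the start of a build, the snippet below collects a few facts about the machine on a Linux agent. Which facts matter is an assumption here (CPU count, total and available memory, and the cgroup v2 memory limit); the function names are hypothetical, and on agents without `/proc/meminfo` or `/sys/fs/cgroup/memory.max` the corresponding values are simply reported as `None`.

```python
import os
from pathlib import Path
from typing import Optional


def read_text(path: str) -> Optional[str]:
    """Return the file contents, or None if the file cannot be read."""
    try:
        return Path(path).read_text()
    except OSError:
        return None


def parse_meminfo_kib(text: str, key: str) -> Optional[int]:
    """Extract one value (in KiB) from /proc/meminfo-style text."""
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        if name.strip() == key:
            fields = rest.split()
            if fields and fields[0].isdigit():
                return int(fields[0])
    return None


def parse_cgroup_limit(text: str) -> Optional[int]:
    """Parse a cgroup v2 memory.max value; 'max' means no limit (None)."""
    value = text.strip()
    return int(value) if value.isdigit() else None


def describe_environment() -> dict:
    """Collect resource facts that the build itself does not control."""
    meminfo = read_text("/proc/meminfo") or ""
    cgroup = read_text("/sys/fs/cgroup/memory.max")
    return {
        "cpu_count": os.cpu_count(),
        "mem_total_kib": parse_meminfo_kib(meminfo, "MemTotal"),
        "mem_available_kib": parse_meminfo_kib(meminfo, "MemAvailable"),
        "cgroup_memory_limit_bytes":
            parse_cgroup_limit(cgroup) if cgroup is not None else None,
    }


if __name__ == "__main__":
    for name, value in describe_environment().items():
        print(f"{name}: {value}")
```

A snapshot like this only describes the environment at one moment; it does not replace continuous metrics, but it would let a build author choose `-Xmx` and `-Xms` values against known limits rather than guesses.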

basil commented 2 years ago

Since this issue is defined at a very high level, it might be worth splitting this into several sub-tasks. I am more accustomed to Jira, where projects can be organized into epics, issues, sub-tasks, etc. Not sure GitHub can do all of that. But these are some of the concrete action items I can think of:

I fear we may be leaving money on the table: without observability, we cannot implement even simple optimizations. BTW I just used Datadog in the last point as an example; there are a variety of solutions for this type of thing. But whatever solution we choose, we need to provide access to authorized users (who may not be operators).

lemeurherve commented 2 years ago

We're experimenting with milestones as epic tasks.

I haven't dealt with Datadog and observability of the infra yet. I know there is some cleaning @dduportal has planned to do in this area before introducing it to me, but it's certainly something on our list. Looking forward to working on this; your help and knowledge would be more than welcome.

timja commented 2 years ago

We have Datadog sponsorship and also Elastic Cloud, from what I know. Both are probably options.

v1v commented 2 years ago

elastic cloud

Not yet, as far as I know. I'm talking with some people to move this forward.

dduportal commented 2 years ago

elastic cloud

Not yet, as far as I know. I'm talking with some people to move this forward.

Thanks @v1v! @lemeurherve and I were discussing how to (re)start the subject with you, as we (the jenkins-infra team) clearly did not have time in 2021 to do this. Since the opentelemetry plugin is installed everywhere, there are not a lot of things for us to do in order to benefit from this :)

dduportal commented 2 years ago

Recent issues where this could have helped:

Gotta start working on it with https://github.com/jenkins-infra/helpdesk/issues/2980 (scope: VMs only, because of ATH/BOM/Core often failing for highmem agents).

timja commented 2 years ago

We could likely do it with OpenTelemetry ingestion to Datadog: https://www.datadoghq.com/blog/ingest-opentelemetry-traces-metrics-with-datadog-agent/

https://github.com/jenkinsci/opentelemetry-plugin

smerle33 commented 2 years ago

As per https://github.com/jenkins-infra/helpdesk/issues/2980 we now have metrics, but no simple way to display them. A Datadog dashboard would be a solution, but in public view it loses the selector options, so there is no way to specify an agent hostname.

dduportal commented 1 year ago

Closing as we achieved full observability in Datadog, which is useful for sizing the infrastructure and checking errors/events.