One point to add on the Windows agents: we are currently running on ACI there and, from my memory of it, already using the limit of what is available.
To get more resources we would need to deploy a windows node pool on a Kubernetes cluster, which is probably fine to do 😄
+UINT64_MAX. I, too, have spent many hours writing various hacks to try and figure out what environment my build was running in to determine whether it was resource-constrained. It would be great to know, at the beginning of the build, what is not directly under my control: the -Xmx and -Xms settings for the JNLP JVM (or, if none were explicitly specified, what Java ergonomics decided on). From that point, I can take it from there, because the rest of the build is directly under my control (the Java -Xmx and -Xms settings for Maven and the Jenkins test harness spawned agents).
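For what it's worth, the JVM can already report both of these things about itself. Here is a minimal Java sketch (purely illustrative, not tied to any existing plugin) that a build could run first thing to print the explicitly passed flags and the effective heap limit:

```java
import java.lang.management.ManagementFactory;

// Prints the JVM arguments that were explicitly passed (e.g. -Xmx/-Xms, if any)
// and the effective maximum heap, which reflects ergonomics when no -Xmx was given.
public class JvmHeapInfo {
    public static void main(String[] args) {
        System.out.println("Explicit JVM arguments: "
                + ManagementFactory.getRuntimeMXBean().getInputArguments());
        System.out.printf("Effective max heap: %,d bytes%n",
                Runtime.getRuntime().maxMemory());
    }
}
```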
Since this issue is defined at a very high level, it might be worth splitting this into several sub-tasks. I am more accustomed to Jira, where projects can be organized into epics, issues, sub-tasks, etc. Not sure GitHub can do all of that. But these are some of the concrete action items I can think of:
- A hack in pipeline-library to display this information for our particular setup (seems like a maintenance burden to keep it up-to-date, though). If there is consensus on implementing this in the relevant plugins, I am happy to help on the development side.
- Tuning the agent JVM settings on ci.jenkins.io (especially bounding -Xmx to a low value rather than relying on JVM ergonomics, which might waste a lot of RAM). This is specific to the Jenkins infrastructure. CloudBees likely has recommendations for the -Xmx setting here, but I would imagine something like 256 MiB of RAM would suffice. This setting needs to be applied in a variety of configurations (e.g., Kubernetes agents, ACI agents, and one-shot VM agents).
- Observability (e.g., Datadog) into actual resource usage, which would allow tuning the -Xmx arguments for various Java JVMs (see the sketch below). Who knows, maybe with each JVM tuned appropriately we could run some builds on 4 GiB of RAM instead of 8 GiB. Without observability, we'll never know!

I fear we may be leaving money on the table: there are simple optimizations we are not making for lack of the observability needed to identify them. BTW, I just used Datadog in the last point as an example; there are a variety of solutions for this type of thing. But whatever solution we choose, we need to provide access to authorized users (who may not be operators).
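To illustrate the kind of per-build data such observability would give us, here is a minimal Java sketch (illustrative only, not part of any existing plugin or pipeline-library step) that reports the peak usage of each JVM memory pool; numbers like these are what you would want before deciding how far -Xmx can be lowered:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

// Prints the peak usage of every JVM memory pool. Running this at the end of a
// build (e.g. from a test listener or a build wrapper) shows how much memory the
// build actually needed, which is the input for choosing a tighter -Xmx.
public class PeakMemoryReport {
    public static void main(String[] args) {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage peak = pool.getPeakUsage(); // may be null if unsupported
            if (peak != null) {
                System.out.printf("%-30s peak=%,d bytes (max=%,d)%n",
                        pool.getName(), peak.getUsed(), peak.getMax());
            }
        }
    }
}
```

Running something like this in the Maven or test-harness JVMs would show whether the 4 GiB vs. 8 GiB question above even matters for a given build.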
We're experimenting with milestones as epic tasks.
I haven't dealt with Datadog and observability of the infra yet. I know @dduportal has some cleanup planned in this area before introducing it to me, but it's certainly on our list. Looking forward to working on this; your help and knowledge would be more than welcome.
We have Datadog sponsorship and also Elastic Cloud, from what I know. Both are probably options.
elastic cloud
Not yet, as far as I know. I'm talking with some people to move this forward.
Thanks @v1v! @lemeurherve and I were discussing how to (re)start the subject with you, as we (the jenkins-infra team) clearly did not have time in 2021 to do this. Since the OpenTelemetry plugin is installed everywhere, there is not a lot for us to do in order to benefit from this :)
Recent issues where this could have helped:
Gotta start working on it with https://github.com/jenkins-infra/helpdesk/issues/2980 (scope: VMs only, because ATH/BOM/Core builds often fail on highmem agents).
We could likely do it with OpenTelemetry ingestion into Datadog: https://www.datadoghq.com/blog/ingest-opentelemetry-traces-metrics-with-datadog-agent/
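For what it's worth, here is a minimal sketch of what pushing a JVM metric over OTLP with the OpenTelemetry Java SDK could look like. The endpoint, instrumentation scope name, and metric name are placeholders I made up; the real Datadog Agent address and its OTLP ingestion settings would come from the actual deployment:

```java
import java.time.Duration;

import io.opentelemetry.api.metrics.Meter;
import io.opentelemetry.exporter.otlp.metrics.OtlpGrpcMetricExporter;
import io.opentelemetry.sdk.metrics.SdkMeterProvider;
import io.opentelemetry.sdk.metrics.export.PeriodicMetricReader;

public class OtlpHeapGauge {
    public static void main(String[] args) throws InterruptedException {
        // Export metrics every 30s to an OTLP/gRPC endpoint, e.g. a Datadog Agent
        // with OTLP ingestion enabled (endpoint below is a placeholder).
        SdkMeterProvider meterProvider = SdkMeterProvider.builder()
                .registerMetricReader(
                        PeriodicMetricReader.builder(
                                        OtlpGrpcMetricExporter.builder()
                                                .setEndpoint("http://datadog-agent:4317")
                                                .build())
                                .setInterval(Duration.ofSeconds(30))
                                .build())
                .build();

        Meter meter = meterProvider.get("ci.jenkins.io.agent"); // placeholder scope name
        // Asynchronous gauge reporting the current JVM heap usage.
        meter.gaugeBuilder("jvm.heap.used.bytes")
                .buildWithCallback(measurement -> measurement.record(
                        Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory()));

        Thread.sleep(Duration.ofMinutes(5).toMillis()); // keep the JVM alive so readings get exported
        meterProvider.close();
    }
}
```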
As per https://github.com/jenkins-infra/helpdesk/issues/2980 we now have metrics, but no simple way to display them. A Datadog dashboard would be a solution, but in public view it loses the selector options, so there is no way to specify an agent hostname.
As a temporary solution, we will provide a new helpdesk issue type so that this information can be requested on demand: https://github.com/jenkins-infra/helpdesk/issues/new?assignees=&labels=metrics%2Ctriage&template=5-metrics.yml
As a long-term solution, we can look into Grafana or another metrics dashboard to use as a publicly visible metrics source that could provide at least CPU and memory information for the specific agent corresponding to a specific build.
Closing, as we achieved full observability in Datadog, which is useful for sizing the infrastructure and checking errors/events.
Service
ci.jenkins.io
Summary
As a developer trying to investigate flaky tests or build performance, I have no visibility into the underlying infrastructure against which to correlate failures.
Things like:
1) Did my test fail because the CPU was starved?
2) Was memory an issue (was I consuming 100% of available memory and thrashing the GC)? (link to above; a rough check is sketched below)
3) Are the memory limits applied to builds appropriate or not?
4) Could I get CI results faster if I ran more forks (Surefire) for the given hardware?
5) Is the hardware even moderately sized (hint: it is not), or what should that size be?
e.g. https://github.com/jenkinsci/jenkins/pull/6236#issuecomment-1027107241 and related discussion
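To make question 2 concrete: one rough way to tell whether a build was thrashing the GC is to compare accumulated GC time against JVM uptime. A minimal Java sketch, purely illustrative (this is not an existing harness feature):

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Rough GC-overhead check: if the JVM spent a large fraction of its uptime in GC,
// the build was likely memory-constrained and thrashing the collector.
public class GcOverhead {
    public static void main(String[] args) {
        long gcMillis = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 if undefined for this collector
            if (t > 0) {
                gcMillis += t;
            }
        }
        long uptimeMillis = ManagementFactory.getRuntimeMXBean().getUptime();
        System.out.printf("GC time: %d ms of %d ms uptime (%.1f%%)%n",
                gcMillis, uptimeMillis, 100.0 * gcMillis / uptimeMillis);
    }
}
```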
Reproduction steps