l-olsem opened 1 week ago
cc @mdealer would it be possible for you to take a look please?
What do you mean by OOM? Just a full heap is not an actual out of memory condition for garbage collected memory management.
Did Jenkins crash? If not, then it works as intended. SoftReferences are freed as soon as memory is required for something else, so test results remain in cache for as long as required. We see similar heap usage on our side too.
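The SoftReference contract can be seen in a tiny standalone sketch (plain Java, independent of Jenkins): the referent stays reachable through get() until the GC decides to reclaim it, and clear() mimics from code what the GC does under memory pressure.

```java
import java.lang.ref.SoftReference;

public class SoftRefDemo {
    public static void main(String[] args) {
        // new String(...) so the referent is only softly reachable
        // (a literal would be pinned by the constant pool):
        SoftReference<String> ref = new SoftReference<>(new String("cached test result"));

        // While memory is plentiful, the GC leaves softly reachable objects alone:
        System.out.println(ref.get()); // prints "cached test result"

        // The GC clears soft references only when it needs the memory;
        // clear() simulates that decision from code:
        ref.clear();
        System.out.println(ref.get()); // prints "null"
    }
}
```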
Yes, the Jenkins controller did indeed crash. The VM the Jenkins controller runs in has 64GiB of RAM. We had set a rather high limit for the heap; in the end it was not the JVM itself that threw an OutOfMemoryError, but the system OOM killer.
Sep 15 21:33:54 jenkinscontroller-01 systemd[1]: trendiction-jenkins.service: A process of this unit has been killed by the OOM killer.
Sep 15 21:33:56 jenkinscontroller-01 systemd[1]: trendiction-jenkins.service: Main process exited, code=killed, status=9/KILL
Sep 15 21:33:56 jenkinscontroller-01 systemd[1]: trendiction-jenkins.service: Failed with result 'oom-kill'.
I do see the heap usage going up and down during the day like you mention, but overall there is a tendency for it to keep growing slowly over the week until the machine is OOM.
So maybe it is a mistake to give it that much heap (there is nothing running next to the Jenkins controller inside the VM, so I thought 60GiB should be ok). I am going to test with 50GiB of max heap, re-enable the junit cache, and see what happens.
We use OpenJDK 17 on our side plus these JVM args:
-Xms32g
-Xmx64g
-XX:+UseG1GC
-XX:+AlwaysPreTouch
-XX:+UseStringDeduplication
-XX:+ParallelRefProcEnabled
-XX:+ExplicitGCInvokesConcurrent
Also, anecdotally, a while ago we determined that the workspace locator crashes our instance in the same way, by an OOM. We weren't able to find a correlation with resultCache at that time. Maybe something to rule out by setting a system property:
jenkins.branch.WorkspaceLocatorImpl.Mode=jenkins.branch.WorkspaceLocatorImpl.Mode.DISABLED
The latter is of a temporary nature, I think, as it might stop deleting workspaces on nodes.
Maybe you could get down to what kind of objects are actually piling up by looking at a dump?
Thank you for sharing the flags and info. I will check how to integrate/consolidate them with what we currently have. Currently we do not set an explicit GC, so we get G1 by default.
The JVM should be container-aware (we are running inside an LXC container), but only if the cgroup setup is proper, I guess:
import java.lang.management.ManagementFactory
def osBean = ManagementFactory.getOperatingSystemMXBean()
println "OS Total Physical Memory: ${osBean.getTotalPhysicalMemorySize() / (1024 * 1024)} MB"
// output: OS Total Physical Memory: 128717.92578125 MB
This is the host's RAM (128GiB), so that might be fishy.
Also, java -XshowSettings:system -version
shows Memory Limit: Unlimited
but not sure if that is relevant.
My current hypothesis is that we are lacking GC pressure, and the JVM was happy to trade GC time for memory usage (since host memory was doing fine).
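One way to cross-check both the container-awareness question and the GC-pressure hypothesis from plain Java (the same beans the Groovy snippet above uses; the cast to com.sun.management.OperatingSystemMXBean is a JDK-internal but widely available extension) is to print what the JVM believes about physical memory next to its own heap figures:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

public class MemCheck {
    public static void main(String[] args) {
        // What the JVM thinks the machine has; inside a properly limited
        // LXC/cgroup this should be the container limit, not the host's 128GiB.
        com.sun.management.OperatingSystemMXBean os =
            (com.sun.management.OperatingSystemMXBean) ManagementFactory.getOperatingSystemMXBean();
        System.out.printf("physical memory: %d MiB%n",
            os.getTotalPhysicalMemorySize() / (1024 * 1024));

        // Heap figures: with little pressure, G1 may let "used" drift toward
        // "max" instead of clearing SoftReferences.
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        System.out.printf("heap used=%d MiB committed=%d MiB max=%d MiB%n",
            heap.getUsed() >> 20, heap.getCommitted() >> 20, heap.getMax() >> 20);
    }
}
```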
Shortly before Jenkins was killed by the oom-killer, I manually triggered a heap dump and ran it through Eclipse Memory Analyzer using the leak report. Both reports point to the ConcurrentHashMap of the resultCache.
It is unclear to me how meaningful these reports are, given that the JVM might be misjudging the free RAM of the system it is running on.
In the Groovy script console, you could execute this to see how bad the map grows:
println hudson.tasks.junit.TestResultAction.resultCache.size()
And you could check what test results are actually there:
hudson.tasks.junit.TestResultAction.resultCache.each {
println it.key
}
return
And you could also run this to clear it:
hudson.tasks.junit.TestResultAction.resultCache.clear()
Also, check that the cleanup of empty SoftReferences is actually called, this value should change every now and then by itself:
println hudson.tasks.junit.TestResultAction.lastCleanupNs
Could help to get closer to the root cause of this.
I began to suspect the size of the cached results to be the problem, so I changed the config to the following, expecting this to also be a working solution in our setup:
hudson.tasks.junit.TestResultAction.resultCache.clear()
hudson.tasks.junit.TestResultAction.LARGE_RESULT_CACHE_THRESHOLD = 100
hudson.tasks.junit.TestResultAction.RESULT_CACHE_ENABLED = true
However, to my surprise, after a couple of hours I ran this:
println("RESULT_CACHE_ENABLED: ${hudson.tasks.junit.TestResultAction.RESULT_CACHE_ENABLED}")
println("LARGE_RESULT_CACHE_THRESHOLD: ${hudson.tasks.junit.TestResultAction.LARGE_RESULT_CACHE_THRESHOLD}")
println("Current resultCache size: ${hudson.tasks.junit.TestResultAction.resultCache.size()}")
println("lastCleanupNs: ${hudson.tasks.junit.TestResultAction.lastCleanupNs}")
println("diff: ${System.nanoTime() - hudson.tasks.junit.TestResultAction.lastCleanupNs}")
Sample output
RESULT_CACHE_ENABLED: true
LARGE_RESULT_CACHE_THRESHOLD: 100
Current resultCache size: 375
lastCleanupNs: 115422569121994
diff: 2220287921
The surprise is that LARGE_RESULT_CACHE_THRESHOLD is apparently not respected.
Using a job where I know that rarely anybody opens the testReport page, I can confirm that when I open the testReport, lastCleanupNs updates to a new value and the resultCache size increases by one. I was able to repeat this a couple of times using different builds. Refreshing the page did not make the counter increase.
Now the line if (v.get() == null) {
looks suspicious to me. It does not seem to ever be true for us.
I do not see a reason for this null check other than removing the key.
Is the code that should make resultCache respect LARGE_RESULT_CACHE_THRESHOLD missing or is it expected for the TestResult reference to become null at some point?
SoftReferences are nullified by the GC when it frees the contents. The GC decides this on its own, there is no hard limit. If the GC never does that for any reason, then the cache keeps growing.
LARGE_RESULT_CACHE_THRESHOLD does not trigger a cleanup of the cache, but a catch-up with GC operations; there is no hard limit here, just the available memory and the GC. The catch-up is pointless if the map is small enough (the memory taken being less of a burden than iterating through the map).
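To make the mechanism concrete, here is a minimal sketch of the caching scheme as described above (this is an illustration, not the plugin's actual code): values are held via SoftReference, puts happen unconditionally, and the threshold only triggers a sweep that drops entries the GC has already cleared.

```java
import java.lang.ref.SoftReference;
import java.util.concurrent.ConcurrentHashMap;

public class SoftCache<K, V> {
    static final int LARGE_RESULT_CACHE_THRESHOLD = 100; // illustrative value

    private final ConcurrentHashMap<K, SoftReference<V>> cache = new ConcurrentHashMap<>();

    public void put(K key, V value) {
        if (cache.size() > LARGE_RESULT_CACHE_THRESHOLD) {
            // Catch-up with the GC: remove mappings whose referent is gone.
            // If the GC never cleared anything, this removes nothing and the
            // map keeps growing -- which matches the behavior reported here.
            cache.entrySet().removeIf(e -> e.getValue().get() == null);
        }
        cache.put(key, new SoftReference<>(value));
    }

    public V get(K key) {
        SoftReference<V> ref = cache.get(key);
        return ref == null ? null : ref.get();
    }

    public int size() {
        return cache.size();
    }
}
```

Note that the sweep never evicts live entries: the threshold bounds nothing by itself, so the map size is limited only by how aggressively the GC nulls the soft references.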
So, we need to find out why the GC is not doing its job for you. Can you reduce the memory further to see what changes in behavior?
EDIT: In the meantime I will check how to extend the cleanup procedure, because in theory, as long as nothing touches the test results, the map could be growing (there is a put operation without cleanup). Though, your findings contradict this, because by disabling resultCache the put is not skipped.
I am not sure whether disabling the resultCache fixed it for sure; I would need to wait a couple of days to confirm that 100%.
EDIT: What I know is that resultCache.clear() gave me back the memory.
I tried a System.gc() just now; it did not help, the resultCache size remained the same. I will take another heap dump to try to figure out what is holding on to the TestResult.
Note that since I opened the ticket I have not changed any JVM settings; I currently only have prod available to test :-)
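For what it's worth, System.gc() alone gives no guarantee about soft references; the only hard guarantee in the SoftReference javadoc is that all softly reachable objects are cleared before the VM throws OutOfMemoryError. That can be observed in a small standalone experiment (best run with a small heap, e.g. -Xmx64m):

```java
import java.lang.ref.SoftReference;
import java.util.ArrayList;
import java.util.List;

public class SoftRefPressure {
    public static void main(String[] args) {
        SoftReference<byte[]> ref = new SoftReference<>(new byte[1024 * 1024]);
        List<byte[]> hog = new ArrayList<>();
        try {
            // Allocate until the heap is exhausted; the VM must clear all
            // softly reachable objects before giving up with OutOfMemoryError.
            while (true) {
                hog.add(new byte[8 * 1024 * 1024]);
            }
        } catch (OutOfMemoryError expected) {
            hog.clear(); // release the strongly held memory so we can continue
        }
        // Per the SoftReference contract, the referent is gone by now:
        System.out.println("after pressure: " + ref.get());
    }
}
```

This matches the behavior discussed above: as long as the heap ceiling is far away, the GC has no reason to null the references, and System.gc() does not change that.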
Try this to check if it is full of nulls or not:
hudson.tasks.junit.TestResultAction.resultCache.each {
println it.value.get()
}
return
No, not a single null; I am getting a long list of the kind java.lang.ref.SoftReference@25a64b02
I forgot .get()
at the end...
Still not a single null, lots of the kind hudson.tasks.junit.TestResult@5aeee99c
Observe this for some time and report back.
When I go to a specific entry of TestResult in the heap dump and list all incoming references, I get this
Which maybe looks normal, I do not know.
What confuses me is what I see when I open resultCache
I am not sure how to interpret the mention of the class TestResultAction after opening the dropdown
The reference from EntrySetView could be the .forEach that cleans up the nulls, but all of these should be SoftReference, so unless there is a bug in this specific constellation with this GC, not sure what to try other than the VM options.
We run tests until the disk runs out of space on our side, so I need to dig a bit more into your setup. Next to try would be the JVM args and system props, plus observations.
Also, is it possible that we run into https://bugs.openjdk.org/browse/JDK-8192647 by a coincidence like those workspace locator threads?
Jenkins and plugins versions report
Environment
```text
Jenkins: 2.462.1
OS: Linux - 6.8.4-3-pve
Java: 21.0.2
```
What Operating System are you using (both controller, and any agents involved in the problem)?
Controller: Ubuntu 20.04.6 LTS
Agents: Ubuntu 20.04.6 LTS
Reproduction steps
Having the resultCache of TestResultAction enabled caused a slow memory build-up for us. The problem seems to have started 3-4 weeks ago, when we probably updated plugins along with a needed restart we had to do.
From then on, within a week's time, more and more heap was being used until we were exhausting the 60GiB max heap.
Expected Results
Memory usage should not steadily increase over time.
Actual Results
Memory steadily building up until OOM. Note that the smaller hill in the middle was due to an unrelated controller restart.
Sadly I cannot recall (or know where to look up) what changed around 3-4 weeks ago when the memory build-up started to happen. But it looks like a recent change caused this.
Anything else?
Clearing the resultCache (resultCache.clear()) freed up the 'lost' memory
We disabled the cache via
hudson.tasks.junit.TestResultAction.RESULT_CACHE_ENABLED
and from now (26/09/2024) on, it looks like the memory build-up is not happening anymore.
Are you interested in contributing a fix?
Potentially yes, if I can find some time; I am not an expert in Java though, so I might not be able to identify the source of the leak.