opendevstack / ods-jenkins-shared-library

Shared Jenkins library which all ODS projects & components use - provisioning, SonarQube code scanning, Nexus publishing, OpenShift template based deployments and repository orchestration
Apache License 2.0

releasemanager (with EDP, JIRA): memory used/consumed during a run is not reclaimed / freed up after the run #857

Open clemensutschig opened 2 years ago

clemensutschig commented 2 years ago

We run Jenkins with EDP - for document generation - and can see that the memory used during the run is not reclaimed / freed up afterwards

image

the run started at 5:44 pm and was done at 11 pm CET - as you can see, no GC / no reclaim

@braisvq1996 @albertpuente

metmajer commented 2 years ago

@clemensutschig we looked into this. What we detected is that the available memory is immediately consumed upon start, starting at 10 GB and quickly going up to 14 GB. We also see regular garbage collector activity, and that Jenkins' underlying Hudson implementation keeps up to 5 M references to ArrayList$Itr objects, which are small, though, and only sum up to roughly 150 MB.

We do see regular suspension times induced by the GC, which are related to promoting objects from the Eden space into the old generation. I assume that our configuration has something to do with this, but it came as a recommendation from CloudBees, see https://github.com/opendevstack/ods-core/issues/412.
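For reference, a minimal sketch (not from this thread) of how to watch GC and generation usage on the running master; the pid discovery via jps is an assumption that works because only one JVM runs in the container:

    # print Eden/Old/Metaspace utilisation and GC counts/times every 5 seconds
    PID=$(jps | grep -v Jps | cut -d ' ' -f1)
    jstat -gcutil "$PID" 5000
    # one-off snapshot of the current heap layout
    jcmd "$PID" GC.heap_info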

metmajer commented 2 years ago
(six screenshots, taken 2022-03-02, with the monitoring data)
metmajer commented 2 years ago

@s2oBCN @victorpablosceruelo @angelmp01 will the ongoing refactorings of the LeVA DocGen service affect memory consumption?

clemensutschig commented 2 years ago

hey @metmajer this is interesting - do we have a view of what is stored in these extensionLists ...? I took a look at what the docs say about this .. maybe our culprit is here ..?!

metmajer commented 2 years ago

@clemensutschig yes, it might make a difference, but it could depend on the GC. Let's not take for granted what CloudBees recommended some time ago. The following article might be interesting: https://developers.redhat.com/articles/2021/11/02/how-choose-best-java-garbage-collector

victorpablosceruelo commented 2 years ago

@s2oBCN @victorpablosceruelo @angelmp01 will the ongoing refactorings of the LeVA DocGen service affect memory consumption?

Hi there. Yes, it should affect the memory consumption because we will stop using some SharedLibrary functionality (that will be removed later), but we do not know yet if it will really be the improvement we are looking for.

We are requesting access to the monitoring tools you show in this issue. The more details we have, the more precise our answer will be.

clemensutschig commented 2 years ago

image

4.x with a super simple test case (RM + Jira + one (failing) test component) - it's not a lot per run .. but it's definitely accumulating .. the start of the graph is a restart of Jenkins - to have a clean baseline .. and then one run after another ..

20 runs - one gig of RAM that is not reclaimed ..
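A quick way to tell retained memory from garbage the GC simply has not collected yet is to request a full GC and compare the heap before and after - a sketch, assuming a single JVM in the container and a JDK 11 jcmd:

    PID=$(jps | grep -v Jps | cut -d ' ' -f1)
    jcmd "$PID" GC.heap_info   # before
    jcmd "$PID" GC.run         # request an explicit full GC (System.gc())
    jcmd "$PID" GC.heap_info   # after - whatever remains is actually referenced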

metmajer commented 2 years ago

Here is an overview on the Jenkins configuration we are currently using: https://github.com/opendevstack/ods-core/issues/412#issuecomment-601589479

clemensutschig commented 2 years ago

even when keeping only 2 failures (via build config, and hence deleting the rest) - there is no difference: image

we see the same growth .. if you zoom in - you see GC kicking in - and reclaiming 10 or fewer megs after every run - but overall we see a (close to) linear increase. This leads me (for now) to believe that there is a mem leak.

clemensutschig commented 2 years ago

even after a whole night - no real GC reclaim happened ... I am now trying to get this thing to an OOM

image

clemensutschig commented 2 years ago

hurray :)

Terminated at Mar 9, 2022 1:40:55 PM with exit code 137 (OOMKilled) (5103megs)
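The OOMKill can also be confirmed from the pod status - a sketch, the pod name and namespace are placeholders:

    oc -n <cd-namespace> get pod <jenkins-pod> \
      -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}{"\n"}'
    # prints OOMKilled when the kernel killed the container
    oc -n <cd-namespace> describe pod <jenkins-pod> | grep -A 5 'Last State'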
metmajer commented 2 years ago

@clemensutschig you're the first person that's happy about an OOM :) Are you able to get better insights? Thinking that the GC options might need a re-evaluation.

clemensutschig commented 2 years ago

I am trying to get better insights - and am cleaning up all sorts of things in my branch now - to see if that impacts anything ..

what's interesting - I have turned on cleanup (through build config) of failed builds .. take a look below:

org.jenkinsci.plugins.workflow.job.WorkflowRun finish testbug-cd/testbug-cd-release-master #185 completed: FAILURE
2022-03-09 13:03:05 INFO    io.fabric8.jenkins.openshiftsync.BuildSyncRunListener onCompleted onCompleted job/testbug-cd/job/testbug-cd-release-master/185/
2022-03-09 13:03:05 INFO    io.fabric8.jenkins.openshiftsync.JenkinsUtils deleteRun Deleting run: testbug-cd/testbug-cd-release-master #183
2022-03-09 13:03:05 INFO    io.fabric8.jenkins.openshiftsync.BuildSyncRunListener onFinalized onFinalized job/testbug-cd/job/testbug-cd-release-master/185/
2022-03-09 13:03:05 INFO    io.fabric8.jenkins.openshiftsync.BuildSyncRunListener onDeleted onDeleted job/testbug-cd/job/testbug-cd-release-master/183/
2022-03-09 13:03:05 INFO    org.csanchez.jenkins.plugins.kubernetes.KubernetesClientProvider gracefulClose Not closing io.fabric8.kubernetes.client.DefaultKubernetesClient@76b3c497: there are still running (1) or queued (0) calls
2022-03-09 13:03:05 WARNING io.fabric8.jenkins.openshiftsync.JenkinsUtils deleteRun Unable to delete run testbug-cd/testbug-cd-release-master #183:Unable to delete '/var/lib/jenkins/jobs/testbug-cd/jobs/testbug-cd-release-master/builds/.183'. Tried 3 times (of a maximum of 3) waiting 0.1 sec between attempts.
2022-03-09 13:05:35 INFO    org.csanchez.jenkins.plugins.kubernetes.KubernetesClientProvider gracefulClose Not closing io.fabric8.kubernetes.client.DefaultKubernetesClient@76b3c497: there are still running (1) or queued (0) calls
clemensutschig commented 2 years ago

ok - more tries (only keeping 2 failed runs, which the ones below are). All data is container memory as shown in the OCP console:

restart: 3246m mem used

17:27: start FIRST no JIRA NO DOCS rm run
17:32: run done - 3440m mem used (+155m)

17:43: start SECOND no JIRA NO DOCS rm run (agent pods were removed)
17:49: run done - 3515m mem used (+75m)
17:58: 3510m mem used (GC: -4m)

17:59: start THIRD no JIRA NO DOCS rm run (agent pods were removed)
18:06: run done - 3565m mem used (+50m)
18:15: 3561m mem used (GC: -4m)

18:15: start FOURTH no JIRA NO DOCS rm run (agent pods were removed)
18:21: run done - 3634m mem used (+70m)

I am trying to understand why this is steadily growing - and I really CANNOT ...

clemensutschig commented 2 years ago

ps - in case someone is curious - I added

    } finally {
      // release everything the run holds on to, so the run's classloader can be collected
      logger.debug('-- SHUTTING DOWN RM (..) --')
      logger.resetStopwatch()
      project.clear()
      ServiceRegistry.instance.clear()
      UnirestConfig.shutdown()
      // null out the references held by this script itself
      project = null
      git = null
      logger = null
      repos = null
      steps = null
    }

to the odsOrchestrationStage - almost at the very root ... :( and still nothing changes

clemensutschig commented 2 years ago

I am really, completely, utterly confused ...: xmx = 1g (top shows the java process at 1 gig) and I am happily running component jobs ..

however my overall container memory is still growing ... ?! (container memory set to min 4g / max 5g) - 1 pod and ONE container only

image

clemensutschig commented 2 years ago

image

how come java goes to 1.3g?
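The heap capped by -Xmx is only part of the JVM footprint (metaspace, code cache, thread stacks and native allocations come on top, and the container graph also counts page cache), so it helps to line the three numbers up - a sketch, assuming cgroup v1 as on the OCP nodes of that era:

    PID=$(jps | grep -v Jps | cut -d ' ' -f1)
    jcmd "$PID" GC.heap_info                                        # Java heap only (capped by -Xmx)
    ps -o rss= -p "$PID" | awk '{printf "RSS: %.0f MB\n", $1/1024}' # whole process: heap + metaspace + code cache + threads + native
    cat /sys/fs/cgroup/memory/memory.usage_in_bytes                 # cgroup usage shown by the console (includes page cache)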

clemensutschig commented 2 years ago

ok - now I used max 3g (limits 3Gi) - with 0.5 heap percentage and -Xmx1g.

Top: 1.3g for Java, and overall pod memory at 1600m and still growing ...

from the logs:

exec java -server -Xmx1g -Dfile.encoding=UTF8 -Djavamelody.displayed-counters=log,error -Djava.util.logging.config.file=/var/lib/jenkins/logging.properties -Duser.home=/var/lib/jenkins -Djavamelody.application-name=JENKINS -Dhudson.security.csrf.GlobalCrumbIssuerConfiguration.DISABLE_CSRF_PROTECTION=true -Djenkins.install.runSetupWizard=false -Dhudson.tasks.MailSender.SEND_TO_UNKNOWN_USERS=true -Dhudson.tasks.MailSender.SEND_TO_USERS_WITHOUT_READ=true -jar /usr/lib/jenkins/jenkins.war
Picked up JAVA_TOOL_OPTIONS: -XX:+UnlockExperimentalVMOptions -Dsun.zip.disableMemoryMapping=true
Running from: /usr/lib/jenkins/jenkins.war
webroot: EnvVars.masterEnvVars.get("JENKINS_HOME")
clemensutschig commented 2 years ago

VM flags (based on xms / xmx = 1g and metaspace 1g):

-XX:CICompilerCount=2 -XX:CompressedClassSpaceSize=1040187392 -XX:InitialHeapSize=1073741824 -XX:MaxHeapSize=1073741824 -XX:MaxMetaspaceSize=1048576000 -XX:MaxNewSize=357892096 -XX:MinHeapDeltaBytes=196608 -XX:NewSize=357892096 -XX:NonNMethodCodeHeapSize=5825164 -XX:NonProfiledCodeHeapSize=122916538 -XX:OldSize=715849728 -XX:ProfiledCodeHeapSize=122916538 -XX:ReservedCodeCacheSize=251658240 -XX:+SegmentedCodeCache -XX:+UnlockExperimentalVMOptions -XX:+UseCompressedClassPointers -XX:+UseCompressedOops -XX:+UseFastUnorderedTimeStamps -XX:+UseSerialGC 
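For reference, a flag dump like the one above can be taken from the running master at any time - a sketch:

    PID=$(jps | grep -v Jps | cut -d ' ' -f1)
    jcmd "$PID" VM.flags          # final -XX flags in effect
    jcmd "$PID" VM.command_line   # full command line, including JAVA_TOOL_OPTIONS additions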

more diagnosis .. (I have one test component ODS job that runs over and over again)

image

vs a couple minutes later

image

vs a couple mins later

image

stitakis commented 2 years ago

@clemensutschig I'd like to do an analysis to identify memory leaks. How can I get the heap dump file? Did you generate a heap dump?

clemensutschig commented 2 years ago

native memory tracking:

12:21 UTC Total: reserved=2773758KB +72012KB, committed=1508758KB +131592KB

12:30 UTC Total: reserved=2759677KB +57931KB, committed=1506845KB +129679KB

I am now forcing MaxRAMPercentage=90 .. let's see where that gets us .. see below - Class is going up, both reserved and committed - although it's the exact same job running, from the exact same commit!

-                     Class (reserved=1287679KB +79246KB, committed=267303KB +87822KB)
                            (classes #31186 +4625)
                            (  instance classes #29931 +4546, array classes #1255 +79)
                            (malloc=7679KB +1422KB #178934 +31441)
                            (mmap: reserved=1280000KB +77824KB, committed=259624KB +86400KB)
                            (  Metadata:   )
                            (    reserved=231424KB +77824KB, committed=230184KB +78080KB)
                            (    used=188192KB +46996KB)
                            (    free=41992KB +31084KB)
                            (    waste=0KB =0.00%)
                            (  Class space:)
                            (    reserved=1048576KB, committed=29440KB +8320KB)
                            (    used=20731KB +3677KB)
                            (    free=8709KB +4643KB)
                            (    waste=0KB =0.00%)

the same 10 mins later:

-                     Class (reserved=1287869KB +79436KB, committed=267493KB +88012KB)
                            (classes #32246 +5685)
                            (  instance classes #30974 +5589, array classes #1272 +96)
                            (malloc=7869KB +1612KB #182833 +35340)
                            (mmap: reserved=1280000KB +77824KB, committed=259624KB +86400KB)
                            (  Metadata:   )
                            (    reserved=231424KB +77824KB, committed=230184KB +78080KB)
                            (    used=197988KB +56792KB)
                            (    free=32196KB +21288KB)
                            (    waste=0KB =0.00%)
                            (  Class space:)
                            (    reserved=1048576KB, committed=29440KB +8320KB)
                            (    used=21590KB +4536KB)
                            (    free=7850KB +3784KB)
                            (    waste=0KB =0.00%)

10 mins later:

                     Class (reserved=1296679KB +88246KB, committed=276303KB +96822KB)
                            (classes #35240 +8679)
                            (  instance classes #33927 +8542, array classes #1313 +137)
                            (malloc=8487KB +2230KB #193209 +45716)
                            (mmap: reserved=1288192KB +86016KB, committed=267816KB +94592KB)
                            (  Metadata:   )
                            (    reserved=239616KB +86016KB, committed=238376KB +86272KB)
                            (    used=223608KB +82412KB)
                            (    free=14768KB +3860KB)
                            (    waste=0KB =0.00%)
                            (  Class space:)
                            (    reserved=1048576KB, committed=29440KB +8320KB)
                            (    used=24045KB +6991KB)
                            (    free=5395KB +1329KB)
                            (    waste=0KB =0.00%)

and now 20 mins later - something starts to collect

-                     Class (reserved=1299973KB +91541KB, committed=279981KB +100501KB)
                            (classes #30457 +3896)
                            (  instance classes #29216 +3831, array classes #1241 +65)
                            (malloc=7685KB +1429KB #178768 +31275)
                            (mmap: reserved=1292288KB +90112KB, committed=272296KB +99072KB)
                            (  Metadata:   )
                            (    reserved=243712KB +90112KB, committed=242216KB +90112KB)
                            (    used=180605KB +39409KB)
                            (    free=61611KB +50703KB)
                            (    waste=0KB =0.00%)
                            (  Class space:)
                            (    reserved=1048576KB, committed=30080KB +8960KB)
                            (    used=20172KB +3118KB)
                            (    free=9908KB +5842KB)
                            (    waste=0KB =0.00%)

take a look at the class increase ... !!!! ( instance classes #29931 +4546, array classes #1255 +79) vs: ( instance classes #30974 +5589, array classes #1272 +96) vs: ( instance classes #33927 +8542, array classes #1313 +137) vs AFTER COLLECT: ( instance classes #29216 +3831, array classes #1241 +65)

what's interesting: metadata reserved=243712KB +90112KB grows steadily ...
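For reference, the +NNNKB diffs above come from NMT's baseline/diff workflow - a sketch, assuming the JVM was started with -XX:NativeMemoryTracking=summary (as it is later in this thread):

    PID=$(jps | grep -v Jps | cut -d ' ' -f1)
    jcmd "$PID" VM.native_memory baseline       # take a baseline right after a clean restart
    # ... run a few pipelines ...
    jcmd "$PID" VM.native_memory summary.diff   # every category is reported relative to the baseline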

clemensutschig commented 2 years ago

https://support.cloudbees.com/hc/en-us/articles/360029574172-Metaspace-Leak-Due-to-classes-not-being-cleaned-up

but unfortunately - we are using [Jenkins 2.289.1] :(

clemensutschig commented 2 years ago
                     Class (reserved=818996KB +191857KB, committed=352564KB +213617KB)
                            (classes #42887 +21410)
                            (  instance classes #41126 +20650, array classes #1761 +760)
                            (malloc=12084KB +7537KB #243483 +135258)
                            (mmap: reserved=806912KB +184320KB, committed=340480KB +206080KB)
                            (  Metadata:   )
                            (    reserved=303104KB +184320KB, committed=301568KB +184064KB)
                            (    used=253997KB +145475KB)
                            (    free=47571KB +38589KB)
                            (    waste=0KB =0.00%)
                            (  Class space:)
                            (    reserved=503808KB, committed=38912KB +22016KB)
                            (    used=28052KB +14529KB)
                            (    free=10860KB +7487KB)
                            (    waste=0KB =0.00%)

running classloader stats: VM.classloader_stats

Total = 6316                                                 43487  317526016  293961896  
ChunkSz: Total size of all allocated metaspace chunks
BlockSz: Total size of all allocated metaspace blocks (each chunk has several blocks)

with now another run (with shared lib)

Total = 6749                                                 45205  331855872  307339496  
ChunkSz: Total size of all allocated metaspace chunks
BlockSz: Total size of all allocated metaspace blocks (each chunk has several blocks)

and now withOUT shared lib import (just echo ...)

Total = 6763                                                 45212  331902976  307372080  
ChunkSz: Total size of all allocated metaspace chunks
BlockSz: Total size of all allocated metaspace blocks (each chunk has several blocks)

and now with shared lib import (and echo .... )

Total = 6787                                                 45219  331950080  307417400  
ChunkSz: Total size of all allocated metaspace chunks
BlockSz: Total size of all allocated metaspace blocks (each chunk has several blocks)

sooo this is a classloader reference leak - something hangs on to classloaders - and it reproduces W/O any shared lib - it just gets worse with shared libs ... :/
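A sketch of how to see which loaders actually hold the metaspace chunks (column 5 of VM.classloader_stats is ChunkSz, as in the output quoted further down in this thread):

    PID=$(jps | grep -v Jps | cut -d ' ' -f1)
    # 20 classloaders with the largest metaspace chunk size
    jcmd "$PID" VM.classloader_stats | sort -k5 -n | tail -20
    # count the per-run classloaders left behind
    jcmd "$PID" VM.classloader_stats | grep -c Clean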

clemensutschig commented 2 years ago

ok - some more testing with classloader stats

after start (NO running jobs):

Total = 897                                                  21312  135653376  123512712  
ChunkSz: Total size of all allocated metaspace chunks
BlockSz: Total size of all allocated metaspace blocks (each chunk has several blocks)

after running a naked job (NO shared lib - just echo)

Total = 1010                                                 22726  144394240  132021864  
ChunkSz: Total size of all allocated metaspace chunks
BlockSz: Total size of all allocated metaspace blocks (each chunk has several blocks)

wait a few mins ....

Total = 1038                                                 22744  144835584  132366424  
ChunkSz: Total size of all allocated metaspace chunks
BlockSz: Total size of all allocated metaspace blocks (each chunk has several blocks)

after running a job (with shared lib import and echo)

Total = 1139                                                 22870  145810432  133105960  
ChunkSz: Total size of all allocated metaspace chunks
BlockSz: Total size of all allocated metaspace blocks (each chunk has several blocks)

running the same job again

Total = 1183                                                 22907  146402304  133509704  
ChunkSz: Total size of all allocated metaspace chunks
BlockSz: Total size of all allocated metaspace blocks (each chunk has several blocks)

this is really scary ... :(

clemensutschig commented 2 years ago

I started to play with the heap dump - and classloaders .... and all the big classloaders have a common ancestor:

image

if I look into the classloader - this is definitely ours ... (with the grapes being there ... )

image

but there is a second guy holding on to it (the first is the finalizer of the VM that tries to clean it) - and that starts to spark my interest ... (via incoming references)

image

this is an enum .. hmmmm

clemensutschig commented 2 years ago

The annoying thing is that in a second thread dump, hours later, these guys are gone ;-( so it must be something else ..

https://issues.jenkins.io/plugins/servlet/mobile#issue/JENKINS-50223

clemensutschig commented 2 years ago

Ok - findings:

a) 4g min/max request memory on the container level does it - with the following settings:
b) -server -XX:NativeMemoryTracking=summary -XX:MaxRAMPercentage=90 -XX:+UseG1GC -XX:+ExplicitGCInvokesConcurrent -XX:+ParallelRefProcEnabled -XX:+UseStringDeduplication -XX:MaxMetaspaceSize=1g -XX:MetaspaceSize=256M
c) -Xms1024m -Xmx1g -Dgroovy.use.classvalue=true

I am running the release manager over and over now .. but so far - I could not break it :)

The default setup we provide is definitely NOT good (5g container limit / 4g xmx and 1g MaxMetaspace) - this is definitely not sized correctly ... in my setup (as above) I see

Top Java process: ~2.9g. Java native memory: Total: reserved=2895963KB +245454KB, committed=1673107KB +309194KB

and container showing 4g .. (with max being 4g).
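Since both -Xmx1g and -XX:MaxRAMPercentage=90 are on the command line, it is worth checking which heap limit actually wins (an explicit -Xmx overrides the percentage) - a sketch:

    # resolved heap size for this flag combination
    java -Xmx1g -XX:MaxRAMPercentage=90 -XX:+PrintFlagsFinal -version | grep -w MaxHeapSize
    # or directly on the running master
    jcmd $(jps | grep -v Jps | cut -d ' ' -f1) VM.flags | tr ' ' '\n' | grep MaxHeapSize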

clemensutschig commented 2 years ago

still there is the leak - albeit less .. (we have 40+ of the below) - which in fact could be one per run: image

per run ....

image

working assumption:
a) the grape import (https://github.com/opendevstack/ods-core/blob/master/jenkins/master/configuration/init.groovy.d/ods-jenkins-shared-library.groovy#L87) that is done globally (and hence, we guess, in some weird root loader, doing even more weird stuff ... :( )
b) enums that also hold on to the classloader

@stitakis will try to refactor (b) while I am on (a) ...

clemensutschig commented 2 years ago

to go here: https://github.com/opendevstack/ods-core/blob/master/jenkins/master/ods-run.sh#L96

sh-4.2$ cat loadgrapes.sh 
#!/bin/sh

echo $HOME
ivyhome=/tmp/ivy

unzip -o $HOME/plugins/workflow-cps-global-lib.jpi -d $ivyhome 

java -cp $HOME/war/WEB-INF/lib/groovy-all-2.4.12.jar:$ivyhome/WEB-INF/lib/ivy-2.4.0.jar -Dgrape.config=$HOME/.groovy/grapeConfig.xml -Duser.home=$HOME groovy.ui.GroovyMain loadgrapes.groovy

rm -rf $ivyhome
sh-4.2$ cat loadgrapes.groovy 
println "INFO: Grabbing grapes to avoid race condition during parallel compilation ..."
groovy.grape.Grape.grab(group: 'com.vladsch.flexmark', module: 'flexmark-all', version: '0.60.2')
groovy.grape.Grape.grab(group: 'fr.opensagres.xdocreport', module: 'fr.opensagres.poi.xwpf.converter.core', version: '2.0.2')
groovy.grape.Grape.grab(group: 'fr.opensagres.xdocreport', module: 'fr.opensagres.poi.xwpf.converter.pdf', version: '2.0.2')
groovy.grape.Grape.grab(group: 'org.apache.pdfbox', module: 'pdfbox', version: '2.0.17')
groovy.grape.Grape.grab(group: 'org.apache.poi', module: 'poi', version: '4.0.1')
groovy.grape.Grape.grab(group: 'net.lingala.zip4j', module: 'zip4j', version: '2.1.1')
groovy.grape.Grape.grab(group: 'org.yaml', module: 'snakeyaml', version: '1.24')
groovy.grape.Grape.grab(group: 'com.konghq', module: 'unirest-java', version: '2.4.03', classifier: 'standalone')

println "Done"
sh-4.2$ 
clemensutschig commented 2 years ago

ok - update ... I think we have found the leak ... an assumption to be proven

we see circular references from classes loaded via the CleanGroovyClassLoader (=> the RUN classloader) into the AntClassLoader (from the CPS plugin) .. image

and image

e.g. one example:

OdsProjectMessagesUtil (c2746d78) -> CleanGroovyClassloader (ceeed928)  -> AntClassloader (c1339b18)
----references-------  -->    CPS Function (c2746e30)  -> AntClassloader (c1339b18)
----is referenced by   <--    ConstantBlock (cc525d40) -> AntClassloader (c1339b18)

hence even if the CleanGroovyClassloader is released, the CPS function will still hold on to it through the shared lib class

and ...

https://github.com/jenkinsci/workflow-cps-plugin/blob/f522e783e917bc0aa0513a29684285b710163905/src/test/java/org/jenkinsci/plugins/workflow/cps/CpsFlowExecutionMemoryTest.java

and https://github.com/jenkinsci/jenkins-test-harness/blob/master/src/main/java/org/jvnet/hudson/test/MemoryAssert.java (which is really scary ... )

what we cannot explain yet ... is whether this is driven by the shared lib codebase or is a generic Jenkins "feature"

clemensutschig commented 2 years ago

image

clemensutschig commented 2 years ago

from yesterday night - to today - 14 more classloaders generated ..

sh-4.2$ jcmd 442 VM.classloader_stats | grep Clean | wc -l
54

native memory (clearly metadata ran out ... reserved=940032KB +819200KB, committed=937472KB +818176KB):

Native Memory Tracking

Total: reserved=3602992KB +911641KB, committed=2574916KB +1183581KB

-                 Java Heap (reserved=1048576KB, committed=1048576KB)
                            (mmap: reserved=1048576KB, committed=1048576KB)

-                     Class (reserved=2008854KB +842907KB, committed=1077014KB +936091KB)
                            (classes #143195 +121354)
                            (  instance classes #136992 +116162, array classes #6203 +5192)
                            (malloc=28438KB +23707KB #527779 +416505)
                            (mmap: reserved=1980416KB +819200KB, committed=1048576KB +912384KB)
                            (  Metadata:   )
                            (    reserved=940032KB +819200KB, committed=937472KB +818176KB)
                            (    used=903669KB +792891KB)
                            (    free=33803KB +25285KB)
                            (    waste=0KB =0.00%)
                            (  Class space:)
                            (    reserved=1040384KB, committed=111104KB +94208KB)
                            (    used=99075KB +85519KB)
                            (    free=12029KB +8689KB)
                            (    waste=0KB =0.00%)

-                    Thread (reserved=74326KB -11361KB, committed=8150KB -1013KB)
                            (thread #72 -11)
                            (stack: reserved=73984KB -11308KB, committed=7808KB -960KB)
                            (malloc=260KB -40KB #434 -66)
                            (arena=82KB -13 #142 -22)

-                      Code (reserved=264544KB +12756KB, committed=234484KB +181164KB)
                            (malloc=16856KB +12756KB #71476 +54545)
                            (mmap: reserved=247688KB, committed=217628KB +168408KB)

-                        GC (reserved=114146KB +29300KB, committed=114146KB +29300KB)
                            (malloc=42326KB +29300KB #551760 +374043)
                            (mmap: reserved=71820KB, committed=71820KB)

-                  Compiler (reserved=867KB +355KB, committed=867KB +355KB)
                            (malloc=734KB +355KB #4206 +2749)
                            (arena=133KB #5)

-                  Internal (reserved=4563KB +3144KB, committed=4563KB +3144KB)
                            (malloc=4531KB +3144KB #6287 +3466)
                            (mmap: reserved=32KB, committed=32KB)

-                     Other (reserved=418KB +115KB, committed=418KB +115KB)
                            (malloc=418KB +115KB #77 +30)

-                    Symbol (reserved=31401KB +11391KB, committed=31401KB +11391KB)
                            (malloc=29124KB +11263KB #481366 +223133)
                            (arena=2278KB +128 #1)

-    Native Memory Tracking (reserved=26669KB +17611KB, committed=26669KB +17611KB)
                            (malloc=89KB +65KB #1152 +847)
                            (tracking overhead=26580KB +17546KB)

-        Shared class space (reserved=17024KB, committed=17024KB)
                            (mmap: reserved=17024KB, committed=17024KB)

-               Arena Chunk (reserved=179KB -4638KB, committed=179KB -4638KB)
                            (malloc=179KB -4638KB)

-                   Logging (reserved=4KB, committed=4KB)
                            (malloc=4KB #192)

-                 Arguments (reserved=19KB, committed=19KB)
                            (malloc=19KB #519)

-                    Module (reserved=9318KB +8280KB, committed=9318KB +8280KB)
                            (malloc=9318KB +8280KB #37988 +32665)

-              Synchronizer (reserved=2075KB +1782KB, committed=2075KB +1782KB)
                            (malloc=2075KB +1782KB #17628 +15177)

-                 Safepoint (reserved=8KB, committed=8KB)
                            (mmap: reserved=8KB, committed=8KB)

heap dump here we come ...
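For reference, a sketch of capturing such a dump with jcmd (the output path is an assumption; -all also keeps unreachable objects, which helps when hunting dangling classloaders):

    PID=$(jps | grep -v Jps | cut -d ' ' -f1)
    jcmd "$PID" GC.heap_dump -all /var/lib/jenkins/jenkins-heap.hprof
    # analyse offline, e.g. "path to GC roots" on the CleanGroovyClassLoader instances in Eclipse MAT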

clemensutschig commented 2 years ago

I am now setting "aggressive" collection on soft references ... although our heap is super small (1g) - maybe the problem lies here (-XX:SoftRefLRUPolicyMSPerMB=0)

after start and a pause, no CleanGroovyClassLoader, as expected:

sh-4.2$ jcmd 441 VM.classloader_stats | grep Clean | wc -l
0

after 11 runs - the loaders stay alive :(:

sh-4.2$ jcmd 441 VM.classloader_stats | grep Clean | wc -l
11

after 1 run with the shared lib (but only returning success)

sh-4.2$ jcmd 441 VM.classloader_stats | grep Clean | wc -l
12
sh-4.2$ 
clemensutschig commented 2 years ago

Ok, more testing ...
a) I tried with simple test cases (actually a battery, simulating what we do in the shared lib ... none of this keeps the classloader dangling)
b) Our shared lib implementation MUST have something special I have not found yet - that causes the classloader to hang ... :(

@stitakis

clemensutschig commented 2 years ago

ok - quick update - I can reproduce it with the ods component pipeline ONLY (on 4.x) ... the key is that each run has a new commit (e.g. a dummy commit on the Jenkinsfile of the component! - not the shared lib) - multiple runs of the same commit of the component REPO do NOT create hanging classloaders

this also explains why we see it in the RM as well ... (because RM runs create new commits, so every run has new commits)

This is because of the Jenkins stage findImageOr ... so remove that..

clemensutschig commented 2 years ago

more updates - (on our testfix branch)

if someone wants to play ... here is what I use to get the classloaders:

sh-4.2$ jcmd $(jps | cut -d ' ' -f1) VM.classloader_stats | grep Clean

The test I did was to add one stage at a time (and 10 runs each):
a) stageBuild (nothing)
b) odsComponentStageBuildOpenShiftImage (nothing)
c) odsComponentStageScanWithSonar (dangling loaders)

the odsDeploymentStage is interesting as well - it looks like there is a classloader dangling after the run - and when a new run (of the same job def) runs later, that loader is suddenly freed up ...

@stitakis @michaelsauter @oalyman
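A sketch of how the per-stage bisect can be watched from a second terminal while the runs are triggered (the grep relies on the CleanGroovyClassLoader naming shown above):

    PID=$(jps | grep -v Jps | cut -d ' ' -f1)
    # print the number of leftover run classloaders once a minute
    while true; do
      printf '%s Clean loaders: ' "$(date -u +%H:%M:%S)"
      jcmd "$PID" VM.classloader_stats | grep -c Clean
      sleep 60
    done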

jafarre-bi commented 2 years ago

A path to a GC-root: image

clemensutschig commented 2 years ago

quick update after two weeks (it seems we have found / cleared the CleanGroovyClassLoader leak) - by cleaning up weak references ourselves (which is actually what happens inside Jenkins as well 👎 ) ...

I'll let it run overnight now - to see what, if anything, else is leaking

for history - after 8 full EDP RM runs now:

https://docs.oracle.com/en/java/javase/14/docs/specs/man/jcmd.html

VM.native_memory summary.diff (I created a baseline after restart of Jenkins)

Native Memory Tracking:

Total: reserved=2852887KB +165708KB, committed=1602271KB +218768KB

-                     Class (reserved=1251828KB +88274KB, committed=235820KB +98058KB)
                            (classes #31047 +9815)
                            (  instance classes #29679 +9440, array classes #1368 +375)
                            (malloc=8692KB +4306KB #192405 +86623)
                            (mmap: reserved=1243136KB +83968KB, committed=227128KB +93752KB)
                            (  Metadata:   )
                            (    reserved=202752KB +83968KB, committed=201016KB +84024KB)
                            (    used=171380KB +62814KB)
                            (    free=29636KB +21210KB)
                            (    waste=0KB =0.00%)
                            (  Class space:)
                            (    reserved=1040384KB, committed=26112KB +9728KB)
                            (    used=19575KB +6399KB)
                            (    free=6537KB +3329KB)
                            (    waste=0KB =0.00%)

-                    Thread (reserved=136296KB +49575KB, committed=15728KB +6359KB)
                            (thread #132 +48)
                            (stack: reserved=135664KB +49344KB, committed=15096KB +6128KB)
                            (malloc=479KB +175KB #794 +288)
                            (arena=153KB +56 #262 +96)

after 20 more runs: still only two classloaders!!!! - so the first part seems to be fixed

Total: reserved=2794761KB +107583KB, committed=1613785KB +230283KB

-                     Class (reserved=1250947KB +87394KB, committed=234939KB +97178KB)
                            (classes #28820 +7588)
                            (  instance classes #27503 +7264, array classes #1317 +324)
                            (malloc=7811KB +3426KB #184590 +78808)
                            (mmap: reserved=1243136KB +83968KB, committed=227128KB +93752KB)
                            (  Metadata:   )
                            (    reserved=202752KB +83968KB, committed=201016KB +84024KB)
                            (    used=161970KB +53404KB)
                            (    free=39046KB +30620KB)
                            (    waste=0KB =0.00%)
                            (  Class space:)
                            (    reserved=1040384KB, committed=26112KB +9728KB)
                            (    used=18108KB +4931KB)
                            (    free=8005KB +4797KB)
                            (    waste=0KB =0.00%)

-                    Thread (reserved=74326KB -12394KB, committed=8174KB -1194KB)
                            (thread #72 -12)
                            (stack: reserved=73984KB -12336KB, committed=7832KB -1136KB)
                            (malloc=260KB -44KB #434 -72)
                            (arena=82KB -14 #142 -24)

So NO increase in Metaspace now ... this is really really really awesome!!!! ... I will run 20 more runs overnight - to see if that drives this up or not ...

Starting point now: 2940m container memory, with top showing 2.3g reserved for the java process

clemensutschig commented 2 years ago

20 more runs overnight, bringing this to 40+ runs - and still only 2 classloaders ;)

Total: reserved=2800040KB +112862KB, committed=1625608KB +242106KB

-                 Java Heap (reserved=1048576KB, committed=1048576KB)
                            (mmap: reserved=1048576KB, committed=1048576KB)

-                     Class (reserved=1251053KB +87500KB, committed=235045KB +97284KB)
                            (classes #28823 +7591)
                            (  instance classes #27504 +7265, array classes #1319 +326)
                            (malloc=7917KB +3532KB #188616 +82834)
                            (mmap: reserved=1243136KB +83968KB, committed=227128KB +93752KB)
                            (  Metadata:   )
                            (    reserved=202752KB +83968KB, committed=201016KB +84024KB)
                            (    used=163592KB +55026KB)
                            (    free=37424KB +28998KB)
                            (    waste=0KB =0.00%)
                            (  Class space:)
                            (    reserved=1040384KB, committed=26112KB +9728KB)
                            (    used=18102KB +4926KB)
                            (    free=8010KB +4802KB)
                            (    waste=0KB =0.00%)

and only a very mild increase .. (in overall memory)

the java process has grown to 2.5 gigs .. so there is maybe still something to look at ... e.g. file handles or so ..

and now compare this to https://github.com/opendevstack/ods-jenkins-shared-library/issues/857#issuecomment-1073192872 :)

clemensutschig commented 2 years ago

Here is the data that led to the fix... the culprit, as suspected, is the CleanGroovyClassLoader

image

image

image

my best take and guess for now ... is that the shared lib combo of

  1. grape loading (the static cache as shown above)
  2. Unirest / Kong usage (dynamic types, enums)
  3. PDF conversion (introducing this formatter thing)

together with the shared lib architecture is causing this ...

The main piece of the fix (cleanup) within the RM is here https://github.com/opendevstack/ods-jenkins-shared-library/pull/878/files#diff-2392f077f3287c32515e3323fb9de2cf0033d5cea706d1a1e5de20f5a8b3d2fcR128

Open topics:

  1. what do we do in the odsComponentPipeline-only case - after the cleanup you cannot run other stages anymore ... maybe introduce a flag?
  2. cleanup :) it's a hack in progress
  3. find the other *6! megabytes we lost :)
  4. check why the container memory is still going up .. although I am less worried now - see https://blog.freshtracks.io/a-deep-dive-into-kubernetes-metrics-part-3-container-resource-metrics-361c5ee46e66
  5. do we need the special grape loading change I did (vs. init.groovy.d) - https://github.com/opendevstack/ods-jenkins-shared-library/issues/857#issuecomment-1068347327
  6. more testing, run another 100+ EDP RM jobs - can we now bring this beast out of memory (internal project: testbug-cd), or are we really through (we have never had 48 runs w/o going OOM)
  7. Change the failing test in testbug-thefirst to work - and run successful pipelines to see if the mem leak shows up again
  8. Change the durability - see below - back to full ... (although I do NOT think we need it ... )
clemensutschig commented 2 years ago

@oalyman @michaelsauter @stitakis @metmajer

clemensutschig commented 2 years ago

for the sake of completeness:

jvm settings: -server -XX:NativeMemoryTracking=summary -XX:MaxRAMPercentage=90 -XX:+UseG1GC -XX:+ExplicitGCInvokesConcurrent -XX:+ParallelRefProcEnabled -XX:+UseStringDeduplication -XX:MaxMetaspaceSize=1g -XX:MetaspaceSize=256M -XX:SoftRefLRUPolicyMSPerMB=1 and -Xms1024m -Xmx1g

jenkins version: Jenkins 2.319.2

      resources:
        limits:
          cpu: '1'
          memory: 4Gi
        requests:
          cpu: 100m
          memory: 2Gi

pipeline build config - I only keep 2 failed builds (as the test case is failing because of a failing test)

spec:
  failedBuildsHistoryLimit: 2
  successfulBuildsHistoryLimit: 50

and lastly - I switched the pipeline to use the less durable setting instead of the default max durable: image

I guess this is a HUGE Jenkins bug - but I have not yet found a way to really repro it outside our (complex) setup. See odsLeakingStage in the PR ...

maybe that's also something to try ... with the notes above to repro it in the first place - I simply ran out of time

clemensutschig commented 2 years ago

I started now 20 more runs (on top of the 44+ already in the system), baseline on memory is above, the rest below:

open file handles

sh-4.2$ lsof -p $(jps | cut -d ' ' -f1) | wc -l
678

total classloaders

sh-4.2$ jcmd $(jps | cut -d ' ' -f1) VM.classloader_stats
Total = 2657                                                 28981  202238976  186931528  
ChunkSz: Total size of all allocated metaspace chunks
BlockSz: Total size of all allocated metaspace blocks (each chunk has several blocks)

top

   441 1006450+  20   0 7248476   2.5g  25268 S   8.7  4.0  76:00.04 java               

Container memory shows 3110 meg (for whatever it's worth :))

jafarre-bi commented 2 years ago
jcmd $(jps | cut -d ' ' -f1) VM.classloader_stats | grep Clean
0x0000000840d9b288  0x0000000840d98690  0x00007f725401c010      26     974848     937352  org.jenkinsci.plugins.workflow.cps.CpsGroovyShell$CleanGroovyClassLoader
0x0000000840d9b288  0x0000000840d98690  0x00007f7328a33990    2072   17223680   17193832  org.jenkinsci.plugins.workflow.cps.CpsGroovyShell$CleanGroovyClassLoader
Total = 2436                                                 28860  201551872  186234384
lsof -p $(jps | cut -d ' ' -f1) | wc -l
699
top
441 1006450+  20   0 7140144   2.5g  24648 S   0.3  3.9  69:37.13 java
jafarre-bi commented 2 years ago
jcmd $(jps | cut -d ' ' -f1) VM.native_memory summary
441:

Native Memory Tracking:

Total: reserved=2803043KB, committed=1631671KB
-                 Java Heap (reserved=1048576KB, committed=1048576KB)
                            (mmap: reserved=1048576KB, committed=1048576KB) 

-                     Class (reserved=1253226KB, committed=239010KB)
                            (classes #28861)
                            (  instance classes #27538, array classes #1323)
                            (malloc=8042KB #191715) 
                            (mmap: reserved=1245184KB, committed=230968KB) 
                            (  Metadata:   )
                            (    reserved=204800KB, committed=204088KB)
                            (    used=163743KB)
                            (    free=40345KB)
                            (    waste=0KB =0.00%)
                            (  Class space:)
                            (    reserved=1040384KB, committed=26880KB)
                            (    used=18126KB)
                            (    free=8754KB)
                            (    waste=0KB =0.00%)

-                    Thread (reserved=74326KB, committed=8566KB)
                            (thread #72)
                            (stack: reserved=73984KB, committed=8224KB)
                            (malloc=260KB #434) 
                            (arena=82KB #142)

-                      Code (reserved=260467KB, committed=169071KB)
                            (malloc=12779KB #39910) 
                            (mmap: reserved=247688KB, committed=156292KB) 

-                        GC (reserved=98260KB, committed=98260KB)
                            (malloc=26440KB #316578) 
                            (mmap: reserved=71820KB, committed=71820KB) 

-                  Compiler (reserved=830KB, committed=830KB)
                            (malloc=698KB #2943) 
                            (arena=133KB #5)

-                  Internal (reserved=4282KB, committed=4282KB)
                            (malloc=4250KB #5224) 
                            (mmap: reserved=32KB, committed=32KB) 

-                     Other (reserved=459KB, committed=459KB)
                            (malloc=459KB #80) 

-                    Symbol (reserved=27101KB, committed=27101KB)
                            (malloc=24855KB #401638) 
                            (arena=2246KB #1)

-    Native Memory Tracking (reserved=15255KB, committed=15255KB)
                            (malloc=33KB #441) 
                            (tracking overhead=15221KB)

-        Shared class space (reserved=17024KB, committed=17024KB)
                            (mmap: reserved=17024KB, committed=17024KB) 

-               Arena Chunk (reserved=179KB, committed=179KB)
                            (malloc=179KB) 

-                   Logging (reserved=4KB, committed=4KB)
                            (malloc=4KB #192) 

-                 Arguments (reserved=19KB, committed=19KB)
                            (malloc=19KB #519) 

-                    Module (reserved=2532KB, committed=2532KB)
                            (malloc=2532KB #10123) 

-              Synchronizer (reserved=495KB, committed=495KB)
                            (malloc=495KB #4141) 

-                 Safepoint (reserved=8KB, committed=8KB)
                            (mmap: reserved=8KB, committed=8KB)
clemensutschig commented 2 years ago

Super!!!! The only thing - we lose file handles - which is clearly shown - and this we need to track down - 678 vs 699 over the 20 runs!!!
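A sketch for tracking the file handle growth down, grouping the lsof output by descriptor type and diffing snapshots taken before and after a batch of runs (the snapshot file names are assumptions):

    PID=$(jps | grep -v Jps | cut -d ' ' -f1)
    # counts per descriptor type (REG, sock, FIFO, ...)
    lsof -p "$PID" | awk 'NR>1 {print $5}' | sort | uniq -c | sort -rn
    # take /tmp/fd-before.txt the same way before the runs, then after the runs:
    lsof -p "$PID" | awk 'NR>1 {print $5, $NF}' | sort > /tmp/fd-after.txt
    diff /tmp/fd-before.txt /tmp/fd-after.txt | grep '^>' | sort | uniq -c | sort -rn | head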

jafarre-bi commented 2 years ago

I created a new baseline and launched runs in batches of 20, up to 80 runs. This is a summary of the results:

|  | Baseline | 20 runs | 40 runs | 60 runs | 80 runs |
| --- | --- | --- | --- | --- | --- |
| Clean Classloader instance # | 0 | 8 | 2 | 2 | 2 |
| Total Classloader instance # | 588 | 2924 | 2319 | 2325 | 2422 |
| Classes KB reserved/committed | 1150407/123207 | +101019/+113619 | +100659/+113259 | +100721/+113321 | +100857/+113457 |
| Classes # | 18817 | +10630 | +9926 | +9967 | +10025 |
| File handles | 660 | 696 | 693 | 674 | 698 |

No signs of leaks anymore, neither metaspace nor file handles.

At the end of the first 20 runs, 8 instances of the CleanGroovyClassloader were left. I couldn't reproduce it again. I suspect I just happened to look right after those runs finished, and the loaders may take some time to be freed by the GC.

The last 20 runs took very long. I had to kill one of the runs, because it hung. I couldn't find the reason.

Seeing this data, my conclusion is that the cleanup code works fine, though it accesses private fields and internal APIs, and this will probably cause problems when upgrading in the future. I haven't found any explanation yet for why the weakly-reachable classloaders are not cleared before an OOME is thrown. Changing the garbage collector does not help.

clemensutschig commented 2 years ago

@jafarre-viewnext @metmajer what do we do now?

metmajer commented 2 years ago

@s2oBCN please include all changes in this PR in your upcoming refactoring tests next week involving B*ASS data and document the numbers, thanks! The above results appear to be stable for most metrics. A great improvement over what we had. Thanks so much!

victorpablosceruelo commented 2 years ago

@braisvq1996, you closed this issue and I do not know if it can be closed (yet). We have https://github.com/opendevstack/ods-core/pull/1161 (still a draft), which @petod108 is testing with the end-to-end tests. We are still waiting to merge it, and there are important comments here from @clemensutschig that we cannot forget about.