timja / jenkins-gh-issues-poc-06-18

0 stars 0 forks source link

[JENKINS-63840] Extreme CPU utilization leading to bad performance and failing jobs #1228

Open timja opened 4 years ago

timja commented 4 years ago

Jenkins consumes vast amounts of CPU resources. E.g. on one machine during 103 days uptime htop reports 7387 CPU hours, i.e. in the mean 3 out of 8 cores are 100% utilzed during uptime. Actually it seems to be worse:

  1  [||||||||||||||||||||||||||||||||||||||||||||   71.7%]   5  [|||||||||||||||||||||||||||||||||||||||||||||||||||    84.0%]
  2  [|||||||||||||||||||||||||||||||||||||||||||||||||      79.6%]   6  [|||||||||||||||||||||||||||||||||||||||||||    69.4%]
  3  [|||||||||||||||||||||||||||||||||||||||||||||||76.6%]   7  [|||||||||||||||||||||||||||||||||||||||||||||||||||||| 88.7%]
  4  [||||||||||||||||||||||||||||||||||||||||||||||||||||   85.2%]   8  [|||||||||||||||||||||||||||||||||||||||||||||||||||    82.4%]
  Mem[|||||||||||||||||||||||||||||||||||||||||||||||||4.03G/7.79G]   Tasks: 48; 6 running
  Swp[| 253M/16.0G]   Load average: 8.30 8.02 8.28
      Uptime: 103 days(!), 20:54:02

603 jenkins    20   0 7779M 2524M  5564 S 421. 31.6    7387h ├─ /java/jdk8/bin/java -XX:ErrorFile=/.mntpts/sdb1/jenkins/hs_err_pidjenkins...

A possible cause of this might be {{AsyncResourceDisposer.worker}}s that run into errors due to insufficient file permissions for the user (jenkins) that runs the Jenkins daemon. It seems that these workers are unable to recover and skip the offending files and directories.

In our specific case the files have been created by Docker containers, but it may have other causes as well, so it's probably a quite common source of problems.

As a sidenote, I did not find the error in Jenkin's log files, which made it difficult to track down possible reasons for the bad Jenkins performance and Jobs that failed for no good reasons.

Name: AsyncResourceDisposer.worker [#9]
State: RUNNABLE
Total blocked: 33  Total waited: 11.060

Stack trace: 
java.lang.Throwable.fillInStackTrace(Native Method)
java.lang.Throwable.fillInStackTrace(Throwable.java:783)
   - locked java.nio.file.AccessDeniedException@45ff02ae
java.lang.Throwable.(Throwable.java:265)
java.lang.Exception.(Exception.java:66)
java.io.IOException.(IOException.java:58)
java.nio.file.FileSystemException.(FileSystemException.java:73)
java.nio.file.AccessDeniedException.(AccessDeniedException.java:66)
sun.nio.fs.UnixException.translateToIOException(UnixException.java:84)
sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
sun.nio.fs.UnixFileSystemProvider.implDelete(UnixFileSystemProvider.java:244)
sun.nio.fs.AbstractFileSystemProvider.deleteIfExists(AbstractFileSystemProvider.java:108)
java.nio.file.Files.deleteIfExists(Files.java:1165)
jenkins.util.io.PathRemover.removeOrMakeRemovableThenRemove(PathRemover.java:237)
jenkins.util.io.PathRemover.tryRemoveFile(PathRemover.java:205)
jenkins.util.io.PathRemover.tryRemoveRecursive(PathRemover.java:216)
jenkins.util.io.PathRemover.tryRemoveDirectoryContents(PathRemover.java:226)
jenkins.util.io.PathRemover.tryRemoveRecursive(PathRemover.java:215)
jenkins.util.io.PathRemover.tryRemoveDirectoryContents(PathRemover.java:226)
jenkins.util.io.PathRemover.tryRemoveRecursive(PathRemover.java:215)
jenkins.util.io.PathRemover.tryRemoveDirectoryContents(PathRemover.java:226)
jenkins.util.io.PathRemover.tryRemoveRecursive(PathRemover.java:215)
jenkins.util.io.PathRemover.tryRemoveDirectoryContents(PathRemover.java:226)
jenkins.util.io.PathRemover.tryRemoveRecursive(PathRemover.java:215)
jenkins.util.io.PathRemover.tryRemoveDirectoryContents(PathRemover.java:226)
jenkins.util.io.PathRemover.tryRemoveRecursive(PathRemover.java:215)
jenkins.util.io.PathRemover.tryRemoveDirectoryContents(PathRemover.java:226)
jenkins.util.io.PathRemover.tryRemoveRecursive(PathRemover.java:215)
jenkins.util.io.PathRemover.tryRemoveDirectoryContents(PathRemover.java:226)
jenkins.util.io.PathRemover.tryRemoveRecursive(PathRemover.java:215)
jenkins.util.io.PathRemover.tryRemoveDirectoryContents(PathRemover.java:226)
jenkins.util.io.PathRemover.tryRemoveRecursive(PathRemover.java:215)
jenkins.util.io.PathRemover.tryRemoveDirectoryContents(PathRemover.java:226)
jenkins.util.io.PathRemover.tryRemoveRecursive(PathRemover.java:215)
jenkins.util.io.PathRemover.tryRemoveDirectoryContents(PathRemover.java:226)
jenkins.util.io.PathRemover.tryRemoveRecursive(PathRemover.java:215)
jenkins.util.io.PathRemover.tryRemoveDirectoryContents(PathRemover.java:226)
jenkins.util.io.PathRemover.tryRemoveRecursive(PathRemover.java:215)
jenkins.util.io.PathRemover.forceRemoveRecursive(PathRemover.java:96)
hudson.Util.deleteRecursive(Util.java:296)
hudson.FilePath$DeleteRecursive.invoke(FilePath.java:1310)
hudson.FilePath$DeleteRecursive.invoke(FilePath.java:1306)
hudson.FilePath.act(FilePath.java:1076)
hudson.FilePath.act(FilePath.java:1059)
hudson.FilePath.deleteRecursive(FilePath.java:1304)
hudson.plugins.ws_cleanup.Wipeout.performDelete(Wipeout.java:82)
hudson.plugins.ws_cleanup.Wipeout$DisposableImpl.dispose(Wipeout.java:114)
org.jenkinsci.plugins.resourcedisposer.AsyncResourceDisposer$WorkItem.run(AsyncResourceDisposer.java:283)
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
java.util.concurrent.FutureTask.run(FutureTask.java:266)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
java.lang.Thread.run(Thread.java:748)

Originally reported by alexanderv, imported from: Extreme CPU utilization leading to bad performance and failing jobs
  • status: Open
  • priority: Major
  • resolution: Unresolved
  • imported: 2022/01/10
timja commented 4 years ago

oleg_nenashev:

Hi thanks for the report!

> A possible cause of this might be {{AsyncResourceDisposer.worker}}s that run into errors due to insufficient file permissions for the user (jenkins) that runs the Jenkins daemon. It seems that these workers are unable to recover and skip the offending files and directories.

It is plausible, but it needs more investigation and analysis. One particular stack trace might be a red herring. If you have a test instance where you could reproduce the behavior, connecting a profiler would be a good step to collect the details (e.g. Java Flight Recorded). Tools like jvmtop or Monitoring plugin could also help to collect some per-thread diagnostics information from the instance.

 

> As a sidenote, I did not find the error in Jenkin's log files, which made it difficult to track down possible reasons

it is possible to go through the code in the stacktrace and discover what exactly suppresses it