Open timja opened 6 years ago
In all provided stack traces the fingerprinting I/O comes from the Maven plugin, which generates heavy fingerprinting load.
I will reassign the issue to the plugin maintainer; so far I do not see an issue in the core implementation (although there are some discussions about changing the fingerprint storage in the long term).
java.lang.Thread.State: BLOCKED (on object monitor)
    at hudson.model.Fingerprint.add(Fingerprint.java:1014)
    - waiting to lock <0x00007f6124c61050> (a hudson.model.Fingerprint)
    at hudson.maven.reporters.MavenFingerprinter$1.call(MavenFingerprinter.java:125)
    at hudson.maven.reporters.MavenFingerprinter$1.call(MavenFingerprinter.java:112)
    at hudson.maven.MavenBuild$ProxyImpl.execute(MavenBuild.java:452)
    at hudson.maven.MavenBuild$ProxyImpl2.execute(MavenBuild.java:584)
    at sun.reflect.GeneratedMethodAccessor1033.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at hudson.model.Executor$2.call(Executor.java:911)
    at hudson.util.InterceptingProxy$1.invoke(InterceptingProxy.java:23)
    at com.sun.proxy.$Proxy117.execute(Unknown Source)
    at hudson.maven.MavenBuildProxy$Filter$AsyncInvoker.call(MavenBuildProxy.java:294)
    at hudson.remoting.UserRequest.perform(UserRequest.java:210)
    at hudson.remoting.UserRequest.perform(UserRequest.java:53)
    at hudson.remoting.Request$2.run(Request.java:364)
    at hudson.remoting.InterceptingExecutorService$1.call(InterceptingExecutorService.java:72)
    at org.jenkinsci.remoting.CallableDecorator.call(CallableDecorator.java:19)
    at hudson.remoting.CallableDecoratorList$1.call(CallableDecoratorList.java:21)
    at jenkins.util.ContextResettingExecutorService$2.call(ContextResettingExecutorService.java:46)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
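For context, here is a hedged, simplified model of the contention pattern visible in that dump (it is not the actual hudson.model.Fingerprint code, and the class and field names are made up for illustration): every build that references the same artifact synchronizes on the same record and persists it to disk while holding the lock, so one slow save stalls all the others.

import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.Map;
import java.util.TreeMap;

// Simplified, hypothetical model of the contention in the stack trace above:
// many executors record usages against the same fingerprint, and each call
// persists the record to disk while holding the object's monitor.
class FingerprintRecordSketch {
    private final File storage;
    private final Map<String, Integer> usages = new TreeMap<>();

    FingerprintRecordSketch(File storage) {
        this.storage = storage;
    }

    // All callers serialize on this monitor. If save() is slow (for example
    // because of a forced flush to disk), every other build referencing the
    // same artifact blocks here, matching the BLOCKED threads above.
    synchronized void add(String jobName, int buildNumber) throws IOException {
        usages.merge(jobName, buildNumber, Math::max);
        save();
    }

    private void save() throws IOException {
        Files.write(storage.toPath(), usages.toString().getBytes(StandardCharsets.UTF_8));
    }
}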
Not sure what we can do here
We can probably diagnose it further, but I am not sure why it would impact only a few users.
Fingerprints certainly consume a lot of I/O, and they are activated by default in Maven jobs.
Maybe it only impacts big instances, but I am not sure which change in core or in the Maven plugin could have made this problem worse.
capf:
We're also suffering from this since we updated one Jenkins instance from 2.89.4 to 2.121.2.
It looks like the I/O pattern for writing a single fingerprint could be improved quite a bit, but this is probably not the only cause for the performance degradation.
[pid 8789] stat("/var/jenkins_home/fingerprints/bf/e4", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
[pid 8789] access("/var/jenkins_home/fingerprints/bf/e4", F_OK) = 0
[pid 8789] stat("/var/jenkins_home/fingerprints/bf/e4", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
[pid 8789] lstat("/var/jenkins_home/fingerprints/bf/e4", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
[pid 8789] mkdir("/var/jenkins_home/fingerprints/bf/e4", 0777) = -1 EEXIST (File exists)
[pid 8789] lstat("/var/jenkins_home/fingerprints/bf/e4", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
[pid 8789] stat("/var/jenkins_home/fingerprints/bf/e4/atomic8576544207056275611tmp", 0x7fe62a3e3f00) = -1 ENOENT (No such file or directory)
[pid 8789] open("/var/jenkins_home/fingerprints/bf/e4/atomic8576544207056275611tmp", O_RDWR|O_CREAT|O_EXCL, 0666) = 5
[pid 8789] open("/var/jenkins_home/fingerprints/bf/e4/atomic8576544207056275611tmp", O_WRONLY) = 5
[pid 8789] rename("/var/jenkins_home/fingerprints/bf/e4/atomic8576544207056275611tmp", "/var/jenkins_home/fingerprints/bf/e4/c7f5648f164d878a47abeebea260.xml") = 0
[pid 8789] lstat("/var/jenkins_home/fingerprints/bf/e4/atomic8576544207056275611tmp", 0x7fe62a3e4020) = -1 ENOENT (No such file or directory)
The problem really only shows up when you have lots of fingerprints. In our case, this is around 330k, stored in 6.5k directories under the fingerprints directory.
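For reference, the stat/mkdir/open/rename sequence in that trace is the classic write-to-a-temp-file-then-rename pattern. A minimal sketch of it in plain NIO (my own illustration, not the actual AtomicFileWriter code; the path is made up):

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

// Minimal sketch of the temp-file-plus-rename pattern visible in the strace
// output: ensure the parent directory exists, write into a temporary file in
// the same directory, then atomically rename it over the final name.
public class AtomicWriteSketch {
    static void writeAtomically(Path target, String content) throws IOException {
        Files.createDirectories(target.getParent());                          // stat/mkdir
        Path tmp = Files.createTempFile(target.getParent(), "atomic", "tmp"); // open(O_CREAT|O_EXCL)
        try {
            Files.write(tmp, content.getBytes(StandardCharsets.UTF_8));       // open(O_WRONLY), write
            Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE);          // rename
        } finally {
            Files.deleteIfExists(tmp);                                        // cleanup; a no-op after a successful rename
        }
    }

    public static void main(String[] args) throws IOException {
        writeAtomically(Paths.get("/tmp/fingerprints/bf/e4/c7f5648f164d878a47abeebea260.xml"),
                        "<fingerprint/>");
    }
}

Every fingerprint save repeats this whole sequence, which is why the cost grows with the number of fingerprints written per build.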
capf:
The real cause is the use of AtomicFileWriter in https://github.com/jenkinsci/jenkins/blob/jenkins-2.121.2/core/src/main/java/hudson/model/Fingerprint.java#L1256
The backing FileChannelWriter has an option to force flush (FileChannel.force(true)) and AtomicFileWriter does this by default.
Luckily there is a system property that you can set in order to revert to the previous, less safe but better-performing behavior:
-Dhudson.util.AtomicFileWriter.DISABLE_FORCED_FLUSH=true
Setting this brings back the previous performance.
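To make the cost concrete, here is a hedged sketch of a write path with an optional forced flush (my own illustration, not the actual hudson.util.FileChannelWriter code): with the force enabled every save waits for the data to reach the disk, and the system property above effectively turns that call off.

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Sketch of a writer with an optional forced flush. With forceOnClose = true
// every save blocks until the bytes are on disk, which is safe but expensive
// when hundreds of thousands of small fingerprint files are rewritten; with
// false, the OS page cache absorbs the writes.
public class FlushingWriterSketch {
    static void write(Path file, String content, boolean forceOnClose) throws IOException {
        try (FileChannel channel = FileChannel.open(file,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE,
                StandardOpenOption.TRUNCATE_EXISTING)) {
            channel.write(ByteBuffer.wrap(content.getBytes(StandardCharsets.UTF_8)));
            if (forceOnClose) {
                channel.force(true); // block until file data and metadata hit the disk
            }
        }
    }
}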
capf:
So this is rather unrelated to the Maven plugin and more of a Jenkins core issue. The Maven plugin only surfaces the problem because it creates fingerprints by default.
We disabled automatic fingerprint archiving for all Maven jobs and it worked. I think the priority can be changed to Critical, because we have a simple workaround.
We'll try to use -Dhudson.util.AtomicFileWriter.DISABLE_FORCED_FLUSH=true, but I think the problem should be fixed, because this behaviour of Jenkins is a real issue for large instances.
capf:
One option would be to call FileChannel.force(false) instead of FileChannel.force(true) for the fingerprints in https://github.com/jenkinsci/jenkins/blame/jenkins-2.121.2/core/src/main/java/hudson/util/FileChannelWriter.java#L89 in order not to force-write all metadata. The manpage of fdatasync on Linux says: "The aim of fdatasync() is to reduce disk activity for applications that do not require all metadata to be synchronized with the disk.".
Not sure how much this would help, though.
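For illustration, the proposed change is just the boolean passed to FileChannel.force; a minimal comparison of the two variants (the fsync/fdatasync mapping is how it typically behaves on Linux, not something the Java API guarantees):

import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// force(true) also flushes file metadata (roughly fsync on Linux), while
// force(false) only guarantees the file content is written (roughly
// fdatasync), which can reduce journal/metadata traffic on each save.
public class ForceComparisonSketch {
    static void sync(Path file, boolean includeMetadata) throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.WRITE)) {
            channel.force(includeMetadata);
        }
    }
}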
I'm not quite sure if this is a related problem:
14 of 14 build jobs on our Jenkins 2.107.3 are blocked in Fingerprint.add while collecting data from slaves.
Both the master and the two slaves run on Solaris.
We are using Jenkins v2.121.3 (upgraded a couple of months back from v2.73.3). We have around ~1000 multibranch + ~200 Pipeline + ~200 freestyle projects.
Since the upgrade we are seeing CPU spikes on the master server. We are not running any builds on the master (primarily only multibranch indexing runs on the master box).
Master server: Linux 7.x (100+ CPUs). We have the GitLab* plugins for "SCM - GitLab" integration; GitLab is hosted on a separate Linux server. Can you please provide some tips to debug the issue?
We are seeing the same behavior with v2.180
# time strace -c -f -p 25129
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 64.24  325.772507        2399    135794     17935 futex
 21.78  110.426436      800192       138        61 restart_syscall
  9.17   46.523219       35898      1296           epoll_wait
  3.97   20.126873         972     20705        29 recvfrom
  0.52    2.624930       27631        95           accept
  0.13    0.642494          22     29615           getrusage
  0.05    0.272781          24     11607           write
[pid 25189] futex(0x7fa0814d3a54, FUTEX_WAIT_BITSET_PRIVATE, 1, {372183, 895437760}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
[pid 25189] futex(0x7fa0814d3a28, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 25189] futex(0x7fa0814d3a54, FUTEX_WAIT_BITSET_PRIVATE, 1, {372183, 945599747}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
[pid 25189] futex(0x7fa0814d3a28, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 25189] futex(0x7fa0814d3a54, FUTEX_WAIT_BITSET_PRIVATE, 1, {372183, 995774240}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
[pid 25189] futex(0x7fa0814d3a28, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 25189] futex(0x7fa0814d3a54, FUTEX_WAIT_BITSET_PRIVATE, 1, {372184, 45936909}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
[pid 25189] futex(0x7fa0814d3a28, FUTEX_WAKE_PRIVATE, 1) = 0
# java -jar /usr/lib/jenkins/jenkins.war --version 2.180
This is probably caused by https://github.com/jenkinsci/jenkins/pull/3170 which ensures data integrity.
One possible solution is to not use AtomicFileWriter for fingerprints, but that can cause data loss; it does not seem like a good solution, since on a crash we would definitely lose data.
I think the first question is why we are writing fingerprints, and the same fingerprint, so often, and whether we should avoid the excessive fingerprints.
An improvement was made in the latest weekly, 2.277; those experiencing this issue might want to try it out.
If you still have high CPU load, please provide some thread dumps so we can confirm it's the same issue we saw previously.
Also, are the folks who experience issues with Fingerprint.add using SSDs or plain old HDDs?
We have several heavily loaded Jenkins instances. The biggest of them builds around 30,000 jobs per day (the two others around 6,000).
After upgrading the most heavily loaded instance to Jenkins 2.107.3, we see high CPU sys-time consumption. To users it looks like Jenkins works fast, but jobs build slowly, and the last step, "Waiting for Jenkins to finish collecting data", takes a very long time (very similar to https://issues.jenkins-ci.org/browse/JENKINS-19712).
Moreover, some jobs simply looped at random steps:
After some investigation we found that the processes that consume the most CPU sys time are hanging on the "futex" system call:
In the strace output for such processes we can see something like this:
corresponding Java process:
or this:
Environment: RHEL 7.5 with the latest updates (except the kernel). DRBD+ext4. OpenStack KVM. 32 CPUs, 64 GB RAM.
More diagnostic information is in the attachments (the screenshots and thread dumps were not created at the same time, but at times when we saw the problem).
Originally reported by dkoryavov, imported from: High CPU sys time consumption after upgrade to 2.107.3