Closed J-N-K closed 5 years ago
The most recent openHAB2-Bundles build doesn't seem to have this issue.
The PR builds seem to use an Artifactory server running on the Jenkins machine for resolving artifacts (probably to cache artifacts to speed up builds). Though it looks like the artifact is also available in that repository:
https://ci.openhab.org/artifactory/libs-release/org/openhab/nrjavaserial/3.15.0.OH2/
We did recently upload a POM for nrjavaserial (https://github.com/openhab/openhab-distro/issues/947) which may have caused some resolution issue.
Maybe @kaikreuzer has seen similar issues before with this intricate setup? :thinking:
I have no clue what could cause this, especially as there are also exceptions that build successfully, like https://ci.openhab.org/job/PR-openHAB2-Addons/14750/.
> Especially, as there are also exceptions
That Velux Binding PR (#2531) seems to have a workaround commit (https://github.com/openhab/openhab2-addons/commit/509d4061da2021548e92d0ce63315464716648a1) to get it to build successfully. The workaround commit removes the bluegiga and modbus bundles from the Maven reactor.
The Jenkins PR builds seem to work again. Did someone create a fix or apply a workaround?
I removed the SAT execution (by adding `-DskipChecks=true`) as the build exited with error 137, which is an OOM kill of the VM (not the JVM).
I preferred to have them green again, but losing the SAT checks clearly can only be a temporary measure...
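For reference, a minimal sketch of what such a build invocation might look like with the SAT execution disabled. The heap settings are the ones quoted later in this thread; the `clean install` goal list is an assumption, not the actual job configuration:

```shell
# Heap settings as used by the PR build jobs (per this thread).
export MAVEN_OPTS="-Xms512m -Xmx2048m"

# -DskipChecks=true is the property mentioned above that disables the
# static-code-analysis (SAT) execution; the goal list is an assumption.
mvn clean install -DskipChecks=true
```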
Thanks for looking into it! Are you going to give the VM a bit more memory now so it can run SAT again?
It will probably still take a while to build everything with SAT enabled:
It could be improved by using faster CPUs or a parallel Maven build. I read that parallel builds with bndtools aren't safe (https://github.com/bndtools/bnd/issues/2801). But maybe we can improve SAT so it does work with parallel builds (https://github.com/openhab/static-code-analysis/issues/200). Maybe it can then build the code sequentially and run SAT in parallel afterwards. Running SAT in parallel still only makes sense if the VM has enough CPUs and memory.
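A parallel Maven build would normally be enabled with the standard `-T` option, sketched below. As noted above, this may not be safe with bndtools, so treat it as an illustration of the option rather than a recommended setting:

```shell
# One Maven build thread per available CPU core (standard -T syntax).
mvn clean install -T 1C

# Alternatively, a fixed number of threads:
mvn clean install -T 4
```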
> Are you going to give the VM a bit more memory now
That's what I didn't manage to do - maybe you have some advice:
Our server has 8 GB physical RAM (which should imho be fine to run at least 2 build jobs in parallel, as they use an `-Xmx2048m` setting).
The Docker container is started with `-m 8g --memory-reservation=6g`, as otherwise it started growing and consumed swap as memory, which made everything terribly slow.
Note that Artifactory is running on the same machine in a second Docker container (with `docker -m 1532m`).
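Putting the flags from this thread together, the two containers are presumably started along these lines. The image names, container names, and any other options are placeholders; only the memory flags are taken from this thread:

```shell
# Jenkins: hard limit 8 GB, soft reservation 6 GB (flags quoted above).
# Image name is a placeholder for whatever image the server actually runs.
docker run -d --name jenkins -m 8g --memory-reservation=6g jenkins/jenkins:lts

# Artifactory in a second container with a 1532 MB hard limit.
docker run -d --name artifactory -m 1532m artifactory-oss:latest
```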
Do you have any idea how to change the configuration in a way that we won't see an exit code 137 anymore?
If I look at a memory usage graph of another Jenkins instance (also running in a container), Jenkins itself (without any running builds) is sometimes consuming 1.8 GB of memory. So if the openHAB instance has the same memory usage and there are two running builds it would already consume 5.8 GB of memory.
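The back-of-the-envelope estimate above can be checked with a quick calculation. The per-process figures are the ones quoted in this thread; real usage would add JVM off-heap overhead on top:

```shell
JENKINS_IDLE_MB=1800      # idle Jenkins controller, observed on another instance
BUILD_HEAP_MB=2048        # -Xmx2048m per Maven build JVM
CONCURRENT_BUILDS=2       # two builds running in parallel

# Heap alone, ignoring JVM off-heap overhead and Artifactory.
TOTAL_MB=$((JENKINS_IDLE_MB + CONCURRENT_BUILDS * BUILD_HEAP_MB))
echo "${TOTAL_MB} MB"     # 5896 MB, i.e. roughly the 5.8 GB estimated above
```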
There could also be a memory leak in Jenkins itself. I encountered one a few years ago and had to restart Jenkins every couple of days. After I grew tired of that I installed the Monitoring plugin to find the root cause. The memory graph clearly showed memory was only increasing and after creating a heap dump (with the same plugin) the root cause was easily found.
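Outside the Monitoring plugin, a heap dump of a suspect JVM can also be taken with the standard JDK tools, sketched below; `<pid>` stands for the Jenkins process id and the dump path is a placeholder:

```shell
# List running JVMs to find the Jenkins process id.
jps -l

# Dump live objects to a file for offline analysis (JDK jmap tool).
jmap -dump:live,format=b,file=/tmp/jenkins-heap.hprof <pid>

# On newer JDKs, jcmd offers the same:
jcmd <pid> GC.heap_dump /tmp/jenkins-heap.hprof
```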
That's a good idea, thanks - I have installed the plugin, so let's gather some statistics: https://ci.openhab.org/monitoring
Shall we close this issue or rename it? The original problem seems to be solved.
Let's close it.
> Do you have any idea how to change the configuration in a way that we won't see an exit code 137 anymore?
According to the swap space graph, Jenkins has been consuming a lot of swap space since last Thursday @kaikreuzer. The processes list shows there is a "zombie" build that is still hanging around and which was started on that Thursday (Aug 22nd).
Looks like the zombie build is caused by an NPE in Jenkins which occurred in OH2 Addons PR build 14909.
The console logging shows the same command:
```
[PR-openHAB2-Addons] $ /var/jenkins_home/tools/hudson.model.JDK/Oracle_JDK_1.8_latest_/bin/java -Xms512m -Xmx2048m -Djava.awt.headless=true -cp /var/jenkins_home/plugins/maven-plugin/WEB-INF/lib/maven35-agent-1.13.jar:/var/jenkins_home/tools/hudson.tasks.Maven_MavenInstallation/Maven_3.6.1/boot/plexus-classworlds-2.6.0.jar:/var/jenkins_home/tools/hudson.tasks.Maven_MavenInstallation/Maven_3.6.1/conf/logging jenkins.maven3.agent.Maven35Main /var/jenkins_home/tools/hudson.tasks.Maven_MavenInstallation/Maven_3.6.1 /var/jenkins_home/war/WEB-INF/lib/remoting-3.29.jar /var/jenkins_home/plugins/maven-plugin/WEB-INF/lib/maven35-interceptor-1.13.jar /var/jenkins_home/plugins/maven-plugin/WEB-INF/lib/maven3-interceptor-commons-1.13.jar 38393
```
At the end of the build the following NPE occurs:
```
project.getRootModule()=hudson.maven.MavenModule@5a6caa8[PR-openHAB2-Addons/org.openhab.addons:org.openhab.addons.reactor][PR-openHAB2-Addons/org.openhab.addons:org.openhab.addons.reactor][relativePath:]
FATAL: null
java.lang.NullPointerException
	at hudson.maven.AbstractMavenBuilder.end(AbstractMavenBuilder.java:101)
	at hudson.maven.MavenModuleSetBuild$MavenModuleSetBuildExecution.doRun(MavenModuleSetBuild.java:885)
	at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:504)
	at hudson.model.Run.execute(Run.java:1818)
	at hudson.maven.MavenModuleSetBuild.run(MavenModuleSetBuild.java:543)
	at hudson.model.ResourceController.execute(ResourceController.java:97)
	at hudson.model.Executor.run(Executor.java:429)
```
According to the line number, the NPE occurs while iterating over the MavenReporters.
The following open Jenkins issues report the same issue/stacktrace:
Thanks for your investigations, that's indeed pretty interesting. Unfortunately the two Jenkins issues are already pretty old, so there's hardly any chance for a fix. They seem to talk about issues when doing parallel builds, and I noticed that we had the option "Parallele Builds ausführen, wenn notwendig" ("Execute concurrent builds if necessary") enabled. I am not sure whether this is related, but I have deactivated it now; let's see if it makes any difference. I also restarted the Docker container, so that the zombie process is gone again, and re-enabled the SAT checks for the PR builds - let's see what happens 🙏.
It looks like while build 14909 was running, build 14910 was started in parallel. Build 14910 was the first build for https://github.com/openhab/openhab2-addons/pull/1334, where the new `org.openhab.binding.gpio` Maven module was added due to adapting the PR to bnd.
This module also shows up in the list of modules before the NPE is logged in the console output of build 14909. So a build might pick up modules from newer builds for which no MavenReporter was registered, causing the NPE.
So disabling the parallel builds setting on all jobs will probably work around the issue, but it also limits the number of concurrent builds per job to 1.
We might be able to reproduce the issue by triggering a build and immediately trigger another build for the same job that adds a new Maven module.
Looks like there's again a build that has been stuck since Aug 30, eating swap space and causing builds to fail with "exit code 137" @kaikreuzer. See the processes.
Indeed, weird. Not sure what we can do about it, as I have reduced Jenkins to a single node, so it cannot ever do parallel executions... Well, restarting helped again.
Currently all Jenkins builds fail because nrjavaserial can't be found.
See e.g.: https://ci.openhab.org/job/PR-openHAB2-Addons/14714/console