openhab / openhab-addons

Add-ons for openHAB
https://www.openhab.org/
Eclipse Public License 2.0

[infrastructure] Jenkins fails because of missing nrjavaserial bundle #5884

Closed · J-N-K closed this issue 5 years ago

J-N-K commented 5 years ago

Currently all Jenkins builds fail because nrjavaserial can't be found.

see e.g.: https://ci.openhab.org/job/PR-openHAB2-Addons/14714/console

wborn commented 5 years ago

The most recent openHAB2-Bundles build doesn't seem to have this issue.

The PR builds seem to use an Artifactory server running on the Jenkins machine for resolving artifacts (probably to cache artifacts to speed up builds). Though it looks like the artifact is also available in that repository:

https://ci.openhab.org/artifactory/libs-release/org/openhab/nrjavaserial/3.15.0.OH2/

We did recently upload a POM for nrjavaserial (https://github.com/openhab/openhab-distro/issues/947) which may have caused some resolution issue.

Maybe @kaikreuzer has seen similar issues before with this intricate setup? 🤔

kaikreuzer commented 5 years ago

I have no clue what could cause this. Especially as there are also exceptions that build successfully, like https://ci.openhab.org/job/PR-openHAB2-Addons/14750/.

wborn commented 5 years ago

Especially as there are also exceptions

That Velux Binding PR (#2531) seems to have a workaround commit (https://github.com/openhab/openhab2-addons/commit/509d4061da2021548e92d0ce63315464716648a1) to get it to build successfully. The workaround commit removes the bluegiga and modbus bundles from the Maven reactor.

wborn commented 5 years ago

The Jenkins PR builds seem to work again. Did someone create a fix or apply a workaround?

kaikreuzer commented 5 years ago

I removed the SAT execution (by adding -DskipChecks=true) as the build exited with error 137, which is an OOM of the VM (not the JVM). I preferred to have the builds green again, but losing the SAT checks can clearly only be a temporary measure...
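
For reference, a minimal sketch of what the PR job's Maven invocation looks like with SAT skipped. Only -DskipChecks=true and the -Xms512m/-Xmx2048m memory settings appear in this thread; the goals and batch-mode flag are assumptions for illustration:

# Sketch only: skip the static code analysis (SAT) checks during the build.
MAVEN_OPTS="-Xms512m -Xmx2048m" mvn clean verify -B -DskipChecks=true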

wborn commented 5 years ago

Thanks for looking into it! Are you going to give the VM a bit more memory now so it can run SAT again?

It will probably still take a while to build everything with SAT enabled:

[screenshot of build times omitted]

It could be improved by using faster CPUs or a parallel Maven build. I read that parallel builds with bndtools aren't safe (https://github.com/bndtools/bnd/issues/2801). But maybe we can improve SAT so it does work with parallel builds (https://github.com/openhab/static-code-analysis/issues/200). Maybe it can then build the code sequentially and run SAT in parallel afterwards. Running SAT in parallel still only makes sense if the VM has enough CPUs and memory.
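
To make the parallel-build idea concrete, this is roughly how a parallel Maven build would be started; whether it is safe for the bnd-based reactor is exactly the open question linked above, so treat it as a sketch, not a recommendation:

# -T 1C runs one builder thread per available CPU core; -T 4 would use a fixed thread count.
mvn clean install -T 1C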

kaikreuzer commented 5 years ago

Are you going to give the VM a bit more memory now

That's what I didn't manage to do - maybe you have some advice:

Our server has 8GB physical RAM (which should imho be fine to run at least 2 build jobs in parallel as they use a -Xmx2048m setting). The docker container is started with -m 8g --memory-reservation=6g as otherwise it started growing and consumed swap as memory, which made everything terribly slow. Note that Artifactory is running on the same machine in a second docker container (with docker -m 1532m).
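
For context, a sketch of the container setup described above; the image names, container names and volume are assumptions, only the memory flags come from this comment:

# Jenkins master, hard-capped at 8g with a 6g soft reservation.
docker run -d --name jenkins -m 8g --memory-reservation=6g -v jenkins_home:/var/jenkins_home jenkins/jenkins:lts

# Artifactory in a second container, capped at 1532m (replace the placeholder with the image actually in use).
docker run -d --name artifactory -m 1532m <artifactory-oss-image>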

Do you have any idea how to change the configuration in a way that we won't see an exit code 137 anymore?

wborn commented 5 years ago

If I look at a memory usage graph of another Jenkins instance (also running in a container), Jenkins itself (without any running builds) is sometimes consuming 1.8 GB of memory. So if the openHAB instance has the same memory usage and there are two running builds it would already consume 5.8 GB of memory.

There could also be a memory leak in Jenkins itself. I encountered one a few years ago and had to restart Jenkins every couple of days. After I grew tired of that I installed the Monitoring plugin to find the root cause. The memory graph clearly showed memory was only increasing and after creating a heap dump (with the same plugin) the root cause was easily found.
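
As an alternative to the Monitoring plugin's heap-dump button, a dump can also be taken with standard JDK tooling. The container name and process id below are assumptions, and jmap is only available if a full JDK is present in the container:

# Dump live objects from the Jenkins master JVM (PID 1 is an assumption) and copy the dump out of the container.
docker exec jenkins jmap -dump:live,format=b,file=/tmp/jenkins-heap.hprof 1
docker cp jenkins:/tmp/jenkins-heap.hprof .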

kaikreuzer commented 5 years ago

That's a good idea, thanks - I have installed the plugin, so let's gather some statistics: https://ci.openhab.org/monitoring

J-N-K commented 5 years ago

Shall we close this issue or rename it? The original problem seems to be solved.

kaikreuzer commented 5 years ago

Let's close it.

wborn commented 5 years ago

Do you have any idea how to change the configuration in a way that we won't see an exit code 137 anymore?

According to the swap space graph, Jenkins started consuming a lot of swap space last Thursday @kaikreuzer. The processes list shows there is a "zombie" build that is still hanging around and which was started on that Thursday (Aug 22nd).
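
For anyone checking this on the host, the usual tooling is enough to spot the hung build and the swap pressure (nothing openHAB-specific, just standard commands):

free -h                           # overall RAM and swap usage
docker stats --no-stream          # per-container memory consumption
ps aux --sort=-%mem | head -n 15  # long-running Maven/Java build processes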

wborn commented 5 years ago

Looks like the zombie build is caused by an NPE in Jenkins which occurred in OH2 Addons PR build 14909.

The console logging shows the same command:

[PR-openHAB2-Addons] $ /var/jenkins_home/tools/hudson.model.JDK/Oracle_JDK_1.8_latest_/bin/java -Xms512m -Xmx2048m -Djava.awt.headless=true -cp /var/jenkins_home/plugins/maven-plugin/WEB-INF/lib/maven35-agent-1.13.jar:/var/jenkins_home/tools/hudson.tasks.Maven_MavenInstallation/Maven_3.6.1/boot/plexus-classworlds-2.6.0.jar:/var/jenkins_home/tools/hudson.tasks.Maven_MavenInstallation/Maven_3.6.1/conf/logging jenkins.maven3.agent.Maven35Main /var/jenkins_home/tools/hudson.tasks.Maven_MavenInstallation/Maven_3.6.1 /var/jenkins_home/war/WEB-INF/lib/remoting-3.29.jar /var/jenkins_home/plugins/maven-plugin/WEB-INF/lib/maven35-interceptor-1.13.jar /var/jenkins_home/plugins/maven-plugin/WEB-INF/lib/maven3-interceptor-commons-1.13.jar 38393

At the end of the build the following NPE occurs:

project.getRootModule()=hudson.maven.MavenModule@5a6caa8[PR-openHAB2-Addons/org.openhab.addons:org.openhab.addons.reactor][PR-openHAB2-Addons/org.openhab.addons:org.openhab.addons.reactor][relativePath:]
FATAL: null
java.lang.NullPointerException
    at hudson.maven.AbstractMavenBuilder.end(AbstractMavenBuilder.java:101)
    at hudson.maven.MavenModuleSetBuild$MavenModuleSetBuildExecution.doRun(MavenModuleSetBuild.java:885)
    at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:504)
    at hudson.model.Run.execute(Run.java:1818)
    at hudson.maven.MavenModuleSetBuild.run(MavenModuleSetBuild.java:543)
    at hudson.model.ResourceController.execute(ResourceController.java:97)
    at hudson.model.Executor.run(Executor.java:429)

According to the line number the NPE occurs while iterating over the MavenReporters.

The same issue/stacktrace is reported in the following open Jenkins issues:

kaikreuzer commented 4 years ago

Thanks for your investigations, that's indeed pretty interesting. Unfortunately the two Jenkins issues are already pretty old, so there's hardly any chance for a fix. They seem to talk about problems with parallel builds, and I noticed that we had the option "Parallele Builds ausführen, wenn notwendig" ("Execute concurrent builds if necessary") enabled. I am not sure whether this is related, but I have deactivated it now; let's see if it makes any difference. I also restarted the docker container, so the zombie process is gone again, and enabled the SAT checks for the PR builds - let's see what happens 🙏.

wborn commented 4 years ago

It looks like build 14910 was started in parallel while build 14909 was still running. Build 14910 was the first build for https://github.com/openhab/openhab2-addons/pull/1334, where the new org.openhab.binding.gpio Maven module was added as part of adapting the PR to bnd.

This module also shows up in the list of modules before the NPE is logged in the console output of build 14909. So a build might pick up modules from newer builds for which no MavenReporter was registered yet, causing the NPE.

So disabling the parallel builds setting on all jobs will probably work around the issue, but it also limits the number of concurrent builds per job to 1.
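
One possible way to flip that setting per job from the command line (a sketch, not what was actually done; authentication options are omitted) is to pull the job's config.xml, toggle the concurrentBuild element and push it back via the Jenkins CLI:

# Disable "Execute concurrent builds if necessary" for the PR job.
java -jar jenkins-cli.jar -s https://ci.openhab.org/ get-job PR-openHAB2-Addons \
  | sed 's#<concurrentBuild>true</concurrentBuild>#<concurrentBuild>false</concurrentBuild>#' \
  | java -jar jenkins-cli.jar -s https://ci.openhab.org/ update-job PR-openHAB2-Addons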

We might be able to reproduce the issue by triggering a build and then immediately triggering another build for the same job that adds a new Maven module.

wborn commented 4 years ago

Looks like there's again a build that has been stuck since Aug 30, eating swap space and causing builds to fail with "exit code 137" @kaikreuzer. See the processes.

kaikreuzer commented 4 years ago

Indeed, weird. Not sure what we can do about it, as I have reduced Jenkins to a single node, so it cannot ever do parallel executions... Well, restarting helped again.