sofa-framework / sofa

Real-time multi-physics simulation with an emphasis on medical simulation.
https://www.sofa-framework.org
GNU Lesser General Public License v2.1

[CI] zombie test jobs #221

Closed bcarrez closed 7 years ago

bcarrez commented 7 years ago

We regularly have build slaves not responding on CentOS, Windows and macOS.

After investigation, this is always caused by zombie test processes (Compliant_test) that disrupt the daily cleaning job, and therefore the whole build machine. The only way to fix it, for a while at least, is to kill all processes on the corresponding slave, wipe the work folder and restart the slave.

damienmarchal commented 7 years ago

Let's add a killall *_test cleaning job that runs every night :)

maxime-tournier commented 7 years ago

Is there any log I could use to investigate?

maxime-tournier commented 7 years ago

Found it, this is my fault (sorry):

I fixed the offending script, assuming it is the only one. I'll monitor the builds to see if there's more.

I'll try to add some mechanism to detect uncaught python exceptions during test execution, and fail the script should this happen.

With all my apologies for killing the CI engine :-/

damienmarchal commented 7 years ago

Hi Maxime,

This is great news. This evening I still had to manually kill some Compliant_test processes stuck in an endless loop. Now the MacBuilder is "knocked out" for the night because there is no space left on the device, and we need to log in manually to clean it up.

If you are curious, here are the builds: 35 GB (https://ci.inria.fr/sofa-ci/job/mac_clang-3.4_options/4959/console) and 41 GB (https://ci.inria.fr/sofa-ci/job/mac_clang-3.4_options/4958/console).

I hope your fix will remove this kind of problem.

matthieu-nesme commented 7 years ago

More than 40 GB of text!!! On top of adding a stopping criterion based on execution time, we could also add a maximum size for the logging buffer.

bcarrez commented 7 years ago

I’m currently cleaning the ci infra disks, before anyone else notices… ;)

On 28 March 2017 at 09:36, Matthieu Nesme notifications@github.com wrote:

More than 40 GB of text!!! On top of adding a stopping criterion based on execution time, we could also add a maximum size for the logging buffer.



Bruno Carrez SED Inria Lille-Nord Europe bruno.carrez@inria.fr

maxime-tournier commented 7 years ago

Well, I found a fix: it uses sys.excepthook to register a top-level handler for uncaught Python errors.

In this case I simply abort the test. We should be fine now, but just in case we should also:

  1. limit logging size
  2. limit simulation time for tests

I'll look into the second part.
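For readers unfamiliar with the mechanism, here is a minimal sketch of a top-level handler installed through sys.excepthook. It is not the code that went into the fix: the names make_abort_hook and install_abort_hook are hypothetical, and using os._exit(1) to abort is an assumption about what "abort the test" could look like.

```python
import os
import sys
import traceback


def make_abort_hook(exit_fn=os._exit, stream=None):
    """Build a handler that reports an uncaught exception and then aborts.

    exit_fn and stream are injectable (an assumption made for testability);
    by default the process is terminated immediately with exit status 1.
    """
    def hook(exc_type, exc_value, exc_tb):
        out = stream if stream is not None else sys.stderr
        # Print the full traceback so the failure is visible in the CI log.
        traceback.print_exception(exc_type, exc_value, exc_tb, file=out)
        out.flush()
        # Abort with a non-zero status instead of letting the caller loop on.
        exit_fn(1)
    return hook


def install_abort_hook():
    """Register the handler for every uncaught exception in this interpreter."""
    sys.excepthook = make_abort_hook()
```

Calling install_abort_hook() once at the start of a test script would be enough; os._exit is used rather than sys.exit because raising SystemExit from inside the hook itself is not a reliable way to stop the process.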

And again, my deepest apologies.

damienmarchal commented 7 years ago

Hi Maxime,

These things happen, so don't worry. Fortunately you quickly found a fix and, as far as I understand, this has not propagated to all the sofa branches (which would have been more annoying).

damienmarchal commented 7 years ago

Congratulations, Bruno @bcarrez, on the new log messages!!

maxime-tournier commented 7 years ago

New build job https://ci.inria.fr/sofa-ci/job/mac_clang-3.4_options/4965/console failed early due to no space left on device :-/

bcarrez commented 7 years ago

Killing all Compliant_test and SofaBoundaryCondition_test processes freed 60GB instantly on the disk…

On 28 March 2017 at 11:04, Maxime Tournier notifications@github.com wrote:

New build job https://ci.inria.fr/sofa-ci/job/mac_clang-3.4_options/4965/console failed early due to no space left on device :-/




matthieu-nesme commented 7 years ago

The test timeout should be handled directly in the CI bash scripts, by running the gtest executables through the timeout command, which can even send a chosen signal so that a specific message can be printed on the dashboard. https://www.gnu.org/software/coreutils/manual/html_node/timeout-invocation.html
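To illustrate the idea of an absolute per-test time limit, here is a small sketch. It swaps the coreutils timeout command for Python's subprocess module, so it is an analogue of the proposal rather than the proposed bash change; the function name run_with_timeout is hypothetical, and returning 124 simply mirrors the exit status GNU timeout uses when the limit is hit.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd and return its exit code, or 124 if it ran longer than timeout_s."""
    try:
        return subprocess.run(cmd, timeout=timeout_s).returncode
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before re-raising TimeoutExpired,
        # so no zombie test process is left behind by this call.
        print(f"TIMEOUT after {timeout_s}s: {' '.join(cmd)}", file=sys.stderr)
        return 124
```

Note that only the direct child is killed; a test that spawns its own subprocesses would need process-group handling on top of this.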

Maybe it makes more sense that @guparan or @bcarrez have a look at it rather than @maxime-tournier?

guparan commented 7 years ago

Jenkins has a timeout feature to abort builds that run too long. This feature is enabled on all our jobs. The question is: why are these builds not aborted by Jenkins?

guparan commented 7 years ago

Timeout on tests was disabled in 2bc5db53 with commit message "I have the intuition that this timeout mess might be the reason why continuous builds on Windows are so long. Let's see..." Should we re-enable this?

damienmarchal commented 7 years ago

@guparan the timeout in Jenkins is a "no activity" timeout, not an absolute one... an application that constantly prints to its output is never halted. This is clearly not enough.

We probably need both logics:

EDIT: one thing to add: a timeout is not a panacea, because a '3 hours timeout' (a sound value) still allows a problematic build to cause a lot of harm in the CI. Our log files are always far below 100 MB unless something goes really wrong. So my suggestion is to add checks on the log file size to detect and cancel offending builds.
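The log-size check suggested above could look roughly like the sketch below. It is only an illustration under assumptions: the wrapper run_with_log_cap is a hypothetical name, it assumes the CI launches each test through such a wrapper, and the cap value is left to the caller (the 100 MB figure above would be a natural choice).

```python
import subprocess


def run_with_log_cap(cmd, log_path, max_bytes):
    """Run cmd, copy its output to log_path, and kill it once max_bytes is reached.

    Returns (exit_code, bytes_written, capped).
    """
    written = 0
    capped = False
    with open(log_path, "wb") as log:
        proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                                stderr=subprocess.STDOUT)
        try:
            while True:
                chunk = proc.stdout.read1(65536)
                if not chunk:
                    break  # end of output: the process closed its stdout
                if written + len(chunk) > max_bytes:
                    # Keep the log at exactly max_bytes, then stop the offender.
                    log.write(chunk[: max_bytes - written])
                    written = max_bytes
                    capped = True
                    proc.kill()
                    break
                log.write(chunk)
                written += len(chunk)
        except BaseException:
            proc.kill()
            raise
        finally:
            proc.stdout.close()
            code = proc.wait()
    return code, written, capped
```

Unlike a time limit, this stops a runaway test as soon as it has produced too much output, which is the failure mode that filled the disks here.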

damienmarchal commented 7 years ago

Hello,

We still have a problem with Compliant_test, which is not killed by Jenkins: https://ci.inria.fr/sofa-ci/job/ubuntu_gcc-5.4_options/44/console

Maybe it comes from Python_test.cpp, and more specifically from Python_scene_test::run. There is an infinite loop there that probably never exits when Python is stopped because of an error.

If someone has time to make this more robust, any PR will be appreciated.

damienmarchal commented 7 years ago

I'm closing this one... we will re-open it if we find new problems.