Closed bcarrez closed 7 years ago
Let's add a nightly cleanup job that runs killall on the *_test processes :)
Is there any log I could use to investigate?
Found it, this is my fault (sorry):
I fixed the offending script, assuming it is the only one. I'll monitor the builds to see if there's more.
I'll try to add a mechanism that detects uncaught Python exceptions during test execution and fails the script when one occurs.
With all my apologies for killing the CI engine :-/
Hi Maxime,
This is great news. This evening I still had to manually kill some Compliant_test processes running in an endless loop. Now the MacBuilder is "knocked out" for the night because there is no space left on the device, and we need to log in manually to clean it up.
If you are curious here are the builds: 35GB https://ci.inria.fr/sofa-ci/job/mac_clang-3.4_options/4959/console 41GB https://ci.inria.fr/sofa-ci/job/mac_clang-3.4_options/4958/console
I hope your fix will remove this kind of problem.
More than 40 GB of text!!! On top of adding a stopping criterion based on execution time, we could also add a maximum size for the logging buffer.
I’m currently cleaning the ci infra disks, before anyone else notices… ;)
On 28 March 2017 at 09:36, Matthieu Nesme notifications@github.com wrote:
More than 40 GB of text!!! On top of adding a stopping criterion based on execution time, we could also add a maximum size for the logging buffer.
Bruno Carrez SED Inria Lille-Nord Europe bruno.carrez@inria.fr
Well, I found a fix: it uses sys.excepthook to register a top-level handler for uncaught Python errors.
In this case I simply abort the test. We should be fine now, but just in case we should also:
I'll look into the second part.
And again, my deepest apologies.
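For readers who want to see what such a handler can look like, here is a minimal sketch. It is not the actual fix committed to SOFA; the function name `make_abort_hook` and its parameters are hypothetical, and it assumes the test script runs in an interpreter where replacing `sys.excepthook` is allowed.

```python
import os
import sys
import traceback


def make_abort_hook(exit_func=os._exit, stream=None):
    """Build a sys.excepthook replacement that aborts on uncaught errors.

    `exit_func` and `stream` are hypothetical parameters, injectable so the
    hook can be exercised without really terminating the interpreter.
    """
    def hook(exc_type, exc_value, exc_traceback):
        out = stream if stream is not None else sys.stderr
        # Report the error first so it shows up in the CI log.
        traceback.print_exception(exc_type, exc_value, exc_traceback, file=out)
        out.flush()
        # os._exit ends the process immediately with a failure code, so an
        # enclosing loop cannot keep running (and logging) after the error.
        exit_func(1)
    return hook
```

Installing it is a single assignment at the top of the test script: `sys.excepthook = make_abort_hook()`.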
Hi Maxime,
These things happen, so don't worry. Fortunately you quickly found a fix and, as far as I understand, this has not propagated to all the SOFA branches (which would have been more annoying).
Congratulations, Bruno @bcarrez, on the new log messages!!
New build job https://ci.inria.fr/sofa-ci/job/mac_clang-3.4_options/4965/console failed early due to no space left on device :-/
Killing all Compliant_test and SofaBoundaryCondition_test processes freed 60GB instantly on the disk…
On 28 March 2017 at 11:04, Maxime Tournier notifications@github.com wrote:
New build job https://ci.inria.fr/sofa-ci/job/mac_clang-3.4_options/4965/console failed early due to no space left on device :-/
The test timeout should be handled directly in the CI bash scripts, by running the gtest executables through the timeout command, which can even send a specific signal so that a dedicated message can be printed on the dashboard.
https://www.gnu.org/software/coreutils/manual/html_node/timeout-invocation.html
Maybe it makes more sense for @guparan or @bcarrez to have a look at it rather than @maxime-tournier?
Jenkins has a timeout feature to abort builds that run too long. This feature is enabled on all our jobs. The question is: why are these builds not aborted by Jenkins?
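As an illustration of the absolute timeout idea, here is a rough Python equivalent of wrapping a test executable with `timeout`. It is only a sketch: the actual CI scripts are bash, `run_with_timeout` is a hypothetical helper, and the return value 124 simply mirrors the exit status GNU `timeout` uses when the command times out.

```python
import subprocess
import sys


def run_with_timeout(command, timeout_seconds):
    """Run `command`; return its exit code, or 124 if it ran too long."""
    try:
        # subprocess.run kills the child and waits for it before raising
        # TimeoutExpired, so no test process is left behind.
        completed = subprocess.run(command, timeout=timeout_seconds)
        return completed.returncode
    except subprocess.TimeoutExpired:
        print("TIMEOUT after %s s: %s" % (timeout_seconds, command),
              file=sys.stderr)
        return 124
```

For example, `run_with_timeout(["./Compliant_test"], 3 * 3600)` would give the test three hours of wall-clock time, regardless of how much it prints.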
Timeout on tests was disabled in 2bc5db53 with commit message "I have the intuition that this timeout mess might be the reason why continuous builds on Windows are so long. Let's see..." Should we re-enable this?
@guparan the timeout in Jenkins is a "no activity" timeout, not an absolute one... an application that constantly prints to its output is not halted. This is clearly not enough.
We probably need both logics:
EDIT: one thing to add: a timeout is not a panacea, because a '3 hours timeout' (a sound value) still allows problematic builds to cause a lot of harm in the CI. Our log files are always far below 100 MB unless something goes really wrong. So my suggestion is to add some checks on the log file size to detect and cancel offending builds.
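To make the log-size suggestion concrete, here is a sketch that combines an absolute timeout with a cap on the amount of output written to the log. The helper `run_with_output_cap` and its status strings are hypothetical, not part of the SOFA CI scripts. It also assumes the test does not leave child processes holding its output pipe open.

```python
import subprocess
import sys
import threading


def run_with_output_cap(command, log_path, max_bytes, timeout_seconds):
    """Run `command`, copying its output to `log_path`.

    Returns "ok", "failed", "timeout" or "log-too-big".
    """
    with open(log_path, "wb") as log_file:
        process = subprocess.Popen(command, stdout=subprocess.PIPE,
                                   stderr=subprocess.STDOUT)
        timed_out = threading.Event()

        def on_timeout():
            timed_out.set()
            process.kill()

        # Absolute wall-clock limit, independent of output activity.
        timer = threading.Timer(timeout_seconds, on_timeout)
        timer.start()
        written = 0
        too_big = False
        try:
            while True:
                chunk = process.stdout.read1(65536)
                if not chunk:  # end of output: the process exited or was killed
                    break
                written += len(chunk)
                if written > max_bytes:
                    # The log would exceed the cap: stop the offending test.
                    too_big = True
                    process.kill()
                    break
                log_file.write(chunk)
        finally:
            timer.cancel()
            process.stdout.close()
            process.wait()
    if too_big:
        return "log-too-big"
    if timed_out.is_set():
        return "timeout"
    return "ok" if process.returncode == 0 else "failed"
```

With the numbers from this thread, a call could look like `run_with_output_cap(["./Compliant_test"], "test.log", 100 * 1024 * 1024, 3 * 3600)`.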
Hello,
We still have a problem with Compliant_test, which is not killed by Jenkins: https://ci.inria.fr/sofa-ci/job/ubuntu_gcc-5.4_options/44/console
Maybe it comes from Python_test.cpp, more specifically from Python_scene_test::run. There is an infinite loop that probably never exits when Python is stopped because of an error.
If someone has time to make this more robust, any PR will be appreciated.
I'm closing this one... we will re-open it if we find new problems.
We regularly have build slaves that stop responding on CentOS, Windows and macOS.
After investigation, this is always caused by zombie test processes (Compliant_test) that disturb the daily cleaning job, and therefore the whole build machine. The only way to fix it for a while is to kill all processes on the corresponding slave, wipe the work folder and restart the slave.
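The nightly "killall *_test" cleanup suggested at the start of this thread could be sketched as follows. This is an illustration only: `find_test_pids` and `kill_leftover_tests` are hypothetical helpers. The sketch covers POSIX slaves only, since it relies on `ps` and `SIGKILL`, and it assumes executable paths contain no spaces.

```python
import os
import signal
import subprocess


def find_test_pids(ps_output, suffix="_test"):
    """Extract PIDs of processes whose executable name ends with `suffix`.

    `ps_output` is expected to contain one "PID COMMAND [ARGS...]" per line.
    """
    pids = []
    for line in ps_output.splitlines():
        parts = line.split(None, 1)
        if len(parts) != 2 or not parts[0].isdigit():
            continue
        # First token of the command line, without its directory.
        executable = os.path.basename(parts[1].split()[0])
        if executable.endswith(suffix):
            pids.append(int(parts[0]))
    return pids


def kill_leftover_tests(suffix="_test"):
    """Kill every leftover test process visible to the current user."""
    # "args" is used instead of "comm" because Linux truncates "comm" to 15
    # characters, which would cut long names such as
    # SofaBoundaryCondition_test.
    ps_output = subprocess.run(
        ["ps", "-e", "-o", "pid=", "-o", "args="],
        capture_output=True, text=True, check=True,
    ).stdout
    for pid in find_test_pids(ps_output, suffix):
        try:
            os.kill(pid, signal.SIGKILL)
        except (ProcessLookupError, PermissionError):
            pass  # already gone, or owned by another user
```

Such a script would only be safe to run while no build is in progress on the slave, since it cannot tell a runaway test from a legitimate one.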