vbraun closed 7 years ago
Full log (but really nothing interesting) at http://build.sagemath.org/sage/builders/%20%20fast%20AIMS%20snapperkob%20%28Ubuntu%2012.04%20x86_64%29%20incremental/builds/28/steps/shell_3/logs/stdio
My guess would be that the forked process writes a temp file, and the parent tries to read it after the fork quits. That is inherently racy since the sage cleaner will attempt to delete the child's temp files, as the child has a different pid.
Changed keywords from none to random_fail
Still happens occasionally
Seen this today for the first time ever, I think. (Sage 7.3.rc0)
Replying to @nexttime:
Seen this today for the first time ever, I think. (Sage 7.3.rc0)
P.S.: Same error, different test:
sage -t --long --warn-long 68.2 src/sage/homology/simplicial_complex.py
**********************************************************************
File "src/sage/homology/simplicial_complex.py", line 2813, in sage.homology.simplicial_complex.SimplicialComplex.is_cohen_macaulay
Failed example:
X.is_cohen_macaulay(ZZ)
Expected:
False
Got:
[Errno 2] No such file or directory: '/home/leif/.sage/temp/tunguska/16183/dir_5o8pnH/16223.out'
False
**********************************************************************
I've seen this a few times recently, too.
I wonder if it would help if the tests all used just 1 CPU.
Replying to @jhpalmieri:
I wonder if it would help if the tests all used just 1 CPU.
Maybe, but that's not fixing the problem, just hiding it.
If you want to "fix" the problem but not hide it, just add # known bug.
Replying to @jdemeyer:
Replying to @jhpalmieri:
I wonder if it would help if the tests all used just 1 CPU.
Maybe, but that's not fixing the problem, just hiding it.
If you want to "fix" the problem but not hide it, just add # known bug.
At least it would be good to know what part of the code in question writes temp files with extension .out. (In my admittedly limited experience with parallel code I never saw slaves doing any file I/O; if they do, they ought to clean up after themselves, otherwise there is no telling what will happen.)
It would be also nice to know what kinds of systems produce the error: OS, number of CPUs, file system, etc.
Replying to @jhpalmieri:
It would be also nice to know what kinds of systems produce the error: OS, number of CPUs, file system, etc.
As a rule, I get it on a Gentoo Linux laptop with a 4-core Intel i7 and the usual ext4 file system on an SSD.
See #22462.
Replying to @jhpalmieri:
It would be also nice to know what kinds of systems produce the error: OS, number of CPUs, file system, etc.
I almost always get this problem when running MAKE='make -j8' make ptestlong on Ubuntu 16.04, with 8 CPUs and an ext4 file system.
I am now running make testlong serially to see the difference.
I get All tests passed! with make testlong.
Dependencies: #22462
Author: Jeroen Demeyer
Description changed:
---
+++
@@ -12,4 +12,13 @@
[Errno 2] No such file or directory: '/home/buildbot/build/sage/snapperkob/sage_git/dot_sage/temp/snapperkob/10634/dir_n0BDmn/10759.out'
False
-Apparently the code uses @parallel and there was already a race fixed in #14150.
+
+This is because of a race condition in @parallel. This is a parallel generator which forks processes, each process handling one item of the generator. The output of each finished process is stored as a pickle in a working directory and then yielded by the main process.
+
+When the generator is closed (for example, the generator is used as argument to all() and a False condition is found), the following happens in a finally block:
+
+1. The working directory is removed.
+
+2. The remaining processes are killed.
+
+This is a race condition because it can happen that a subprocess finishes between these steps. Then that process wants to write its output in the deleted directory. The fix is obvious: first kill the processes, then delete the directory.
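To make the "generator is closed" step concrete, here is a minimal sketch. It does not use Sage or fork anything: results is a hypothetical stand-in for the @parallel generator, and log is just a list used to observe when the finally block runs. It shows that all() stops at the first False while the generator is still suspended, and that the cleanup only runs once the generator is closed.

```python
def results(log):
    """Hypothetical stand-in for a @parallel-style generator (assumption)."""
    try:
        for value in (True, False, True):
            yield value
    finally:
        # Runs when the generator is exhausted or closed early.
        log.append("cleanup")


log = []
gen = results(log)

found_all = all(gen)          # all() stops consuming at the first False
log_before_close = list(log)  # the finally block has not run yet

gen.close()                   # raises GeneratorExit inside the generator
```

In CPython the same close happens implicitly when the last reference to the suspended generator goes away, which is why the cleanup can run while worker processes are still alive.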
Branch: u/jdemeyer/ticket/15585
On a machine that almost always gives the error on is_cohen_macaulay, I get All tests passed! on a single run of MAKE='make -j6' make ptestlong.
On a second run of make ptestlong, I still do not get the error. -> Great! Positive review.
Reviewer: Sébastien Labbé
Thanks for investigating this. I've been seeing this problem too, but thought it was a weird case of the sage-cleaner being overly aggressive for some reason.
Changed branch from u/jdemeyer/ticket/15585 to 51b2030
This is fairly unlikely but occasionally comes up on the buildbot:
This is because of a race condition in @parallel. This is a parallel generator which forks processes, each process handling one item of the generator. The output of each finished process is stored as a pickle in a working directory and then yielded by the main process.

When the generator is closed (for example, the generator is used as argument to all() and a False condition is found), the following happens in a finally block:

1. The working directory is removed.

2. The remaining processes are killed.

This is a race condition because it can happen that a subprocess finishes between these steps. Then that process wants to write its output in the deleted directory. The fix is obvious: first kill the processes, then delete the directory.
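The ordering described above (kill first, delete second) can be sketched roughly as follows. This is not the actual Sage code: start_workers and cleanup are hypothetical names, and the workers are plain subprocess.Popen sleepers rather than forked processes writing pickles. The only point is the order of the two cleanup steps.

```python
import os
import shutil
import subprocess
import sys
import tempfile


def start_workers(n):
    """Start n placeholder workers that just sleep (assumption: stand-ins
    for the forked processes that would write output into workdir)."""
    cmd = [sys.executable, "-c", "import time; time.sleep(60)"]
    return [subprocess.Popen(cmd) for _ in range(n)]


def cleanup(workers, workdir):
    # Step 1: kill the remaining workers and wait until they have exited,
    # so none of them can still try to write into workdir.
    for proc in workers:
        proc.kill()
    for proc in workers:
        proc.wait()
    # Step 2: only now remove the working directory.
    shutil.rmtree(workdir, ignore_errors=True)


workdir = tempfile.mkdtemp()
workers = start_workers(2)
try:
    pass  # results would be collected here
finally:
    cleanup(workers, workdir)
```

With the two steps swapped, a worker that finishes between them would find its output directory gone, which matches the "[Errno 2] No such file or directory" message seen in the doctest.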
Depends on #22462
CC: @jdemeyer @roed314 @seblabbe
Component: algebra
Keywords: random_fail
Author: Jeroen Demeyer
Branch/Commit: 51b2030
Reviewer: Sébastien Labbé
Issue created by migration from https://trac.sagemath.org/ticket/15585