sagemath / sage

Main repository of SageMath
https://www.sagemath.org

Random failure in SimplicialComplex.is_cohen_macaulay #15585

vbraun closed 7 years ago

vbraun commented 10 years ago

This is fairly unlikely but occasionally comes up on the buildbot:

sage -t --long src/sage/homology/simplicial_complex.py
**********************************************************************
File "src/sage/homology/simplicial_complex.py", line 2236, in sage.homology.simplicial_complex.SimplicialComplex.is_cohen_macaulay
Failed example:
    S.is_cohen_macaulay(ncpus=3)
Expected:
    False
Got:
    [Errno 2] No such file or directory: '/home/buildbot/build/sage/snapperkob/sage_git/dot_sage/temp/snapperkob/10634/dir_n0BDmn/10759.out'
    False

This is because of a race condition in @parallel. This is a parallel generator which forks processes, each process handling one item of the generator. The output of each finished process is stored as a pickle in a working directory and then yielded by the main process.

When the generator is closed (for example, the generator is used as argument to all() and a False condition is found), the following happens in a finally block:

  1. The working directory is removed.

  2. The remaining processes are killed.

This is a race condition because it can happen that a subprocess finishes between these steps. Then that process wants to write its output in the deleted directory. The fix is obvious: first kill the processes, then delete the directory.
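
The ordering described above can be sketched in plain Python. This is only a simplified illustration under stated assumptions, not Sage's actual p_iter_fork code: the names `parallel_results` and `is_positive` are hypothetical, and the `<pid>.out` file naming is inferred from the paths in the error message. Each child pickles its result into a working directory, and the `finally` block kills and reaps the remaining children before it removes that directory.

```python
import os
import pickle
import shutil
import signal
import tempfile
import time


def parallel_results(func, inputs):
    """Yield func(x) for each x, computing every item in a forked child.

    Hypothetical sketch of a fork-based parallel generator (POSIX only).
    """
    workdir = tempfile.mkdtemp()
    children = {}  # pid -> input handled by that child
    try:
        for x in inputs:
            pid = os.fork()
            if pid == 0:
                # Child: pickle the result to <workdir>/<pid>.out, then exit
                # without running any of the parent's cleanup code.
                status = 0
                try:
                    out = os.path.join(workdir, "%d.out" % os.getpid())
                    with open(out, "wb") as f:
                        pickle.dump(func(x), f)
                except BaseException:
                    status = 1
                finally:
                    os._exit(status)
            children[pid] = x
        while children:
            pid, _ = os.wait()  # block until some child has finished
            if pid not in children:
                continue
            del children[pid]
            with open(os.path.join(workdir, "%d.out" % pid), "rb") as f:
                yield pickle.load(f)
    finally:
        # Runs when the generator is exhausted or closed early.
        # Step 1: kill and reap every child that is still running.
        for pid in children:
            try:
                os.kill(pid, signal.SIGKILL)
                os.waitpid(pid, 0)
            except OSError:
                pass
        # Step 2: only now remove the directory the children write to.
        shutil.rmtree(workdir, ignore_errors=True)


def is_positive(x):
    time.sleep(0.1 * x)  # larger inputs take longer to finish
    return x > 0


if __name__ == "__main__":
    # all() stops at the first False; the abandoned generator is then
    # closed, which triggers the finally block above.
    print(all(parallel_results(is_positive, [0, 5, 5])))  # prints False
```

With the two steps in the opposite order, a child that finishes after the directory is removed but before it is killed would fail to open its output file, which matches the `[Errno 2]` message in the doctest output.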

Depends on #22462

CC: @jdemeyer @roed314 @seblabbe

Component: algebra

Keywords: random_fail

Author: Jeroen Demeyer

Branch/Commit: 51b2030

Reviewer: Sébastien Labbé

Issue created by migration from https://trac.sagemath.org/ticket/15585

vbraun commented 10 years ago
comment:1

Full log (but really nothing interesting) at http://build.sagemath.org/sage/builders/%20%20fast%20AIMS%20snapperkob%20%28Ubuntu%2012.04%20x86_64%29%20incremental/builds/28/steps/shell_3/logs/stdio

vbraun commented 10 years ago
comment:2

My guess would be that the forked process writes a temp file, and the parent tries to read it after the fork quits. That is inherently racy, since the sage cleaner will attempt to delete the child's temp files because the child has a different pid.

vbraun commented 10 years ago

Changed keywords from none to random_fail

vbraun commented 8 years ago
comment:7

Still happens occasionally

83660e46-0051-498b-a8c1-f7a7bd232b5a commented 8 years ago
comment:8

Seen this today for the first time ever, I think. (Sage 7.3.rc0)

83660e46-0051-498b-a8c1-f7a7bd232b5a commented 8 years ago
comment:9

Replying to @nexttime:

Seen this today for the first time ever, I think. (Sage 7.3.rc0)

P.S.: Same error, different test:

sage -t --long --warn-long 68.2 src/sage/homology/simplicial_complex.py
**********************************************************************
File "src/sage/homology/simplicial_complex.py", line 2813, in sage.homology.simplicial_complex.SimplicialComplex.is_cohen_macaulay
Failed example:
    X.is_cohen_macaulay(ZZ)
Expected:
    False
Got:
    [Errno 2] No such file or directory: '/home/leif/.sage/temp/tunguska/16183/dir_5o8pnH/16223.out'
    False
**********************************************************************

jhpalmieri commented 8 years ago
comment:10

I've seen this a few times recently, too.

dimpase commented 7 years ago
comment:12

Still there in 7.5 and the 7.6 betas.

jhpalmieri commented 7 years ago
comment:13

I wonder if it would help if the tests all used just 1 CPU.

jdemeyer commented 7 years ago
comment:14

Replying to @jhpalmieri:

I wonder if it would help if the tests all used just 1 CPU.

Maybe, but that's not fixing the problem, just hiding it.

If you want to "fix" the problem but not hide it, just add # known bug.

dimpase commented 7 years ago
comment:15

Replying to @jdemeyer:

Replying to @jhpalmieri:

I wonder if it would help if the tests all used just 1 CPU.

Maybe, but that's not fixing the problem, just hiding it.

If you want to "fix" the problem but not hide it, just add # known bug.

At least it would be good to know what part of the code in question writes temp files with extension .out. (In my admittedly limited experience with parallel code, I never saw slaves doing any file I/O; if they do, they ought to clean up after themselves, otherwise there is no telling what will happen.)

jhpalmieri commented 7 years ago
comment:16

It would also be nice to know what kinds of systems produce the error: OS, number of CPUs, file system, etc.

dimpase commented 7 years ago
comment:17

Replying to @jhpalmieri:

It would also be nice to know what kinds of systems produce the error: OS, number of CPUs, file system, etc.

As a rule, I get it on a Gentoo Linux laptop with a 4-core Intel i7 and the usual ext4 file system on an SSD.

jdemeyer commented 7 years ago
comment:18

See #22462.

seblabbe commented 7 years ago
comment:19

Replying to @jhpalmieri:

It would also be nice to know what kinds of systems produce the error: OS, number of CPUs, file system, etc.

I almost always get this problem when running MAKE='make -j8' make ptestlong on Ubuntu 16.04 with 8 CPUs and an ext4 file system.

I am now running make testlong serially to see the difference.

seblabbe commented 7 years ago
comment:20

I am now running make testlong serially to see the difference.

I get All tests passed! with make testlong.

jdemeyer commented 7 years ago

Dependencies: #22462

jdemeyer commented 7 years ago

Author: Jeroen Demeyer

jdemeyer commented 7 years ago

Description changed:

--- 
+++ 
@@ -12,4 +12,13 @@
     [Errno 2] No such file or directory: '/home/buildbot/build/sage/snapperkob/sage_git/dot_sage/temp/snapperkob/10634/dir_n0BDmn/10759.out'
     False

-Apparently the code uses @parallel and there was already a race fixed in #14150.
+
+This is because a race condition in @parallel. This is a parallel generator which forks processes, each process handling one item of the generator. The output of each finished process is stored as pickle in a working directory and then yielded by the main process.
+
+When the generator is closed (for example, the generator is used as argument to all() and a False condition is found), the following happens in a finally block:
+
+1. The working directory is removed.
+
+2. The remaining processes are killed.
+
+This is a race condition because it can happen that a subprocess finishes between these steps. Then that process wants to write its output in the deleted directory. The fix is obvious: first kill the processes, then delete the directory.

jdemeyer commented 7 years ago

Branch: u/jdemeyer/ticket/15585

jdemeyer commented 7 years ago

Commit: 51b2030

jdemeyer commented 7 years ago

New commits:

8f7ff57 Use ContainChildren to implement p_iter_fork
a4dddcc Further fixes to use_fork
51b2030 Fix race condition is p_iter_fork

seblabbe commented 7 years ago
comment:25

On a machine that almost always gives the error on is_cohen_macaulay, I get All tests passed! on a single run of MAKE='make -j6' make ptestlong.

seblabbe commented 7 years ago
comment:26

On a second run of make ptestlong, I still do not get the error. -> Great! Positive review.

seblabbe commented 7 years ago

Reviewer: Sébastien Labbé

embray commented 7 years ago
comment:28

Thanks for investigating this. I've been seeing this problem too, but thought it was a weird case of the sage-cleaner being overly aggressive for some reason.

vbraun commented 7 years ago

Changed branch from u/jdemeyer/ticket/15585 to 51b2030