sagemath / sage

Main repository of SageMath
https://www.sagemath.org
Other
1.44k stars 479 forks source link

doc build hangs on Cygwin #23973

Closed embray closed 7 years ago

embray commented 7 years ago

For a few weeks now (I haven't been able to pursue the issue much due to travel) I've had to take the Cygwin patchbot down due to an issue that didn't occur before where building the docs, particularly with multiple processes, hangs indefinitely.

For some reason, the problem appears to begin with this commit. Before this commit the docs build no problem. After this commit it will build several documents and then all the worker processes will just deadlock it seems.

Upstream PR: https://github.com/ivmai/bdwgc/pull/187

Upstream: Fixed upstream, but not in a stable release.

Component: porting: Cygwin

Keywords: windows cygwin ecl gc

Author: Erik Bray

Branch: 5863356

Reviewer: Jeroen Demeyer

Issue created by migration from https://trac.sagemath.org/ticket/23973

jdemeyer commented 7 years ago
comment:1

Does this happen after a failure or during a successful build?

embray commented 7 years ago

Description changed:

--- 
+++ 
@@ -1,3 +1,3 @@
 For a few weeks now (I haven't been able to pursue the issue much due to travel) I've had to take the Cygwin patchbot down due to an issue that didn't occur before where building the docs, particularly with multiple processes, hangs indefinitely.

-I'm still trying to track down the commit where this issue began.  But in the meantime testing on Cygwin has been stunted due to this.  I hope to get to the bottom of it ASAP.
+For some reason, the problem appears to begin with [this commit](https://github.com/sagemath/sagetrac-mirror/commit/8cb8a4d58dd74922a5b66d6b20f0b3a7507ea611).  Before this commit the docs build no problem.  After this commit it will build several documents and then all the worker processes will just deadlock it seems.
embray commented 7 years ago
comment:3

Hard to say, but now that I take a careful look at the doc build log I think there might be an unhandled failure of some sort. I just tried a new build from scratch and here's what I found: The document en/constructions, that was edited in that commit, starts building (again, this is a parallel build so the log is interleaved):

[dochtml] Building en/constructions.
[dochtml]
[dochtml] [construct] loading pickled environment... not yet created
[dochtml] [construct] Compiling the master document
[dochtml] [construct] building [mo]: targets for 0 po files that are out of date
[dochtml] [construct] building [html]: targets for 16 source files that are out of date
[dochtml] [construct] updating environment: 16 added, 0 changed, 0 removed
[dochtml] [construct] reading sources... [  6%] algebraic_geometry
[dochtml] [a_tour_of] pickling environment... done
[dochtml] [a_tour_of] checking consistency... done
[dochtml] [a_tour_of] preparing documents... done
[dochtml] [a_tour_of] writing output... [100%] index
[dochtml] [construct] reading sources... [ 12%] calculus
[dochtml] [a_tour_of] generating indices... genindex
[dochtml] [a_tour_of] Merging js index files...
[dochtml] [a_tour_of] ... done (108 js index entries)
[dochtml] [a_tour_of] Writing js search indexes...writing additional pages... search
[dochtml] [a_tour_of] copying images... [ 50%] sin_plot.png
[dochtml] [a_tour_of] copying images... [100%] eigen_plot.png
[dochtml] Insufficient memory for black list

It's not clear where "Insufficient memory for black list" is coming from but probably en/constructions. After that there's no more output from that worker, but the remaining documents get built successfully, and then the build hangs.

So I guess this is a matter of poor error handling. Reminds me I should finish my work to generalize DocTestDispatcher.parallel_dispatch :)

embray commented 7 years ago
comment:4

Well this is definitely strange. I don't know why it would give an "Insufficient memory" error. This error is apparently coming from gc which is trying to initialize about 1 MB, and it definitely should be able to do that...

jdemeyer commented 7 years ago
comment:5

Sorry for the obvious question, but are you sure that you are not actually running out of memory? It's well known that the Sage docbuilder requires a lot of memory.

embray commented 7 years ago
comment:6

Yeah it's a fair question, but I'm pretty sure not. I have 32GB available and got this even when closing most other applications. I watched memory usage during a build and it still did not even come close to using all my system's memory. In any case, it fails almost deterministically to make this one 1 MB allocation, so that seems unlikely. I fear it might be a more subtle bug with gc and/or ecl, possibly having to do with fork (it wouldn't be the first one I've encountered).

jdemeyer commented 7 years ago
comment:7

Yes, 32GB should be enough, unless you are using too many processes. Does it work when building the docs serially? See also #21389.

embray commented 7 years ago
comment:8

Yes, it seems to occur specifically when using multiprocessing (I'm only using 4 processes for the present test, but I will try manually undoing #21389 and seeing if it happens even with 1 process). Still, that all makes me think it's nothing to do with the doc build itself, though I'm failing to reproduce the issue outside that context.

jdemeyer commented 7 years ago
comment:9

Given the commit that you pointed to, it seems that something goes wrong when producing that plot in the context of multiprocessing.

Maybe a simple workaround would be to disallow docbuilding in parallel on Cygwin?

embray commented 7 years ago
comment:10

Well yes, clearly. But that would be shuffling the problem under the rug (which might be fine if I can't find something else out soon...). It's in reading the sources for that file where it crashes. But I can reproduce that plot fine on my own--even running that code directly using multiprocessing.Pool appears to work fine. What's also strange is I can't reproduce the issue if I run a parallel doc build with just that document, or with a small handful of documents.

The code for that plot calls PiecewiseFunction.fourier_series_partial_sum which is using maxima do some symbolic integrations, and that's specifically where the failure must be occurring. I'm running the doc build again with GC_PRINT_VERBOSE_STATS=1 to see if anything turns up...

embray commented 7 years ago
comment:11

There are a few potentially relevant bug fixes in libgc since the current version in Sage. I'm going to try merging #23700 (for starters) and see if that helps.

embray commented 7 years ago
comment:12

Unfortunately simply upgrading to gc 7.6 did not resolve the problem. All it gave me was a little additional error output but nothing immediately helpful.

embray commented 7 years ago
comment:13

Replying to @embray:

Unfortunately simply upgrading to gc 7.6 did not resolve the problem. All it gave me was a little additional error output but nothing immediately helpful.

I take it back. Looking at the error message it gave it actually couldn't have been using gc 7.6. In fact for some reason when I upgraded gc it did not replace the old DLL.

embray commented 7 years ago
comment:14

I got gc 7.6 installed correctly this time, but unfortunately it still did not resolve the problem. The error message output during the build of this document is now slightly more helpful but not by much:

GC Warning: Out of memory - trying to allocate requested amount (8224 bytes)...
Insufficient memory for black list
embray commented 7 years ago
comment:15

I think I may have run across a possible answer to this conundrum, from Hans Boehm himself: http://www.hpl.hp.com/hosted/linux/mail-archives/gc/2006-March/001214.html

Indeed, it seems gc on Cygwin is using sbrk to allocate memory, and this runs into the hard-coded upper limit on the data segment of the Cygwin process (most memory allocation in Cygwin uses mmap which avoids these issues). I'll try rebuilding gc with mmap support enabled (which it is not, by default, on Cygwin, though not for any particular reason apparently). That should probably fix it...

embray commented 7 years ago

Upstream: Not yet reported upstream; Will do shortly.

embray commented 7 years ago

Branch: u/embray/cygwin/ticket-23973

embray commented 7 years ago

Author: Erik Bray

embray commented 7 years ago

Changed keywords from none to windows cygwin ecl gc

embray commented 7 years ago
comment:16

Here's a full summary of the issue so far, and how the attached patch addresses it:


New commits:

188ed32Enable munmap/mmap on Cygwin, but fix issues with its implementation on Cygwin;
embray commented 7 years ago

Commit: 188ed32

embray commented 7 years ago

Changed upstream from Not yet reported upstream; Will do shortly. to Reported upstream. No feedback yet.

embray commented 7 years ago

Description changed:

--- 
+++ 
@@ -1,3 +1,5 @@
 For a few weeks now (I haven't been able to pursue the issue much due to travel) I've had to take the Cygwin patchbot down due to an issue that didn't occur before where building the docs, particularly with multiple processes, hangs indefinitely.

 For some reason, the problem appears to begin with [this commit](https://github.com/sagemath/sagetrac-mirror/commit/8cb8a4d58dd74922a5b66d6b20f0b3a7507ea611).  Before this commit the docs build no problem.  After this commit it will build several documents and then all the worker processes will just deadlock it seems.
+
+**Upstream PR:** https://github.com/ivmai/bdwgc/pull/187
jdemeyer commented 7 years ago
comment:18

Pardon my ignorance, but are you sure that CYGWIN32 is the correct macro to check?

7ed8c4ca-6d56-4ae9-953a-41e42b4ed313 commented 7 years ago

Changed commit from 188ed32 to 5863356

7ed8c4ca-6d56-4ae9-953a-41e42b4ed313 commented 7 years ago

Branch pushed to git repo; I updated commit sha1. New commits:

5863356Updated patch level
embray commented 7 years ago
comment:20

Replying to @jdemeyer:

Pardon my ignorance, but are you sure that CYGWIN32 is the correct macro to check?

Yeah that's just the macro they use internally for "cygwin". The "32" part is a misnomer.

jdemeyer commented 7 years ago
comment:21

Isn't __CYGWIN__ the official macro to check for Cygwin?

embray commented 7 years ago
comment:22

It doesn't matter--have a look at the source code for gc. It defines its own macros for platform checks. This is only consistent with their style.

jdemeyer commented 7 years ago

Reviewer: Jeroen Demeyer

jdemeyer commented 7 years ago
comment:23

OK. I trust you on this one.

embray commented 7 years ago
comment:24

It would be helpful to have feedback from upstream in case there's somehow something outright wrong about this approach, but it at least resolved the problem, so...

vbraun commented 7 years ago

Changed branch from u/embray/cygwin/ticket-23973 to 5863356

embray commented 7 years ago

Changed upstream from Reported upstream. No feedback yet. to Fixed upstream, but not in a stable release.

embray commented 7 years ago

Changed commit from 5863356 to none