doc build hangs on Cygwin

embray commented 7 years ago

For a few weeks now (I haven't been able to pursue the issue much due to travel) I've had to take the Cygwin patchbot down due to an issue that didn't occur before where building the docs, particularly with multiple processes, hangs indefinitely.

For some reason, the problem appears to begin with this commit. Before this commit the docs build no problem. After this commit it will build several documents and then all the worker processes will just deadlock it seems.

Upstream PR: https://github.com/ivmai/bdwgc/pull/187

Upstream: Fixed upstream, but not in a stable release.

Component: porting: Cygwin

Keywords: windows cygwin ecl gc

Author: Erik Bray

Branch: 5863356

Reviewer: Jeroen Demeyer

Issue created by migration from https://trac.sagemath.org/ticket/23973

jdemeyer commented 7 years ago

comment:1

Does this happen after a failure or during a successful build?

embray commented 7 years ago

Description changed:

--- 
+++ 
@@ -1,3 +1,3 @@
 For a few weeks now (I haven't been able to pursue the issue much due to travel) I've had to take the Cygwin patchbot down due to an issue that didn't occur before where building the docs, particularly with multiple processes, hangs indefinitely.

-I'm still trying to track down the commit where this issue began.  But in the meantime testing on Cygwin has been stunted due to this.  I hope to get to the bottom of it ASAP.
+For some reason, the problem appears to begin with [this commit](https://github.com/sagemath/sagetrac-mirror/commit/8cb8a4d58dd74922a5b66d6b20f0b3a7507ea611).  Before this commit the docs build no problem.  After this commit it will build several documents and then all the worker processes will just deadlock it seems.

embray commented 7 years ago

comment:3

Hard to say, but now that I take a careful look at the doc build log I think there might be an unhandled failure of some sort. I just tried a new build from scratch and here's what I found: The document en/constructions, that was edited in that commit, starts building (again, this is a parallel build so the log is interleaved):

[dochtml] Building en/constructions.
[dochtml]
[dochtml] [construct] loading pickled environment... not yet created
[dochtml] [construct] Compiling the master document
[dochtml] [construct] building [mo]: targets for 0 po files that are out of date
[dochtml] [construct] building [html]: targets for 16 source files that are out of date
[dochtml] [construct] updating environment: 16 added, 0 changed, 0 removed
[dochtml] [construct] reading sources... [  6%] algebraic_geometry
[dochtml] [a_tour_of] pickling environment... done
[dochtml] [a_tour_of] checking consistency... done
[dochtml] [a_tour_of] preparing documents... done
[dochtml] [a_tour_of] writing output... [100%] index
[dochtml] [construct] reading sources... [ 12%] calculus
[dochtml] [a_tour_of] generating indices... genindex
[dochtml] [a_tour_of] Merging js index files...
[dochtml] [a_tour_of] ... done (108 js index entries)
[dochtml] [a_tour_of] Writing js search indexes...writing additional pages... search
[dochtml] [a_tour_of] copying images... [ 50%] sin_plot.png
[dochtml] [a_tour_of] copying images... [100%] eigen_plot.png
[dochtml] Insufficient memory for black list

It's not clear where "Insufficient memory for black list" is coming from but probably en/constructions. After that there's no more output from that worker, but the remaining documents get built successfully, and then the build hangs.

So I guess this is a matter of poor error handling. Reminds me I should finish my work to generalize DocTestDispatcher.parallel_dispatch :)

embray commented 7 years ago

comment:4

Well this is definitely strange. I don't know why it would give an "Insufficient memory" error. This error is apparently coming from gc which is trying to initialize about 1 MB, and it definitely should be able to do that...

jdemeyer commented 7 years ago

comment:5

Sorry for the obvious question, but are you sure that you are not actually running out of memory? It's well known that the Sage docbuilder requires a lot of memory.

embray commented 7 years ago

comment:6

Yeah it's a fair question, but I'm pretty sure not. I have 32GB available and got this even when closing most other applications. I watched memory usage during a build and it still did not even come close to using all my system's memory. In any case, it fails almost deterministically to make this one 1 MB allocation, so that seems unlikely. I fear it might be a more subtle bug with gc and/or ecl, possibly having to do with fork (it wouldn't be the first one I've encountered).

jdemeyer commented 7 years ago

comment:7

Yes, 32GB should be enough, unless you are using too many processes. Does it work when building the docs serially? See also #21389.

embray commented 7 years ago

comment:8

Yes, it seems to occur specifically when using multiprocessing (I'm only using 4 processes for the present test, but I will try manually undoing #21389 and seeing if it happens even with 1 process). Still, that all makes me think it's nothing to do with the doc build itself, though I'm failing to reproduce the issue outside that context.

jdemeyer commented 7 years ago

comment:9

Given the commit that you pointed to, it seems that something goes wrong when producing that plot in the context of multiprocessing.

Maybe a simple workaround would be to disallow docbuilding in parallel on Cygwin?

embray commented 7 years ago

comment:10

Well yes, clearly. But that would be shuffling the problem under the rug (which might be fine if I can't find something else out soon...). It's in reading the sources for that file where it crashes. But I can reproduce that plot fine on my own--even running that code directly using multiprocessing.Pool appears to work fine. What's also strange is I can't reproduce the issue if I run a parallel doc build with just that document, or with a small handful of documents.

The code for that plot calls PiecewiseFunction.fourier_series_partial_sum which is using maxima do some symbolic integrations, and that's specifically where the failure must be occurring. I'm running the doc build again with GC_PRINT_VERBOSE_STATS=1 to see if anything turns up...

embray commented 7 years ago

comment:11

There are a few potentially relevant bug fixes in libgc since the current version in Sage. I'm going to try merging #23700 (for starters) and see if that helps.

embray commented 7 years ago

comment:12

Unfortunately simply upgrading to gc 7.6 did not resolve the problem. All it gave me was a little additional error output but nothing immediately helpful.

embray commented 7 years ago

comment:13

Replying to @embray:

Unfortunately simply upgrading to gc 7.6 did not resolve the problem. All it gave me was a little additional error output but nothing immediately helpful.

I take it back. Looking at the error message it gave it actually couldn't have been using gc 7.6. In fact for some reason when I upgraded gc it did not replace the old DLL.

embray commented 7 years ago

comment:14

I got gc 7.6 installed correctly this time, but unfortunately it still did not resolve the problem. The error message output during the build of this document is now slightly more helpful but not by much:

GC Warning: Out of memory - trying to allocate requested amount (8224 bytes)...
Insufficient memory for black list

embray commented 7 years ago

comment:15

I think I may have run across a possible answer to this conundrum, from Hans Boehm himself: http://www.hpl.hp.com/hosted/linux/mail-archives/gc/2006-March/001214.html

Indeed, it seems gc on Cygwin is using sbrk to allocate memory, and this runs into the hard-coded upper limit on the data segment of the Cygwin process (most memory allocation in Cygwin uses mmap which avoids these issues). I'll try rebuilding gc with mmap support enabled (which it is not, by default, on Cygwin, though not for any particular reason apparently). That should probably fix it...

embray commented 7 years ago

Upstream: Not yet reported upstream; Will do shortly.

embray commented 7 years ago

Branch: u/embray/cygwin/ticket-23973

embray commented 7 years ago

Author: Erik Bray

embray commented 7 years ago

Changed keywords from none to windows cygwin ecl gc

embray commented 7 years ago

comment:16

Here's a full summary of the issue so far, and how the attached patch addresses it:

Cygwin processes reserve an address space for Cygwin's "private heap" for the process--most POSIX system calls that manipulate the heap (e.g. brk, sbrk) are manipulating Cygwin's heap.
- This private heap is of a fixed size and reserved at process startup--this allows it to be used freely without interference from Windows, which may be doing whatever it wants with the rest of the process's virtual address space.
- Most of the time this is not a problem--while Cygwin's malloc does make small allocations to the heap, larger allocations are made with mmap and can be anywhere in the Windows process's VM space (not within the private heap).
- However, a custom memory allocator (such as that used by libgc) may still run into problems here. In particular, the default configuration for libgc on Cygwin is to use sbrk only for memory allocations.
- It just happened that running the doc build in the manner I was running it deterministically (but otherwise arbitrarily) caused sbrk to bump into the fixed size limit of the private heap (at least I think--I have not directly confirmed this but it seems likely). This was happening to me in the context of the doc build, but in principle this error could have come up in some other context.
- The obvious solution is to build libgc to use mmap (which generally works well on Cygwin) and not rely on sbrk. Another workaround would be to manually increase the max heap size on the ecl executable (this can be done with the peflags utility) but that's only sweeping the problem under the rug somewhat.
libgc hard-codes a lot of the defaults for various settings depending on platform, such as USE_MMAP for whether or not it should use mmap for allocations (in fact, it seems on Linux this is not the default for some reason). The only way to force this at configure time, it seems, is to (seemingly strangely) pass --enable-munmap to configure.
- However, on Cygwin this does not enable USE_MMAP. It seems like maybe it used to, but now it also sets USE_WINALLOC, a setting whereby gc uses the Windows API directly for managing memory allocation.
- This has a surprising side-effect that it forces --enable-handle-fork to be disabled, as relying on direct handling of virtual memory allocations is inherently incompatible with Cygwin's fork emulation. However, we need this setting in order for ecl to work at all in a multi-process environment where it might be forked (see #22694)
- Therefore we need to force USE_MMAP as well as USE_MUNMAP to be defined by manually passing them in via CFLAGS. Other workarounds are possible of course but this is the simplest.
However there's a reason USE_MMAP was explicitly disabled on Cygwin.
- libgc's memory allocator uses algorithms that require it to be able to keep ranges of memory allocated but unusuable--in particular it remaps an existing mmap with the PROT_NONE flag to do this. This runs afoul of the same problem we had in #22810, where Cygwin does not like this.
- This can be fixed on Cygwin by simply using mprotect(..., PROT_NONE) here, instead of mmap. However, this alone is not sufficient and can lead quickly to failures because:
- The mprotect call must be aligned to the start of the mmap'd region, which libgc calculates using its GC_page_size global variable, which is initialized early on. However, it currently calculates this incorrectly on Cygwin; rather it uses it incorrectly. Although it does get the page size right, what it really needs is a value called the allocation granularity--the alignment for the start address of mmap'd regions. On many systems this is equal to the page size but not on 64-bit windows where the page size is 4k but the allocation granularity is 64k. Regions allocated by Cygwin's mmap are aligned to the latter. The easiest workaround here is to just use the allocation granularity for GC_page_size. There's not much sense in trying to use the actual page size anywhere in this case.

New commits:

188ed32 Enable munmap/mmap on Cygwin, but fix issues with its implementation on Cygwin;

embray commented 7 years ago

Commit: 188ed32

embray commented 7 years ago

Changed upstream from Not yet reported upstream; Will do shortly. to Reported upstream. No feedback yet.

embray commented 7 years ago

Description changed:

--- 
+++ 
@@ -1,3 +1,5 @@
 For a few weeks now (I haven't been able to pursue the issue much due to travel) I've had to take the Cygwin patchbot down due to an issue that didn't occur before where building the docs, particularly with multiple processes, hangs indefinitely.

 For some reason, the problem appears to begin with [this commit](https://github.com/sagemath/sagetrac-mirror/commit/8cb8a4d58dd74922a5b66d6b20f0b3a7507ea611).  Before this commit the docs build no problem.  After this commit it will build several documents and then all the worker processes will just deadlock it seems.
+
+**Upstream PR:** https://github.com/ivmai/bdwgc/pull/187

jdemeyer commented 7 years ago

comment:18

Pardon my ignorance, but are you sure that CYGWIN32 is the correct macro to check?

7ed8c4ca-6d56-4ae9-953a-41e42b4ed313 commented 7 years ago

Changed commit from 188ed32 to 5863356

7ed8c4ca-6d56-4ae9-953a-41e42b4ed313 commented 7 years ago

Branch pushed to git repo; I updated commit sha1. New commits:

5863356 Updated patch level

embray commented 7 years ago

comment:20

Replying to @jdemeyer:

Pardon my ignorance, but are you sure that CYGWIN32 is the correct macro to check?

Yeah that's just the macro they use internally for "cygwin". The "32" part is a misnomer.

jdemeyer commented 7 years ago

comment:21

Isn't __CYGWIN__ the official macro to check for Cygwin?

embray commented 7 years ago

comment:22

It doesn't matter--have a look at the source code for gc. It defines its own macros for platform checks. This is only consistent with their style.

jdemeyer commented 7 years ago

Reviewer: Jeroen Demeyer

jdemeyer commented 7 years ago

comment:23

OK. I trust you on this one.

embray commented 7 years ago

comment:24

It would be helpful to have feedback from upstream in case there's somehow something outright wrong about this approach, but it at least resolved the problem, so...

vbraun commented 7 years ago

Changed branch from u/embray/cygwin/ticket-23973 to 5863356

embray commented 7 years ago

Changed upstream from Reported upstream. No feedback yet. to Fixed upstream, but not in a stable release.

embray commented 7 years ago

Changed commit from 5863356 to none

sagemath / sage

doc build hangs on Cygwin #23973