Closed embray closed 7 years ago
Does this happen after a failure or during a successful build?
Description changed:
---
+++
@@ -1,3 +1,3 @@
For a few weeks now (I haven't been able to pursue the issue much due to travel) I've had to take the Cygwin patchbot down due to an issue that didn't occur before where building the docs, particularly with multiple processes, hangs indefinitely.
-I'm still trying to track down the commit where this issue began. But in the meantime testing on Cygwin has been stunted due to this. I hope to get to the bottom of it ASAP.
+For some reason, the problem appears to begin with [this commit](https://github.com/sagemath/sagetrac-mirror/commit/8cb8a4d58dd74922a5b66d6b20f0b3a7507ea611). Before this commit the docs build no problem. After this commit it will build several documents and then all the worker processes will just deadlock it seems.
Hard to say, but now that I take a careful look at the doc build log I think there might be an unhandled failure of some sort. I just tried a new build from scratch and here's what I found: The document `en/constructions`, that was edited in that commit, starts building (again, this is a parallel build so the log is interleaved):
[dochtml] Building en/constructions.
[dochtml]
[dochtml] [construct] loading pickled environment... not yet created
[dochtml] [construct] Compiling the master document
[dochtml] [construct] building [mo]: targets for 0 po files that are out of date
[dochtml] [construct] building [html]: targets for 16 source files that are out of date
[dochtml] [construct] updating environment: 16 added, 0 changed, 0 removed
[dochtml] [construct] reading sources... [ 6%] algebraic_geometry
[dochtml] [a_tour_of] pickling environment... done
[dochtml] [a_tour_of] checking consistency... done
[dochtml] [a_tour_of] preparing documents... done
[dochtml] [a_tour_of] writing output... [100%] index
[dochtml] [construct] reading sources... [ 12%] calculus
[dochtml] [a_tour_of] generating indices... genindex
[dochtml] [a_tour_of] Merging js index files...
[dochtml] [a_tour_of] ... done (108 js index entries)
[dochtml] [a_tour_of] Writing js search indexes...writing additional pages... search
[dochtml] [a_tour_of] copying images... [ 50%] sin_plot.png
[dochtml] [a_tour_of] copying images... [100%] eigen_plot.png
[dochtml] Insufficient memory for black list
It's not clear where "Insufficient memory for black list" is coming from but probably `en/constructions`. After that there's no more output from that worker, but the remaining documents get built successfully, and then the build hangs.
So I guess this is a matter of poor error handling. Reminds me I should finish my work to generalize `DocTestDispatcher.parallel_dispatch` :)
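The hang described here is what you get when a parent keeps waiting on a worker that has already died. As a minimal sketch of the general idea (this is not the Sage docbuilder's code; `run_workers`, `healthy_worker`, and `crashing_worker` are hypothetical names, and the `fork` start method is an assumption), a parent can inspect each worker's exit code instead of waiting indefinitely:

```python
import multiprocessing as mp
import os


def healthy_worker():
    """Stand-in for a document build that finishes normally."""


def crashing_worker():
    """Stand-in for a worker killed by a fatal error in a C library."""
    os._exit(3)  # exit at once, without reporting anything to the parent


def run_workers(targets, timeout=30):
    """Run each target in its own process; return (index, reason) per failure."""
    ctx = mp.get_context("fork")  # assumption: a platform that provides fork()
    procs = [ctx.Process(target=t) for t in targets]
    for p in procs:
        p.start()
    failures = []
    for i, p in enumerate(procs):
        p.join(timeout)
        if p.exitcode is None:  # still running after the timeout
            p.terminate()
            p.join()
            failures.append((i, "timeout"))
        elif p.exitcode != 0:  # died or exited with an error status
            failures.append((i, p.exitcode))
    return failures
```

Calling `run_workers([healthy_worker, crashing_worker])` reports the second worker as failed with exit code 3 rather than blocking on it.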
Well this is definitely strange. I don't know why it would give an "Insufficient memory" error. This error is apparently coming from gc which is trying to initialize about 1 MB, and it definitely should be able to do that...
Sorry for the obvious question, but are you sure that you are not actually running out of memory? It's well known that the Sage docbuilder requires a lot of memory.
Yeah it's a fair question, but I'm pretty sure not. I have 32GB available and got this even when closing most other applications. I watched memory usage during a build and it still did not even come close to using all my system's memory. In any case, it fails almost deterministically to make this one 1 MB allocation, so that seems unlikely. I fear it might be a more subtle bug with gc and/or ecl, possibly having to do with fork (it wouldn't be the first one I've encountered).
Yes, 32GB should be enough, unless you are using too many processes. Does it work when building the docs serially? See also #21389.
Yes, it seems to occur specifically when using multiprocessing (I'm only using 4 processes for the present test, but I will try manually undoing #21389 and seeing if it happens even with 1 process). Still, that all makes me think it's nothing to do with the doc build itself, though I'm failing to reproduce the issue outside that context.
Given the commit that you pointed to, it seems that something goes wrong when producing that plot in the context of multiprocessing.
Maybe a simple workaround would be to disallow docbuilding in parallel on Cygwin?
Well yes, clearly. But that would be shuffling the problem under the rug (which might be fine if I can't find something else out soon...). It's in reading the sources for that file where it crashes. But I can reproduce that plot fine on my own--even running that code directly using `multiprocessing.Pool` appears to work fine. What's also strange is I can't reproduce the issue if I run a parallel doc build with just that document, or with a small handful of documents.
The code for that plot calls `PiecewiseFunction.fourier_series_partial_sum`, which uses maxima to do some symbolic integrations, and that's specifically where the failure must be occurring. I'm running the doc build again with `GC_PRINT_VERBOSE_STATS=1` to see if anything turns up...
There are a few potentially relevant bug fixes in libgc since the current version in Sage. I'm going to try merging #23700 (for starters) and see if that helps.
Unfortunately simply upgrading to gc 7.6 did not resolve the problem. All it gave me was a little additional error output but nothing immediately helpful.
Replying to @embray:
> Unfortunately simply upgrading to gc 7.6 did not resolve the problem. All it gave me was a little additional error output but nothing immediately helpful.
I take it back. Looking at the error message it gave it actually couldn't have been using gc 7.6. In fact for some reason when I upgraded gc it did not replace the old DLL.
I got gc 7.6 installed correctly this time, but unfortunately it still did not resolve the problem. The error message output during the build of this document is now slightly more helpful but not by much:
GC Warning: Out of memory - trying to allocate requested amount (8224 bytes)...
Insufficient memory for black list
I think I may have run across a possible answer to this conundrum, from Hans Boehm himself: http://www.hpl.hp.com/hosted/linux/mail-archives/gc/2006-March/001214.html
Indeed, it seems gc on Cygwin is using sbrk to allocate memory, and this runs into the hard-coded upper limit on the data segment of the Cygwin process (most memory allocation in Cygwin uses mmap which avoids these issues). I'll try rebuilding gc with mmap support enabled (which it is not, by default, on Cygwin, though not for any particular reason apparently). That should probably fix it...
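For readers unfamiliar with the distinction, an `mmap`-style allocation asks the OS for a fresh anonymous mapping instead of growing the program break the way `sbrk` does. The following is only an illustration using Python's standard `mmap` module, not libgc's allocator; `alloc_anonymous` is a hypothetical helper name:

```python
import mmap


def alloc_anonymous(size):
    """Return a zero-filled, read/write anonymous mapping of `size` bytes."""
    # fileno=-1 requests memory that is not backed by any file
    return mmap.mmap(-1, size)


block = alloc_anonymous(1 << 20)  # 1 MiB, about the size gc failed to get
block[:5] = b"hello"              # the mapping is ordinary writable memory
```

The mapping can be released independently of any other allocation with `block.close()`, which is roughly what `munmap` provides at the C level.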
Upstream: Not yet reported upstream; Will do shortly.
Branch: u/embray/cygwin/ticket-23973
Author: Erik Bray
Changed keywords from none to windows cygwin ecl gc
Here's a full summary of the issue so far, and how the attached patch addresses it:

- [...] (`brk`, `sbrk`) are manipulating Cygwin's heap.
- [...] `malloc` does make small allocations to the heap, larger allocations are made with `mmap` and can be anywhere in the Windows process's VM space (not within the private heap).
- [...] `sbrk` only for memory allocations.
- [...] `sbrk` to bump into the fixed size limit of the private heap (at least I think--I have not directly confirmed this but it seems likely). This was happening to me in the context of the doc build, but in principle this error could have come up in some other context.
- [...] `mmap` (which generally works well on Cygwin) and not rely on `sbrk`. Another workaround would be to manually increase the max heap size on the `ecl` executable (this can be done with the `peflags` utility) but that's only sweeping the problem under the rug somewhat.
- [...] `USE_MMAP` for whether or not it should use `mmap` for allocations (in fact, it seems on Linux this is not the default for some reason). The only way to force this at configure time, it seems, is to (seemingly strangely) pass `--enable-munmap` to `configure`.
- [...] `USE_MMAP`. It seems like maybe it used to, but now it also sets `USE_WINALLOC`, a setting whereby gc uses the Windows API directly for managing memory allocation.
- [...] `--enable-handle-fork` to be disabled, as relying on direct handling of virtual memory allocations is inherently incompatible with Cygwin's fork emulation. However, we need this setting in order for ecl to work at all in a multi-process environment where it might be forked (see #22694).
- [...] `USE_MMAP` as well as `USE_MUNMAP` to be defined by manually passing them in via `CFLAGS`. Other workarounds are possible of course but this is the simplest.
- [...] `USE_MMAP` was explicitly disabled on Cygwin.
- [...] `mmap` with the `PROT_NONE` flag to do this. This runs afoul of the same problem we had in #22810, where Cygwin does not like this.
- [...] `mprotect(..., PROT_NONE)` here, instead of `mmap`. However, this alone is not sufficient and can lead quickly to failures because:
  - [...] `mprotect` call must be aligned to the start of the `mmap`'d region, which libgc calculates using its `GC_page_size` global variable, which is initialized early on. However, it currently calculates this incorrectly on Cygwin; rather it uses it incorrectly. Although it does get the page size right, what it really needs is a value called the allocation granularity--the alignment for the start address of `mmap`'d regions. On many systems this is equal to the page size but not on 64-bit Windows where the page size is 4k but the allocation granularity is 64k. Regions allocated by Cygwin's `mmap` are aligned to the latter. The easiest workaround here is to just use the allocation granularity for `GC_page_size`. There's not much sense in trying to use the actual page size anywhere in this case.

New commits:
188ed32 | Enable munmap/mmap on Cygwin, but fix issues with its implementation on Cygwin; |
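To make the page size versus allocation granularity point concrete, here is a small arithmetic sketch. It is not taken from the patch: `align_down` is a hypothetical helper and `addr` is a made-up address; only the 4k and 64k figures come from the summary above.

```python
import mmap


def align_down(addr, granularity):
    """Round addr down to a multiple of granularity (must be a power of two)."""
    return addr & ~(granularity - 1)


PAGE_SIZE = 0x1000      # 4k page size, as on 64-bit Windows
GRANULARITY = 0x10000   # 64k allocation granularity

addr = 0x12345000  # hypothetical address inside an mmap'd region

page_start = align_down(addr, PAGE_SIZE)      # 0x12345000: page-aligned only
region_start = align_down(addr, GRANULARITY)  # 0x12340000: a valid region start

# The values for the platform actually running this script:
print(mmap.PAGESIZE, mmap.ALLOCATIONGRANULARITY)
```

Aligning with the page size leaves `addr` where it is, which is not a boundary that Cygwin's `mmap` would have used for the start of a region; aligning with the allocation granularity gives one that is.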
Changed upstream from Not yet reported upstream; Will do shortly. to Reported upstream. No feedback yet.
Description changed:
---
+++
@@ -1,3 +1,5 @@
For a few weeks now (I haven't been able to pursue the issue much due to travel) I've had to take the Cygwin patchbot down due to an issue that didn't occur before where building the docs, particularly with multiple processes, hangs indefinitely.
For some reason, the problem appears to begin with [this commit](https://github.com/sagemath/sagetrac-mirror/commit/8cb8a4d58dd74922a5b66d6b20f0b3a7507ea611). Before this commit the docs build no problem. After this commit it will build several documents and then all the worker processes will just deadlock it seems.
+
+**Upstream PR:** https://github.com/ivmai/bdwgc/pull/187
Pardon my ignorance, but are you sure that `CYGWIN32` is the correct macro to check?
Branch pushed to git repo; I updated commit sha1. New commits:
5863356 | Updated patch level |
Replying to @jdemeyer:
> Pardon my ignorance, but are you sure that `CYGWIN32` is the correct macro to check?
Yeah that's just the macro they use internally for "cygwin". The "32" part is a misnomer.
Isn't `__CYGWIN__` the official macro to check for Cygwin?
It doesn't matter--have a look at the source code for gc. It defines its own macros for platform checks. This is only consistent with their style.
Reviewer: Jeroen Demeyer
OK. I trust you on this one.
It would be helpful to have feedback from upstream in case there's somehow something outright wrong about this approach, but it at least resolved the problem, so...
Changed branch from u/embray/cygwin/ticket-23973 to 5863356
Changed upstream from Reported upstream. No feedback yet. to Fixed upstream, but not in a stable release.
For a few weeks now (I haven't been able to pursue the issue much due to travel) I've had to take the Cygwin patchbot down due to an issue that didn't occur before where building the docs, particularly with multiple processes, hangs indefinitely.
For some reason, the problem appears to begin with this commit. Before this commit the docs build no problem. After this commit it will build several documents and then all the worker processes will just deadlock it seems.
Upstream PR: https://github.com/ivmai/bdwgc/pull/187
Upstream: Fixed upstream, but not in a stable release.
Component: porting: Cygwin
Keywords: windows cygwin ecl gc
Author: Erik Bray
Branch: 5863356
Reviewer: Jeroen Demeyer
Issue created by migration from https://trac.sagemath.org/ticket/23973