jstolarek opened this issue 10 years ago
As expected, with `-O0` the module compiles in just a few seconds and uses only 350MB of RAM. Still, I wonder when this started happening.
I think something is confused here. It's `Data.Singletons.Prelude.List` that takes forever, not `Data.Promotion.Prelude.List`. And I'm pretty sure it's the high-arity `zip`/`zipWith`/`unzip` functions that are doing it. I don't think it's the size of the splices, because GHC reports errors quite quickly, implying that the singletons code is done running.
(This problem started when I moved a ton of functions from promotion-only to singletonizing, allowed now because of the work on #30.)
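For reference, this is the shape of the functions in question, shown here as plain Haskell (my illustration, not the singletons TH output); the promoted and singletonized versions generated for a 7-way match are far larger than the source suggests:

```haskell
-- Ordinary zipWith7, as in Data.List: one clause matching on seven lists at
-- once. Singletonizing a definition like this multiplies the pattern match
-- into type families, singleton functions and defunctionalization symbols,
-- which is where the suspected blow-up comes from.
zipWith7 :: (a -> b -> c -> d -> e -> f -> g -> h)
         -> [a] -> [b] -> [c] -> [d] -> [e] -> [f] -> [g] -> [h]
zipWith7 z (a:as) (b:bs) (c:cs) (d:ds) (e:es) (f:fs) (g:gs) =
  z a b c d e f g : zipWith7 z as bs cs ds es fs gs
zipWith7 _ _ _ _ _ _ _ _ = []
```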
Yes, I meant `Data.Singletons.Prelude.List`.
> (This problem started when I moved a ton of functions from promotion-only to singletonizing, allowed now because of the work on #30.)
My guess was also that this is related to #30, but my hypothesis was that it was due to changes in desugaring. You may be right that the zips are the cause, though. I'll investigate.
I have just been bitten by this one again. This time, in conjunction with #10228, I was unable to compile singletons on one of my machines. Richard, your suggestion about the zips seems correct. I extracted all the `zip*`, `zipWith*` and `unzip*` functions into a separate module, but I was only able to compile it with `-O0`. With optimisations turned on I ran out of memory and had to terminate the build. However, compiling `Data.Singletons.Prelude.List` still uses too much memory and again I had to abort the build. We need to do something about this. A proper solution would be to investigate what's going on during compilation, but I'm not sure how we could do that. Is a profiled build of GHC a solution here?

A not-so-proper solution would be to: a) extract the zips into a separate module; b) disable optimisations both for that module and for `Data.Singletons.Prelude.List`. That does not eliminate the causes, but it at least hides the symptoms.
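For concreteness, a minimal sketch of the pragma side of that workaround; the module and function names are illustrative, not the actual singletons code:

```haskell
{-# OPTIONS_GHC -O0 #-}  -- disable optimisations for this module only
module Zips where        -- illustrative name for the extracted-zips module

-- One stand-in for the extracted family; in the real module these would be
-- the TH-generated singletonized zips, zipWiths and unzips.
zip3' :: [a] -> [b] -> [c] -> [(a, b, c)]
zip3' (a:as) (b:bs) (c:cs) = (a, b, c) : zip3' as bs cs
zip3' _ _ _ = []
```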
Richard, your thoughts?
I think the workaround you suggest is appropriate for singletons, but we should also isolate this problem in a way that we can report it to GHC. If the zips are the problem, is it possible to collect statistics based on the arity of the zip? For example, what growth function do you see as you increase the arity of the zip?
The fact that optimizations matter tells me that it's not typechecking that's holding us up. What does the Core look like (with `-ddump-simpl`)? Is this caused by flattening out matches? (For example, does the code passed to GHC -- the TH output -- grow exponentially?) I definitely think we can narrow this down with some effort.
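For anyone reproducing this, both things can be inspected with standard GHC dump flags; a minimal probe module (name hypothetical) might look like:

```haskell
-- -ddump-splices prints the code produced by the TH splices (the "code
-- passed to GHC"), -ddump-simpl prints the simplified Core, and
-- -ddump-to-file writes each dump to a file instead of stderr.
{-# OPTIONS_GHC -ddump-splices -ddump-simpl -ddump-to-file #-}
module ZipProbe where  -- hypothetical: put a single zip/zipWith splice here
```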
Optimisations matter, but even without them the situation does not look too good. For now I have extracted the zips into a separate module and disabled optimisations - the Travis build no longer stalls under GHC 7.10. Will investigate further.
I measured compile times with GHC 7.10 and the size of the generated Core for the `zipWith` and `unzip` functions. It's hard to say what the growth function for compile times is. It is certainly not linear, but beyond that it could be quadratic, cubic or exponential - with so few data points it's really hard to tell. You can go to WolframAlpha and see the plots with these commands:

    -- zipWith
    linear fit {2,1.16},{3,1.43},{4,1.75},{5,2.37},{6,3.25},{7,4.55}
    quadratic fit {2,1.16},{3,1.43},{4,1.75},{5,2.37},{6,3.25},{7,4.55}
    cubic fit {2,1.16},{3,1.43},{4,1.75},{5,2.37},{6,3.25},{7,4.55}
    exponential fit {2,1.16},{3,1.43},{4,1.75},{5,2.37},{6,3.25},{7,4.55}

    -- unzip
    linear fit {2,1.3},{3,1.53},{4,2.0},{5,2.53},{6,3.2},{7,4.18}
    quadratic fit {2,1.3},{3,1.53},{4,2.0},{5,2.53},{6,3.2},{7,4.18}
    cubic fit {2,1.3},{3,1.53},{4,2.0},{5,2.53},{6,3.2},{7,4.18}
    exponential fit {2,1.3},{3,1.53},{4,2.0},{5,2.53},{6,3.2},{7,4.18}
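As a rough cross-check on those fits (my own addition, not part of the measurements above), a small Haskell sketch that estimates the apparent polynomial degree between successive points from the log-log slope:

```haskell
-- For data behaving like c * n^k, the slope of log(time) against log(arity)
-- between successive points approximates k; for exponential growth the slope
-- keeps rising instead of settling. On the zipWith times above the estimates
-- rise from about 0.5 to about 2.2, which could indicate super-polynomial
-- growth, but is also exactly what a constant per-compile startup cost added
-- to a polynomial term looks like, so the data really is inconclusive.
zipWithTimes :: [(Double, Double)]  -- (arity, compile time in seconds)
zipWithTimes = [(2,1.16),(3,1.43),(4,1.75),(5,2.37),(6,3.25),(7,4.55)]

logLogSlopes :: [(Double, Double)] -> [Double]
logLogSlopes pts =
  [ (log y2 - log y1) / (log x2 - log x1)
  | ((x1, y1), (x2, y2)) <- zip pts (tail pts) ]

main :: IO ()
main = mapM_ print (logLogSlopes zipWithTimes)
```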
With `zipWith3` I noticed that using overlapping patterns seems to be faster than listing all possible patterns without overlap - the difference is only around 0.1s, but it seems to be pretty consistent.
The size of the generated Core grows at least quadratically, possibly cubically, but not exponentially:

    -- zipWith
    linear fit {2,236790},{3,554413},{4,1097527},{5,1931507},{6,3131749},{7,4766252}
    quadratic fit {2,236790},{3,554413},{4,1097527},{5,1931507},{6,3131749},{7,4766252}
    cubic fit {2,236790},{3,554413},{4,1097527},{5,1931507},{6,3131749},{7,4766252}
    exponential fit {2,236790},{3,554413},{4,1097527},{5,1931507},{6,3131749},{7,4766252}

    -- unzip
    linear fit {2,389963},{3,741555},{4,1266789},{5,2001982},{6,3004027},{7,4300641}
    quadratic fit {2,389963},{3,741555},{4,1266789},{5,2001982},{6,3004027},{7,4300641}
    cubic fit {2,389963},{3,741555},{4,1266789},{5,2001982},{6,3004027},{7,4300641}
    exponential fit {2,389963},{3,741555},{4,1266789},{5,2001982},{6,3004027},{7,4300641}
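The same log-log slope check applied to the zipWith Core sizes (again my own cross-check, using the helper idea from the sketch above) agrees with that reading:

```haskell
-- The slopes come out between roughly 2.1 and 2.7, i.e. "at least quadratic,
-- possibly cubic", with no sign of the runaway growth an exponential would show.
coreSlopes :: [Double]
coreSlopes =
  [ (log y2 - log y1) / (log x2 - log x1)
  | ((x1, y1), (x2, y2)) <- zip pts (tail pts) ]
  where
    pts :: [(Double, Double)]  -- (arity, Core size in bytes)
    pts = [ (2, 236790), (3, 554413), (4, 1097527)
          , (5, 1931507), (6, 3131749), (7, 4766252) ]
```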
Clearly, much of the fault for the long compile times is on our side. But GHC still performs worse than it used to.

Next: the same measurements using GHC 7.8.4.
I did the measurements for GHC 7.8.4 and the results are not at all what I expected.
Compile times (in seconds):

|          | 7.10.1 | 7.8.4 |
|----------|--------|-------|
| zipWith  | 1.16   | 1.05  |
| zipWith3 | 1.43   | 1.30  |
| zipWith4 | 1.75   | 1.80  |
| zipWith5 | 2.37   | 2.32  |
| zipWith6 | 3.25   | 3.20  |
| zipWith7 | 4.55   | 4.15  |

|        | 7.10.1 | 7.8.4 |
|--------|--------|-------|
| unzip  | 1.30   | 1.20  |
| unzip3 | 1.53   | 1.48  |
| unzip4 | 2.00   | 1.95  |
| unzip5 | 2.53   | 2.45  |
| unzip6 | 3.20   | 3.10  |
| unzip7 | 4.18   | 4.10  |

Core size (in bytes):

|          | 7.10.1  | 7.8.4   |
|----------|---------|---------|
| zipWith  | 236790  | 267391  |
| zipWith3 | 554413  | 615616  |
| zipWith4 | 1097527 | 1201706 |
| zipWith5 | 1931507 | 2090045 |
| zipWith6 | 3131749 | 3356677 |
| zipWith7 | 4766252 | 5070676 |

|        | 7.10.1  | 7.8.4   |
|--------|---------|---------|
| unzip  | 389963  | 440257  |
| unzip3 | 741555  | 817391  |
| unzip4 | 1266789 | 1373322 |
| unzip5 | 2001982 | 2146727 |
| unzip6 | 3004027 | 3191546 |
| unzip7 | 4300641 | 4536226 |

Compile times are slightly shorter with GHC 7.8.4. The difference is barely noticeable but seems consistent. The size of the generated Core is larger for GHC 7.8.4, but I think the reason is that the pretty-printer in 7.10 is less verbose, e.g. it prints `a ~ b` instead of `(GHC.Types.~) a b`. So the generated Core seems to be of similar size.
Not sure what to do next. My hypothesis at the moment is that GHC 7.10 allocates more memory, and that causes a noticeable slowdown when compiling our large modules (it might be causing swapping or increased GC activity). I thought about measuring that by passing `+RTS -s` to GHC during compilation. This spits out a lot of stats and I'm not sure which ones are relevant for us:

       2,813,779,864 bytes allocated in the heap
         430,103,464 bytes copied during GC
          78,479,584 bytes maximum residency (10 sample(s))
           2,408,152 bytes maximum slop
                 161 MB total memory in use (0 MB lost due to fragmentation)

                                       Tot time (elapsed)  Avg pause  Max pause
      Gen  0       324 colls,     0 par    0.900s   0.904s     0.0028s    0.0480s
      Gen  1        10 colls,     0 par    0.447s   0.450s     0.0450s    0.1750s

      TASKS: 4 (1 bound, 3 peak workers (3 total), using -N1)

      SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

      INIT    time    0.001s  (  0.001s elapsed)
      MUT     time    2.634s  (  2.967s elapsed)
      GC      time    1.348s  (  1.354s elapsed)
      EXIT    time    0.031s  (  0.032s elapsed)
      Total   time    4.045s  (  4.354s elapsed)

      Alloc rate    1,068,213,806 bytes per MUT second

      Productivity  66.7% of total user, 61.9% of total elapsed

    gc_alloc_block_sync: 0
    whitehole_spin: 0
    gen[0].sync: 0
    gen[1].sync: 0

I think we should look at total bytes allocated, total memory in use and time spent in GC. Thoughts?
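One observation from the sample output above (my own arithmetic on the MUT and GC times it reports): roughly a third of the compile is spent in garbage collection, which matches the ~66% productivity figure.

```haskell
-- GC share of mutator + GC time, using the figures from the +RTS -s output
-- quoted above (2.634s MUT, 1.348s GC).
gcFraction :: Double
gcFraction = gc / (mut + gc)  -- ~ 0.34
  where
    mut = 2.634
    gc  = 1.348
```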
Some more numbers:

`zipWith` - bytes allocated in the heap

|          | 7.10.1        | 7.8.4         |
|----------|---------------|---------------|
| zipWith  | 422,458,240   | 482,987,584   |
| zipWith3 | 610,538,208   | 673,563,488   |
| zipWith4 | 906,088,024   | 967,362,248   |
| zipWith5 | 1,342,030,560 | 1,395,288,784 |
| zipWith6 | 1,966,988,472 | 1,992,980,232 |
| zipWith7 | 2,815,774,192 | 2,796,970,504 |

`zipWith` - total memory in use (MB)

|          | 7.10.1 | 7.8.4 |
|----------|--------|-------|
| zipWith  | 99     | 92    |
| zipWith3 | 100    | 93    |
| zipWith4 | 99     | 126   |
| zipWith5 | 145    | 137   |
| zipWith6 | 134    | 201   |
| zipWith7 | 236    | 203   |

`zipWith` - productivity (%)

|          | 7.10.1 | 7.8.4 |
|----------|--------|-------|
| zipWith  | 35.1   | 37.5  |
| zipWith3 | 42.5   | 44.7  |
| zipWith4 | 49.3   | 46.1  |
| zipWith5 | 50.9   | 52.2  |
| zipWith6 | 57.4   | 52.1  |
| zipWith7 | 57.7   | 55.8  |

`unzip` - bytes allocated in the heap

|        | 7.10.1        | 7.8.4         |
|--------|---------------|---------------|
| unzip  | 532,047,816   | 594,717,456   |
| unzip3 | 750,687,832   | 802,940,216   |
| unzip4 | 1,058,652,952 | 1,099,362,272 |
| unzip5 | 1,481,594,328 | 1,503,498,152 |
| unzip6 | 2,057,421,224 | 2,034,910,696 |
| unzip7 | 2,815,043,816 | 2,736,944,280 |

`unzip` - total memory in use (MB)

|        | 7.10.1 | 7.8.4 |
|--------|--------|-------|
| unzip  | 101    | 93    |
| unzip3 | 82     | 93    |
| unzip4 | 123    | 133   |
| unzip5 | 135    | 140   |
| unzip6 | 149    | 197   |
| unzip7 | 161    | 199   |

`unzip` - productivity (%)

|        | 7.10.1 | 7.8.4 |
|--------|--------|-------|
| unzip  | 40.0   | 41.6  |
| unzip3 | 46.5   | 47.2  |
| unzip4 | 48.8   | 48.1  |
| unzip5 | 54.4   | 53.6  |
| unzip6 | 57.5   | 52.9  |
| unzip7 | 66.1   | 57.2  |

I can't make any sense of them - it seems that 7.10 is doing better than 7.8... In all cases productivity is very poor.
As another test, I checked how much the call arity analysis affects compilation time. I compiled the Zips module using GHC 7.10.1 with `-O` and with `-O -fno-call-arity`. Compilation time went down from 1m36s to 1m30s, so not much.
Now that match flattening is reverted the problem is gone. Closing in favour of #96.
Just to clarify a few points made above:
I'm going to reopen this ticket to continue to track progress, but only in the version with match flattening. The problem described here is an important part of #113, but not necessarily the only part of #113, so it's worth having a separate ticket.
I just compiled the latest singletons, and compiling `Data.Singletons.Prelude.List` allocated 1.5GB of RAM (and took ~20 minutes on my old machine). Did this use to happen? Based on my experience I guess that adding `-O0` will fix the issue. Alternatively, we could try to have smaller splices and see if this changes anything.