goldfirere / singletons

Fake dependent types in Haskell using singletons

Memory leak when compiling Data.Singletons.Prelude.List with match flattening #96

Open jstolarek opened 10 years ago

jstolarek commented 10 years ago

I just compiled the latest singletons and compiling Data.Promotion.Prelude.List allocated 1.5GB of RAM (and took ~20 minutes on my old machine). Did this use to happen?

Based on my experience, I guess that adding -O0 will fix the issue. Alternatively, we could try generating smaller splices and see if that changes anything.

jstolarek commented 10 years ago

As expected, with -O0 the module compiles in just a few seconds and uses only 350MB of RAM. Still, I wonder when this started happening.

goldfirere commented 10 years ago

I think something is confused here. It's Data.Singletons.Prelude.List that takes forever, not Data.Promotion.Prelude.List.

And, I'm pretty sure it's the high-arity zip/zipWith/unzip functions that are doing it. I don't think it's the size of the splices, because GHC reports errors quite quickly, implying that singletons code is done running.
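
For reference, the highest-arity of these looks like the ordinary Prelude-style definition below (plain Haskell, as in Data.List, not the generated code); singletons then has to emit a promoted and a singled version of every such definition, so the output grows quickly with arity:

    -- Ordinary zipWith7: seven list arguments matched in one clause,
    -- plus a catch-all.  This is the input singletons has to process.
    zipWith7 :: (a -> b -> c -> d -> e -> f -> g -> h)
             -> [a] -> [b] -> [c] -> [d] -> [e] -> [f] -> [g] -> [h]
    zipWith7 z (a:as) (b:bs) (c:cs) (d:ds) (e:es) (f:fs) (g:gs)
      = z a b c d e f g : zipWith7 z as bs cs ds es fs gs
    zipWith7 _ _ _ _ _ _ _ _ = []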

goldfirere commented 10 years ago

(This problem started when I moved a ton of functions from promotion-only to singletonizing, allowed now because of the work on #30.)

jstolarek commented 10 years ago

Yes, I meant Data.Singletons.Prelude.List.

> (This problem started when I moved a ton of functions from promotion-only to singletonizing, allowed now because of the work on #30.)

My guess was that this is related to #30 but my hypothesis was that this was due to changes in desugaring. But you may be right that this is caused by zips. I'll investigate.

jstolarek commented 9 years ago

I have just been bitten by this one again. This time - in conjunction with #10228 - I was unable to compile singletons on one of my machines. Richard, your suggestion about zips seems correct. I extracted all zip*, zipWith* and unzip* functions into a separate module but I was only able to compile it with -O0. With optimisations turned on I ran out of memory and had to terminate the build. However, compiling Data.Singletons.Prelude.List still uses too much memory and again I had to abort the build. We need to do something about this. A proper solution would be to investigate what's going on during compilation, but I'm not sure how we could do that. Is a profiled build of GHC a solution here?

A not-so-proper solution would be to: a) extract the zips into a separate module; b) disable optimisations both for that module and for the Data.Singletons.Prelude.List module. That does not eliminate the cause, but it at least hides the symptoms. A sketch of what this would look like is below.
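
Concretely, the module-level part of the workaround could look something like this (the module name Zips and the plain definitions are only placeholders; the real module would contain the $(singletons [d| ... |]) splices for the zip family):

    {-# OPTIONS_GHC -O0 #-}   -- turn optimisations off for this module only
    module Zips where         -- placeholder name for the extracted zips module

    import Prelude hiding (zip, zipWith)

    -- Plain definitions stand in for the singletons splices here,
    -- just to show where the pragma goes.
    zipWith :: (a -> b -> c) -> [a] -> [b] -> [c]
    zipWith f (x:xs) (y:ys) = f x y : zipWith f xs ys
    zipWith _ _      _      = []

    zip :: [a] -> [b] -> [(a, b)]
    zip = zipWith (,)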

Richard, your thoughts?

goldfirere commented 9 years ago

I think the workaround you suggest is appropriate for singletons, but we should also isolate this problem in a way that we can report it to GHC. If the zips are the problem, is it possible to collect statistics based on the arity of the zip? For example, what growth function do you see as you increase the arity of the zip?

The fact that optimizations matter tells me that it's not typechecking that's holding us up. What does the Core look like (with -ddump-simpl)? Is this caused by flattening out matches? (For example, does the code passed to GHC -- the TH output -- grow exponentially?) I definitely think we can narrow this down with some effort.
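
For illustration, flattening a multi-clause, multi-pattern definition into nested single-scrutinee cases looks roughly like this (a hand-written sketch, not the actual desugared output):

    -- Two-clause definition as written:
    zipWithOrig :: (a -> b -> c) -> [a] -> [b] -> [c]
    zipWithOrig f (x:xs) (y:ys) = f x y : zipWithOrig f xs ys
    zipWithOrig _ _      _      = []

    -- Roughly what flattening produces: one clause, nested cases, with the
    -- fall-through right-hand side duplicated once per scrutinee.  At arity
    -- 7 there are far more branches to fill in, so the code grows quickly.
    zipWithFlat :: (a -> b -> c) -> [a] -> [b] -> [c]
    zipWithFlat f as bs =
      case as of
        []     -> []
        x : xs -> case bs of
                    []     -> []
                    y : ys -> f x y : zipWithFlat f xs ys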

jstolarek commented 9 years ago

Optimisations matter, but even without them the situation does not look too good. For now I extracted the zips into a separate module and disabled optimisations - the Travis build no longer stalls under GHC 7.10. Will investigate further.

jstolarek commented 9 years ago

I measured compile times with GHC 7.10 and the size of the generated Core for the zipWith and unzip functions. It's hard to say what the growth function for compile times is. It is certainly not linear, but beyond that it could be quadratic, cubic or exponential - with so few data points it's really hard to tell. You can go to WolframAlpha and see the plots with these commands:

-- zipWith
linear fit {2,1.16},{3,1.43},{4,1.75},{5,2.37},{6,3.25},{7,4.55}
quadratic fit {2,1.16},{3,1.43},{4,1.75},{5,2.37},{6,3.25},{7,4.55}
cubic fit {2,1.16},{3,1.43},{4,1.75},{5,2.37},{6,3.25},{7,4.55}
exponential fit {2,1.16},{3,1.43},{4,1.75},{5,2.37},{6,3.25},{7,4.55}

--unzip
linear fit {2,1.3},{3,1.53},{4,2.0},{5,2.53},{6,3.2},{7,4.18}
quadratic fit {2,1.3},{3,1.53},{4,2.0},{5,2.53},{6,3.2},{7,4.18}
cubic fit {2,1.3},{3,1.53},{4,2.0},{5,2.53},{6,3.2},{7,4.18}
exponential fit {2,1.3},{3,1.53},{4,2.0},{5,2.53},{6,3.2},{7,4.18}

With zipWith3 I noticed that using overlapped patterns seems to be faster than listing all possible patterns without overlap - the difference is only around 0.1s but it seems to be pretty consistent.
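
For concreteness, the two styles look roughly like this (the names are made up to avoid clashing with Data.List):

    -- Overlapping: a single catch-all final clause.
    zipWith3Overlap :: (a -> b -> c -> d) -> [a] -> [b] -> [c] -> [d]
    zipWith3Overlap f (x:xs) (y:ys) (z:zs) = f x y z : zipWith3Overlap f xs ys zs
    zipWith3Overlap _ _ _ _ = []

    -- Non-overlapping: every failing shape listed explicitly.
    zipWith3Exhaust :: (a -> b -> c -> d) -> [a] -> [b] -> [c] -> [d]
    zipWith3Exhaust f (x:xs) (y:ys) (z:zs) = f x y z : zipWith3Exhaust f xs ys zs
    zipWith3Exhaust _ []     _      _      = []
    zipWith3Exhaust _ (_:_)  []     _      = []
    zipWith3Exhaust _ (_:_)  (_:_)  []     = []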

Sizes of generated Core are at least quadratic, possibly cubic, but they are not exponential:

-- zipWith
linear fit {2,236790},{3,554413},{4,1097527},{5,1931507},{6,3131749},{7,4766252}
quadratic fit {2,236790},{3,554413},{4,1097527},{5,1931507},{6,3131749},{7,4766252}
cubic fit {2,236790},{3,554413},{4,1097527},{5,1931507},{6,3131749},{7,4766252}
exponential  fit {2,236790},{3,554413},{4,1097527},{5,1931507},{6,3131749},{7,4766252}

-- unzip
linear fit {2,389963},{3,741555},{4,1266789},{5,2001982},{6,3004027},{7,4300641}
quadratic fit {2,389963},{3,741555},{4,1266789},{5,2001982},{6,3004027},{7,4300641}
cubic fit {2,389963},{3,741555},{4,1266789},{5,2001982},{6,3004027},{7,4300641}
exponential fit {2,389963},{3,741555},{4,1266789},{5,2001982},{6,3004027},{7,4300641}

Clearly much of the blame for the long compile times is on our side, but GHC still performs worse than it used to.

Next: same measurements using GHC 7.8.4

jstolarek commented 9 years ago

I did the measurements for GHC 7.8.4 and the results are not at all what I expected.

Compile times (in seconds):

            7.10.1   7.8.4
zipWith     1.16     1.05
zipWith3    1.43     1.30
zipWith4    1.75     1.80
zipWith5    2.37     2.32
zipWith6    3.25     3.20
zipWith7    4.55     4.15

            7.10.1   7.8.4
unzip       1.3      1.2
unzip3      1.53     1.48
unzip4      2.00     1.95
unzip5      2.53     2.45
unzip6      3.2      3.1
unzip7      4.18     4.1

Core size (in bytes):

            7.10.1    7.8.4
zipWith     236790    267391
zipWith3    554413    615616
zipWith4    1097527   1201706
zipWith5    1931507   2090045
zipWith6    3131749   3356677
zipWith7    4766252   5070676

            7.10.1    7.8.4
unzip       389963    440257
unzip3      741555    817391
unzip4      1266789   1373322
unzip5      2001982   2146727
unzip6      3004027   3191546
unzip7      4300641   4536226

Compile times are slightly shorter for GHC 7.8.4. The difference is barely noticeable but seems consistent. The size of the generated Core is larger for GHC 7.8.4, but I think the reason is that the pretty-printer in 7.10 is less verbose, e.g. it prints a ~ b instead of (GHC.Types.~) a b. So the generated Core seems to be of similar size.

Not sure what to do next. My hypothesis at the moment is that GHC 7.10 allocates more memory, and that causes a noticeable slowdown when compiling our large modules (it might be causing swapping or increased GC activity). I thought about measuring that by passing +RTS -s to GHC during compilation. This spits out a lot of stats and I'm not sure which ones are relevant for us:

   2,813,779,864 bytes allocated in the heap
     430,103,464 bytes copied during GC
      78,479,584 bytes maximum residency (10 sample(s))
       2,408,152 bytes maximum slop
             161 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0       324 colls,     0 par    0.900s   0.904s     0.0028s    0.0480s
  Gen  1        10 colls,     0 par    0.447s   0.450s     0.0450s    0.1750s

  TASKS: 4 (1 bound, 3 peak workers (3 total), using -N1)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.001s  (  0.001s elapsed)
  MUT     time    2.634s  (  2.967s elapsed)
  GC      time    1.348s  (  1.354s elapsed)
  EXIT    time    0.031s  (  0.032s elapsed)
  Total   time    4.045s  (  4.354s elapsed)

  Alloc rate    1,068,213,806 bytes per MUT second

  Productivity  66.7% of total user, 61.9% of total elapsed

gc_alloc_block_sync: 0
whitehole_spin: 0
gen[0].sync: 0
gen[1].sync: 0

I think we should look at total bytes allocated, total memory in use and time spent in GC. Thoughts?

jstolarek commented 9 years ago

Some more numbers:

zipWith - bytes allocated in the heap

            7.10.1          7.8.4
zipWith     422,458,240     482,987,584
zipWith3    610,538,208     673,563,488
zipWith4    906,088,024     967,362,248
zipWith5    1,342,030,560   1,395,288,784
zipWith6    1,966,988,472   1,992,980,232
zipWith7    2,815,774,192   2,796,970,504

zipWith - total memory in use (MB)

            7.10.1   7.8.4
zipWith     99       92
zipWith3    100      93
zipWith4    99       126
zipWith5    145      137
zipWith6    134      201
zipWith7    236      203

zipWith - productivity (%)

            7.10.1   7.8.4
zipWith     35.1     37.5
zipWith3    42.5     44.7
zipWith4    49.3     46.1
zipWith5    50.9     52.2
zipWith6    57.4     52.1
zipWith7    57.7     55.8

unzip - bytes allocated in the heap

            7.10.1          7.8.4
unzip       532,047,816     594,717,456
unzip3      750,687,832     802,940,216
unzip4      1,058,652,952   1,099,362,272
unzip5      1,481,594,328   1,503,498,152
unzip6      2,057,421,224   2,034,910,696
unzip7      2,815,043,816   2,736,944,280

unzip - total memory in use (MB)

            7.10.1   7.8.4
unzip       101      93
unzip3      82       93
unzip4      123      133
unzip5      135      140
unzip6      149      197
unzip7      161      199

unzip - productivity (%)

            7.10.1   7.8.4
unzip       40.0     41.6
unzip3      46.5     47.2
unzip4      48.8     48.1
unzip5      54.4     53.6
unzip6      57.5     52.9
unzip7      66.1     57.2

I can't make any sense of them - it seems that 7.10 is doing better than 7.8... In all cases productivity is very poor.

jstolarek commented 9 years ago

As another test, I checked how much call-arity analysis affects compilation time. I compiled the Zips module using GHC 7.10.1 with -O and with -O -fno-call-arity. Compilation time went down from 1m36s to 1m30s, so not by much.

jstolarek commented 9 years ago

Now that match flattening is reverted, the problem is gone. Closing in favour of #96.

goldfirere commented 9 years ago

Just to clarify a few points made above:

I'm going to reopen this ticket to continue to track progress, but only in the version with match flattening. The problem described here is an important part of #113, but not necessarily the only part of #113, so it's worth having a separate ticket.