goldfirere / singletons

Fake dependent types in Haskell using singletons

Memory leak when compiling Data.Singletons.Prelude.List with match flattening #96

Open jstolarek opened 10 years ago

jstolarek commented 10 years ago

I just compiled the latest singletons and compiling Data.Promotion.Prelude.List allocated 1.5GB of RAM (and took ~20 minutes on my old machine). Did this use to happen?

Based on my experience, I guess that adding -O0 will fix the issue. Alternatively, we could try generating smaller splices and see if that changes anything.

jstolarek commented 10 years ago

As expected, with -O0 the module compiles in just a few seconds and uses only 350MB of RAM. Still, I wonder when this started happening.

goldfirere commented 10 years ago

I think something is confused here. It's Data.Singletons.Prelude.List that takes forever, not Data.Promotion.Prelude.List.

And, I'm pretty sure it's the high-arity zip/zipWith/unzip functions that are doing it. I don't think it's the size of the splices, because GHC reports errors quite quickly, implying that singletons code is done running.
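
For reference, the highest-arity of these looks like the ordinary Prelude-style definition below (plain Haskell, as in Data.List, not the generated code); singletons then has to emit a promoted and a singled version of every such definition, so the output grows quickly with arity:

    -- Ordinary zipWith7: seven list arguments matched in one clause,
    -- plus a catch-all.  This is the input singletons has to process.
    zipWith7 :: (a -> b -> c -> d -> e -> f -> g -> h)
             -> [a] -> [b] -> [c] -> [d] -> [e] -> [f] -> [g] -> [h]
    zipWith7 z (a:as) (b:bs) (c:cs) (d:ds) (e:es) (f:fs) (g:gs)
      = z a b c d e f g : zipWith7 z as bs cs ds es fs gs
    zipWith7 _ _ _ _ _ _ _ _ = []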

goldfirere commented 10 years ago

(This problem started when I moved a ton of functions from promotion-only to singletonizing, allowed now because of the work on #30.)

jstolarek commented 10 years ago

Yes, I meant Data.Singletons.Prelude.List.

> (This problem started when I moved a ton of functions from promotion-only to singletonizing, allowed now because of the work on #30.)

My guess was that this is related to #30 but my hypothesis was that this was due to changes in desugaring. But you may be right that this is caused by zips. I'll investigate.

jstolarek commented 9 years ago

I have just been bitten by this one again. This time - in conjunction with #10228 - I was unable to compile singletons on one of my machines. Richard, your suggestion about zips seems correct. I extracted all zip*, zipWith* and unzip* functions into a separate module but I was only able to compile it with -O0. With optimisations turned on I ran out of memory and had to terminate the build. However, compiling Data.Singletons.Prelude.List still uses too much memory and again I had to abort the build. We need to do something about this. A proper solution would be to investigate what's going on during compilation, but I'm not sure how we could do that. Is a profiled build of GHC a solution here?

A not-so-proper solution would be to: a) extract the zips into a separate module; b) disable optimisations both for that module and for the Data.Singletons.Prelude.List module. That does not eliminate the cause, but it at least hides the symptoms. A sketch of what this would look like is below.
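
Concretely, the module-level part of the workaround could look something like this (the module name Zips and the plain definitions are only placeholders; the real module would contain the $(singletons [d| ... |]) splices for the zip family):

    {-# OPTIONS_GHC -O0 #-}   -- turn optimisations off for this module only
    module Zips where         -- placeholder name for the extracted zips module

    import Prelude hiding (zip, zipWith)

    -- Plain definitions stand in for the singletons splices here,
    -- just to show where the pragma goes.
    zipWith :: (a -> b -> c) -> [a] -> [b] -> [c]
    zipWith f (x:xs) (y:ys) = f x y : zipWith f xs ys
    zipWith _ _      _      = []

    zip :: [a] -> [b] -> [(a, b)]
    zip = zipWith (,)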

Richard, your thoughts?

goldfirere commented 9 years ago

I think the workaround you suggest is appropriate for singletons, but we should also isolate this problem in a way that we can report it to GHC. If the zips are the problem, is it possible to collect statistics based on the arity of the zip? For example, what growth function do you see as you increase the arity of the zip?

The fact that optimizations matter tells me that it's not typechecking that's holding us up. What does the Core look like (with -ddump-simpl)? Is this caused by flattening out matches? (For example, does the code passed to GHC -- the TH output -- grow exponentially?) I definitely think we can narrow this down with some effort.
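
For illustration, flattening a multi-clause, multi-pattern definition into nested single-scrutinee cases looks roughly like this (a hand-written sketch, not the actual desugared output):

    -- Two-clause definition as written:
    zipWithOrig :: (a -> b -> c) -> [a] -> [b] -> [c]
    zipWithOrig f (x:xs) (y:ys) = f x y : zipWithOrig f xs ys
    zipWithOrig _ _      _      = []

    -- Roughly what flattening produces: one clause, nested cases, with the
    -- fall-through right-hand side duplicated once per scrutinee.  At arity
    -- 7 there are far more branches to fill in, so the code grows quickly.
    zipWithFlat :: (a -> b -> c) -> [a] -> [b] -> [c]
    zipWithFlat f as bs =
      case as of
        []     -> []
        x : xs -> case bs of
                    []     -> []
                    y : ys -> f x y : zipWithFlat f xs ys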

jstolarek commented 9 years ago

Optimisations matter, but even without them the situation does not look too good. For now I extracted the zips into a separate module and disabled optimisations - the Travis build no longer stalls under GHC 7.10. Will investigate further.

jstolarek commented 9 years ago

I measured compile times with GHC 7.10 and the size of the generated Core for the zipWith and unzip functions. It's hard to say what the growth function for compile times is. It is certainly not linear, but beyond that it could be quadratic, cubic or exponential - with so few data points it's really hard to tell. You can go to WolframAlpha and see the plots with these commands:

-- zipWith
linear fit {2,1.16},{3,1.43},{4,1.75},{5,2.37},{6,3.25},{7,4.55}
quadratic fit {2,1.16},{3,1.43},{4,1.75},{5,2.37},{6,3.25},{7,4.55}
cubic fit {2,1.16},{3,1.43},{4,1.75},{5,2.37},{6,3.25},{7,4.55}
exponential fit {2,1.16},{3,1.43},{4,1.75},{5,2.37},{6,3.25},{7,4.55}

--unzip
linear fit {2,1.3},{3,1.53},{4,2.0},{5,2.53},{6,3.2},{7,4.18}
quadratic fit {2,1.3},{3,1.53},{4,2.0},{5,2.53},{6,3.2},{7,4.18}
cubic fit {2,1.3},{3,1.53},{4,2.0},{5,2.53},{6,3.2},{7,4.18}
exponential fit {2,1.3},{3,1.53},{4,2.0},{5,2.53},{6,3.2},{7,4.18}

With zipWith3 I noticed that using overlapped patterns seems to be faster than listing all possible patterns without overlap - the difference is only around 0.1s but it seems to be pretty consistent.
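
For concreteness, the two styles look roughly like this (the names are made up to avoid clashing with Data.List):

    -- Overlapping: a single catch-all final clause.
    zipWith3Overlap :: (a -> b -> c -> d) -> [a] -> [b] -> [c] -> [d]
    zipWith3Overlap f (x:xs) (y:ys) (z:zs) = f x y z : zipWith3Overlap f xs ys zs
    zipWith3Overlap _ _ _ _ = []

    -- Non-overlapping: every failing shape listed explicitly.
    zipWith3Exhaust :: (a -> b -> c -> d) -> [a] -> [b] -> [c] -> [d]
    zipWith3Exhaust f (x:xs) (y:ys) (z:zs) = f x y z : zipWith3Exhaust f xs ys zs
    zipWith3Exhaust _ []     _      _      = []
    zipWith3Exhaust _ (_:_)  []     _      = []
    zipWith3Exhaust _ (_:_)  (_:_)  []     = []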

Sizes of generated Core are at least quadratic, possibly cubic, but they are not exponential:

-- zipWith
linear fit {2,236790},{3,554413},{4,1097527},{5,1931507},{6,3131749},{7,4766252}
quadratic fit {2,236790},{3,554413},{4,1097527},{5,1931507},{6,3131749},{7,4766252}
cubic fit {2,236790},{3,554413},{4,1097527},{5,1931507},{6,3131749},{7,4766252}
exponential  fit {2,236790},{3,554413},{4,1097527},{5,1931507},{6,3131749},{7,4766252}

-- unzip
linear fit {2,389963},{3,741555},{4,1266789},{5,2001982},{6,3004027},{7,4300641}
quadratic fit {2,389963},{3,741555},{4,1266789},{5,2001982},{6,3004027},{7,4300641}
cubic fit {2,389963},{3,741555},{4,1266789},{5,2001982},{6,3004027},{7,4300641}
exponential fit {2,389963},{3,741555},{4,1266789},{5,2001982},{6,3004027},{7,4300641}

Clearly much of the blame for the long compile times is on our side, but GHC still performs worse than it used to.

Next: same measurements using GHC 7.8.4

jstolarek commented 9 years ago

I did the measurements for GHC 7.8.4 and the results are not at all what I expected.

Compile times (in seconds):

            7.10.1   7.8.4
zipWith     1.16     1.05
zipWith3    1.43     1.30
zipWith4    1.75     1.80
zipWith5    2.37     2.32
zipWith6    3.25     3.20
zipWith7    4.55     4.15

            7.10.1   7.8.4
unzip       1.3      1.2
unzip3      1.53     1.48
unzip4      2.00     1.95
unzip5      2.53     2.45
unzip6      3.2      3.1
unzip7      4.18     4.1

Core size (in bytes):

            7.10.1    7.8.4
zipWith     236790    267391
zipWith3    554413    615616
zipWith4    1097527   1201706
zipWith5    1931507   2090045
zipWith6    3131749   3356677
zipWith7    4766252   5070676

            7.10.1    7.8.4
unzip       389963    440257
unzip3      741555    817391
unzip4      1266789   1373322
unzip5      2001982   2146727
unzip6      3004027   3191546
unzip7      4300641   4536226

Compile times are slightly shorter for GHC 7.8.4. The difference is barely noticeable but seems consistent. The size of the generated Core is larger for GHC 7.8.4, but I think the reason is that the pretty-printer in 7.10 is less verbose, e.g. it prints a ~ b instead of (GHC.Types.~) a b. So the generated Core seems to be of similar size.

Not sure what to do next. My hypothesis at the moment is that GHC 7.10 allocates more memory, and that causes a noticeable slowdown when compiling our large modules (it might be causing swapping or increased GC activity). I thought about measuring that by passing +RTS -s to GHC during compilation. This spits out a lot of stats and I'm not sure which ones are relevant for us:

   2,813,779,864 bytes allocated in the heap
     430,103,464 bytes copied during GC
      78,479,584 bytes maximum residency (10 sample(s))
       2,408,152 bytes maximum slop
             161 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0       324 colls,     0 par    0.900s   0.904s     0.0028s    0.0480s
  Gen  1        10 colls,     0 par    0.447s   0.450s     0.0450s    0.1750s

  TASKS: 4 (1 bound, 3 peak workers (3 total), using -N1)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.001s  (  0.001s elapsed)
  MUT     time    2.634s  (  2.967s elapsed)
  GC      time    1.348s  (  1.354s elapsed)
  EXIT    time    0.031s  (  0.032s elapsed)
  Total   time    4.045s  (  4.354s elapsed)

  Alloc rate    1,068,213,806 bytes per MUT second

  Productivity  66.7% of total user, 61.9% of total elapsed

gc_alloc_block_sync: 0
whitehole_spin: 0
gen[0].sync: 0
gen[1].sync: 0

I think we should look at total bytes allocated, total memory in use and time spent in GC. Thoughts?

jstolarek commented 9 years ago

Some more numbers:

zipWith - bytes allocated in the heap

            7.10.1          7.8.4
zipWith     422,458,240     482,987,584
zipWith3    610,538,208     673,563,488
zipWith4    906,088,024     967,362,248
zipWith5    1,342,030,560   1,395,288,784
zipWith6    1,966,988,472   1,992,980,232
zipWith7    2,815,774,192   2,796,970,504

zipWith - total memory in use (MB)

            7.10.1   7.8.4
zipWith     99       92
zipWith3    100      93
zipWith4    99       126
zipWith5    145      137
zipWith6    134      201
zipWith7    236      203

zipWith - productivity (%)

            7.10.1   7.8.4
zipWith     35.1     37.5
zipWith3    42.5     44.7
zipWith4    49.3     46.1
zipWith5    50.9     52.2
zipWith6    57.4     52.1
zipWith7    57.7     55.8

unzip - bytes allocated in the heap

            7.10.1          7.8.4
unzip       532,047,816     594,717,456
unzip3      750,687,832     802,940,216
unzip4      1,058,652,952   1,099,362,272
unzip5      1,481,594,328   1,503,498,152
unzip6      2,057,421,224   2,034,910,696
unzip7      2,815,043,816   2,736,944,280

unzip - total memory in use (MB)

            7.10.1   7.8.4
unzip       101      93
unzip3      82       93
unzip4      123      133
unzip5      135      140
unzip6      149      197
unzip7      161      199

unzip - productivity (%)

            7.10.1   7.8.4
unzip       40.0     41.6
unzip3      46.5     47.2
unzip4      48.8     48.1
unzip5      54.4     53.6
unzip6      57.5     52.9
unzip7      66.1     57.2

I can't make any sense of them - it seems that 7.10 is doing better than 7.8... In all cases productivity is very poor.

jstolarek commented 9 years ago

As another test, I checked how much call-arity analysis affects compilation time. I compiled the Zips module using GHC 7.10.1 with -O and with -O -fno-call-arity. Compilation time went down from 1m36s to 1m30s, so not by much.

jstolarek commented 9 years ago

Now that match flattening is reverted, the problem is gone. Closing in favour of #96.

goldfirere commented 9 years ago

Just to clarify a few points made above:

I'm going to reopen this ticket to continue to track progress, but only in the version with match flattening. The problem described here is an important part of #113, but not necessarily the only part of #113, so it's worth having a separate ticket.