haskell / haddock

Haskell Documentation Tool
www.haskell.org/haddock/
BSD 2-Clause "Simplified" License

Add `-rtsopts` to the build flags #1565

Closed parsonsmatt closed 2 months ago

parsonsmatt commented 1 year ago

Haddock -rtsopts

Tuning the GC options can yield big performance improvements when compiling with GHC. Since Haddock is doing a compilation, it stands to reason that more resources can similarly improve performance.

The following are runs on the work codebase.

Without any options

This run used no options aside from +RTS -s -RTS to dump output:

1,604,466,578,392 bytes allocated in the heap
 386,407,308,656 bytes copied during GC
  12,648,347,528 bytes maximum residency (79 sample(s))
      59,820,152 bytes maximum slop
           36014 MiB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0     373677 colls,     0 par   169.848s  170.071s     0.0005s    0.0061s
  Gen  1        79 colls,     0 par   138.591s  138.638s     1.7549s    11.6650s

  TASKS: 5 (1 bound, 4 peak workers (4 total), using -N1)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.001s  (  0.000s elapsed)
  MUT     time  551.120s  (619.349s elapsed)
  GC      time  308.439s  (308.709s elapsed)
  EXIT    time    0.004s  (  0.003s elapsed)
  Total   time  859.564s  (928.060s elapsed)

  Alloc rate    2,911,282,964 bytes per MUT second

  Productivity  64.1% of total user, 66.7% of total elapsed

The time output shows a 15:30 runtime.

Without rtsopts, with -j20

My laptop has 20 cores, so running with -j20 gives a decent boost:

1,603,850,165,040 bytes allocated in the heap
 400,668,426,664 bytes copied during GC
  12,212,396,328 bytes maximum residency (76 sample(s))
      58,748,632 bytes maximum slop
           34914 MiB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0     218665 colls, 205375 par   377.040s  169.684s     0.0008s    0.0236s
  Gen  1        76 colls,    66 par   604.065s  78.588s     1.0341s    6.0503s

  Parallel GC work balance: 49.53% (serial 0%, perfect 100%)

  TASKS: 74 (1 bound, 73 peak workers (73 total), using -N20)

  SPARKS: 6814 (952 converted, 0 overflowed, 0 dud, 1609 GC'd, 4253 fizzled)

  INIT    time    0.001s  (  0.000s elapsed)
  MUT     time  1163.585s  (336.423s elapsed)
  GC      time  981.105s  (248.272s elapsed)
  EXIT    time    1.219s  (  0.005s elapsed)
  Total   time  2145.909s  (584.701s elapsed)

  Alloc rate    1,378,369,787 bytes per MUT second

  Productivity  54.2% of total user, 57.5% of total elapsed

Unfortunately, we run afoul of the garbage collector: productivity is down significantly. We're still considerably faster, though. The time output shows a 9:41 runtime, a saving of 5:49, a 37.5% improvement!
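As a quick check on the productivity drop, the figure in the summary appears to be mutator (MUT) time as a share of total time. A minimal Python sketch, assuming that definition and using a made-up helper name `productivity`, recomputes it from the CPU times of the two runs above:

```python
def productivity(mut_seconds: float, total_seconds: float) -> float:
    """Share of total time spent in the mutator, as a percentage."""
    return 100.0 * mut_seconds / total_seconds


# CPU ("user") times copied from the two summaries above.
serial = productivity(551.120, 859.564)
parallel = productivity(1163.585, 2145.909)

print(f"no options: {serial:.1f}%")
print(f"-j20:       {parallel:.1f}%")
```

The two results round to the 64.1% and 54.2% "of total user" figures reported in the summaries, so roughly ten percentage points more of the CPU time goes to the collector in the parallel run.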

With rtsopts

This invocation used +RTS -s -n2m -A128m -qg -RTS. I did not tune this in any way; these are just the first things I try when I'm playing with RTS options for performance.

1,602,827,917,096 bytes allocated in the heap
 212,897,340,048 bytes copied during GC
  12,731,140,720 bytes maximum residency (29 sample(s))
      60,183,952 bytes maximum slop
           36156 MiB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0     11654 colls,     0 par   127.812s  127.867s     0.0110s    0.1192s
  Gen  1        29 colls,     0 par   86.436s  86.458s     2.9813s    11.5267s

  TASKS: 5 (1 bound, 4 peak workers (4 total), using -N1)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.002s  (  0.001s elapsed)
  MUT     time  608.524s  (677.754s elapsed)
  GC      time  214.247s  (214.326s elapsed)
  EXIT    time    0.004s  (  0.010s elapsed)
  Total   time  822.777s  (892.090s elapsed)

  Alloc rate    2,633,959,982 bytes per MUT second

  Productivity  74.0% of total user, 76.0% of total elapsed

The time output is 14:54 this time: a modest 36 seconds saved, only a 3.9% improvement. Still, free speed is free speed.

With rtsopts, with -j20

Let's add -j20 to the program, so it runs with all my cores.
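To see where those 36 seconds come from, it can help to diff the elapsed times of the two serial summaries phase by phase. A small Python sketch using only the numbers reported above (the dictionary names are purely illustrative):

```python
# Elapsed (wall-clock) seconds copied from the two serial summaries.
baseline = {"mut": 619.349, "gc": 308.709, "total": 928.060}
tuned = {"mut": 677.754, "gc": 214.326, "total": 892.090}

# Positive means the tuned run spent longer in that phase.
delta = {phase: round(tuned[phase] - baseline[phase], 3) for phase in baseline}

for phase, seconds in delta.items():
    print(f"{phase:>5}: {seconds:+.3f}s")
```

On these figures, GC elapsed time drops by about 94 seconds while mutator elapsed time rises by about 58 seconds, which leaves the net saving of roughly 36 seconds.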

1,601,853,408,352 bytes allocated in the heap
 130,407,095,128 bytes copied during GC
  12,556,216,824 bytes maximum residency (24 sample(s))
      59,799,048 bytes maximum slop
           37863 MiB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0      1196 colls,     0 par    1.665s  89.695s     0.0750s    0.6673s
  Gen  1        24 colls,     0 par   10.674s  54.364s     2.2652s    11.3638s

  TASKS: 78 (1 bound, 77 peak workers (77 total), using -N20)

  SPARKS: 6814 (254 converted, 0 overflowed, 0 dud, 13 GC'd, 6547 fizzled)

  INIT    time    0.001s  (  0.001s elapsed)
  MUT     time  983.777s  (282.517s elapsed)
  GC      time   12.339s  (144.059s elapsed)
  EXIT    time    1.102s  (  0.003s elapsed)
  Total   time  997.219s  (426.580s elapsed)

  Alloc rate    1,628,268,955 bytes per MUT second

  Productivity  98.7% of total user, 66.2% of total elapsed

The time output reports 7:09. This is a 2:32 improvement over the prior parallel run, another 26% improvement.

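Overall, tuning the runtime parameters a bit gives a 26% improvement on the parallel run (3.9% on the serial one). Combined with -j20, the runtime drops from 15:30 to 7:09, a 53.9% improvement over the baseline.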
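The savings between runs can be recomputed from the four wall-clock readings reported by time. A minimal Python sketch; the helper names `to_seconds` and `improvement` and the run labels are illustrative only:

```python
def to_seconds(mmss: str) -> int:
    """Convert an 'M:SS' wall-clock reading to whole seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


def improvement(before: str, after: str) -> tuple[int, float]:
    """Seconds saved and percentage improvement going from before to after."""
    b, a = to_seconds(before), to_seconds(after)
    return b - a, 100.0 * (b - a) / b


# Wall-clock readings quoted in the write-up.
runs = {
    "no options": "15:30",
    "-j20": "9:41",
    "rtsopts": "14:54",
    "rtsopts + -j20": "7:09",
}

comparisons = [
    ("no options", "-j20"),
    ("no options", "rtsopts"),
    ("-j20", "rtsopts + -j20"),
    ("no options", "rtsopts + -j20"),
]

for before, after in comparisons:
    saved, pct = improvement(runs[before], runs[after])
    print(f"{before} -> {after}: {saved // 60}:{saved % 60:02d} saved, {pct:.1f}%")
```

The last comparison is the combined effect of -j20 and the RTS options against the untuned serial run.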

Kleidukos commented 2 months ago

Hi, thank you for this PR, but Haddock now lives full-time in the GHC repository! Read more at https://discourse.haskell.org/t/haddock-now-lives-in-the-ghc-repository/9576.

Let me know if you feel it is still needed, and I'll migrate it. :)