safesparrow / fsharp

The F# compiler, F# core library, F# language service, and F# tooling integration for Visual Studio
https://dotnet.microsoft.com/languages/fsharp
MIT License

Investigate parallelisation of 3 optimisation phases (release-only) #28

Open safesparrow opened 1 year ago

safesparrow commented 1 year ago

Not type-checking related, but I'm doing a short investigation into the following:

In Release mode, there are 3 main optimisation phases/rounds. For a given file, phase 1 only uses the phase-1 results of previous files, phase 2 only uses the phase-2 results of previous files, and so on.

This means that we can easily increase parallelisation.
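To illustrate (a sketch, not the compiler's actual scheduling code, and assuming each phase of a file also consumes the previous phase's output for that same file): treat each (file, phase) pair as a node, where node (i, p) depends on (i-1, p) and (i, p-1). All nodes on the same anti-diagonal are then independent, so with N files and P phases the critical path shrinks from N * P steps to N + P - 1 "wavefront" levels.

```python
def wavefront_levels(n_files: int, n_phases: int) -> list[list[tuple[int, int]]]:
    """Group (file, phase) nodes into levels that can run in parallel.

    Node (i, p) depends on (i - 1, p) and (i, p - 1), so every node on
    the same anti-diagonal i + p = level is independent of the others.
    """
    levels = []
    for level in range(n_files + n_phases - 1):
        nodes = [(i, level - i)
                 for i in range(n_files)
                 if 0 <= level - i < n_phases]
        levels.append(nodes)
    return levels

# With 5 files and 3 phases, the 15 nodes collapse into 7 sequential levels,
# with at most min(5, 3) = 3 nodes running in parallel at once.
levels = wavefront_levels(5, 3)
print(len(levels))
print(max(len(lvl) for lvl in levels))
```

Note that the achievable parallelism is capped by the number of phases (3 here), which is why the gains below are modest rather than proportional to core count.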

safesparrow commented 1 year ago

I did a PoC, but it seems to break on FCS. I'm not sure why: either there is a dependency between phases I'm not aware of, or there is a bug in scheduling the items/propagating state. The complaining node is file index 7, phase 3:

Image

Image

safesparrow commented 1 year ago

Update: I had a few bugs, since fixed. FCS compilation now works 🎉

Very first snapshot:

Image

Worth noting the ratio of time spent in each of phases 1, 2, and 3:

Image

Timings before & after:

# Sequential:
Real: 32.8 Realdelta: 15.8 Cpu: 18.4 Cpudelta: 10.9 Mem: 1445 G0: 1245 G1: 242 G2:  1 [Optimizations]

# Parallel:
Real: 29.3 Realdelta: 12.5 Cpu: 13.2 Cpudelta:  4.2 Mem: 1641 G0: 1250 G1: 258 G2:  1 [Optimizations]

So, less of a speedup than I'd expected.

EDIT: Actually I forgot to enable Server GC in the latest test. Timings with Server GC:

# Sequential:
Real: 22.9 Realdelta: 12.6 Cpu: 44.5 Cpudelta: 12.8 Mem: 4054 G0:   7 G1:  3 G2:  1 [Optimizations]
# Parallel 1:
Real: 18.5 Realdelta:  8.0 Cpu: 45.8 Cpudelta: 13.7 Mem: 4423 G0:   4 G1:  2 G2:  1 [Optimizations]
# Parallel 2:
Real: 17.4 Realdelta:  7.3 Cpu: 47.8 Cpudelta: 15.6 Mem: 4355 G0:   4 G1:  2 G2:  1 [Optimizations]
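Working out the speedups implied by the Server GC numbers above (Realdelta is the wall-clock time of the optimisation step itself, Real the whole compilation):

```python
# Optimisation step only: Sequential Realdelta 12.6s vs Parallel 2 Realdelta 7.3s
opt_speedup = round(12.6 / 7.3, 2)
# Whole compilation: Sequential Real 22.9s vs Parallel 2 Real 17.4s
total_speedup = round(22.9 / 17.4, 2)
print(opt_speedup)    # ~1.73x on the optimisation phases
print(total_speedup)  # ~1.32x end-to-end
```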
safesparrow commented 1 year ago

Bonus quest: the above has limited potential for parallelisation.

What could speed it up further is, once again, a graph-based approach.

For wider context, there are a few different ways optimisation of a file might behave. It might:

  1. not depend on any other files (I think I've already found proof that this isn't the case, i.e. cross-file inlines/other const values get evaluated),
  2. depend only on the files which contain symbols/types it references,
  3. depend on all the files above,
  4. depend on all the files above and all previous phases (for all files in the project).

Case 4. can be ruled out because the code doesn't allow for this kind of dependency.

If:

  1. is true => we could simply parallelise everything, the way parsing is parallelised.
  2. is true => we could use the dependency graph generated by type-checking and optimise using the graph-based approach.
  3. is true => we cannot do more than the above PoC already does.

For 2., the main issue would be making the optimisation methods additive, i.e. having the equivalent of AddResultsToTcState for optimisation.
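A sketch of what "additive" optimisation results could look like (hypothetical names throughout — `Delta`, `combine_deltas`, and `env_for_file` are illustrations, not existing compiler APIs): each file's optimisation produces a standalone delta, and the environment for a file is built by folding together only the deltas of the files it depends on, rather than threading one cumulative state through every file in order.

```python
from functools import reduce

# Hypothetical: a per-file optimisation result (e.g. newly discovered
# inlinable bindings / constant values), modelled as a dict keyed by symbol.
Delta = dict

def combine_deltas(a: Delta, b: Delta) -> Delta:
    """Merge two optimisation deltas; later files may shadow earlier ones."""
    return {**a, **b}

def env_for_file(dep_deltas: list[Delta]) -> Delta:
    """Build a file's optimisation environment from its dependencies' deltas
    only, instead of from the cumulative state of all preceding files."""
    return reduce(combine_deltas, dep_deltas, {})

# File C depends only on files A and B, so its environment is built from
# their two deltas and ignores every other file in the project.
delta_a = {"A.helper": "inline"}
delta_b = {"B.const": 42}
print(env_for_file([delta_a, delta_b]))
```

With results in this shape, files whose dependencies are already optimised can be processed in parallel, exactly as in the graph-based type-checking work.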

EDIT: I had a quick look, and I'm cautiously optimistic that it will be fairly trivial to make file optimization return a delta that can then be combined with others.