Open alanechang opened 5 months ago
There is no quick solution, we would need PDCE (partial dead code elimination) for that.
The main problem here is that the boxed version is used in multiple places, therefore neither to_cmm
nor flambda2 will currently do anything to that allocation. to_cmm
has a mechanism to duplicate and substitute primitives (such as the integer boxing primitives used in this case) that could in this very specific example be used to achieve the desired result (i.e. pushing the allocation down in each branch), and actually, when compiling in classic mode, the example is indeed simplified as wanted. However, that mechanism blindly substitutes any integer boxing at its use site, essentially replacing one allocation (the original boxing) by n
allocations where n
is the number of times the original allocated value is used. In general, doing this actually degrades code and makes it worse. This is why it is only activated in classic mode and for boxed numbers, where it is meant to roughly imitate what cmmgen
does. It would be a bad idea to do that all the time, so we would need a criterion for when to do that optimization.
That optimisation would need to be able to move and duplicate allocations (and/or other primitives ? @lthls suggested this could be seen as an extension of CSE), but only do so when the evaluation of the primitive at runtime would not be duplicated (this is what PDCE does iiuc), so mostly when it can be pushed down in some branches but not all of them (because in that case it's not interesting). There is likely some design space around the exact heuristic to decide when to do so: whenever not all branches use the primitive result ? with a ratio/difference of these branches ? or maybe when all uses are in "cold" branches ?
In any case, optimizing such code requires a well-crafted heuristic, which itself requires a view of a while function's body. It makes sense to try and do this during the flow analysis we already do, provided that we can do it without slowing things down too much.
Lastly, let's note that such an optimization is better done during simplify/the new global pass, as is the case with most (if not all) optimisations, rather than in to_cmm
which should stay as a "simple" translation: after all it is the goal and design that flambda2 terms are better for such optimisations.
Would be great for integer boxing to be pushed down through conditional branches more so the common path of a function doesn't allocate.
When compiling this toy example:
-O3
produces the following cmm:This code allocates on every call due to
Pbox_int/493
. However, the boxed variable is really only used in theassert false
cases. Would be great forPbox_int/493
to be inlined at the two reference sites.-Oclassic
output for reference:@Gbury Is this something you can help with? Thanks!