jagotu opened this issue 2 years ago (status: Open).
Quick guess: the compiler is missing the loop condition probability/loop frequency, so it doesn't know whether peeling (etc.) would be beneficial. I don't know off the top of my head whether Espresso does this. The low-level facility for telling the compiler the probability of a condition is CompilerDirectives.injectBranchProbability(probability, condition). Unless the probability is known upfront, you have to measure it with counters; this is what ConditionProfile.createCountingProfile() does internally. For a compact bytecode interpreter such as Espresso/BACIL, however, it may be better to avoid the ConditionProfile abstraction and use the low-level injectBranchProbability directly.
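The counting part can be sketched in plain Java so it runs without Truffle on the classpath. This is a hypothetical illustration (CountingBranchProfile, profile and probability are made-up names), not Truffle's ConditionProfile. In a real Truffle node the counters would typically be updated only while interpreting (guarded by CompilerDirectives.inInterpreter()), and the loop condition would be passed through CompilerDirectives.injectBranchProbability(probability, condition) so the compiler sees the measured ratio. The halving on saturation is an assumption of this sketch.

```java
// Hypothetical sketch, not Truffle's ConditionProfile: counts branch outcomes in
// plain Java so the probability computation can be run and tested on its own.
final class CountingBranchProfile {
    private int trueCount;
    private int falseCount;

    /** Records one outcome of the branch and returns the condition unchanged. */
    boolean profile(boolean condition) {
        // Assumption: on saturation both counters are halved, keeping the ratio roughly intact.
        if (trueCount == Integer.MAX_VALUE || falseCount == Integer.MAX_VALUE) {
            trueCount >>= 1;
            falseCount >>= 1;
        }
        if (condition) {
            trueCount++;
        } else {
            falseCount++;
        }
        return condition;
    }

    /** Estimated probability that the condition is true; 0.5 before any outcome is recorded. */
    double probability() {
        long total = (long) trueCount + falseCount; // long avoids int overflow of the sum
        return total == 0 ? 0.5 : (double) trueCount / total;
    }
}
```

For a loop such as for (i = 0; i < 6; i++), the condition is evaluated seven times and is true six times, so the profile reports 6/7 ≈ 0.857. That ratio is the kind of information from which the compiler can derive a loop frequency.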
I don't know if seafoam shows branch probabilities, but that information should be in the IGV format output from the compiler. There's nothing like that in the image, so maybe it is indeed missing (or seafoam does not show it...)
Espresso does some kind of "LivenessAnalysis", which seems to be related to loop tracking. I expected peeling trivial loops to always be beneficial, but if the decision is driven by branch statistics, that would explain the issue, since BACIL doesn't collect any.
For future reference: AFAIK LivenessAnalysis is done to reduce the amount of "live" state, i.e., the state the runtime needs to keep track of in order to be able to deoptimize the compiled code. Espresso eagerly sets slots in the local variable array to null as early as possible to reduce this live state.
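The idea of shrinking live state can be sketched with a textbook backward liveness analysis over a toy instruction list. This is a hypothetical model (Insn, Liveness, liveOut and deadAfter are illustrative names), not Espresso's LivenessAnalysis: it computes, for each instruction, which local slots are still live afterwards. A slot that an instruction touches but that is not live after it could be cleared (set to null) right there, so it no longer has to be preserved for deoptimization.

```java
import java.util.BitSet;
import java.util.List;

/** Hypothetical toy instruction: the local slots it reads and writes, plus successor indices. */
final class Insn {
    final BitSet uses = new BitSet();
    final BitSet defs = new BitSet();
    final int[] successors;

    Insn(int[] uses, int[] defs, int... successors) {
        for (int u : uses) this.uses.set(u);
        for (int d : defs) this.defs.set(d);
        this.successors = successors;
    }
}

final class Liveness {
    /** Live-out set of every instruction, computed by backward fixpoint iteration (handles loops). */
    static BitSet[] liveOut(List<Insn> code) {
        int n = code.size();
        BitSet[] in = new BitSet[n];
        BitSet[] out = new BitSet[n];
        for (int i = 0; i < n; i++) {
            in[i] = new BitSet();
            out[i] = new BitSet();
        }
        boolean changed = true;
        while (changed) {
            changed = false;
            for (int i = n - 1; i >= 0; i--) {
                Insn insn = code.get(i);
                // out = union of the live-in sets of all successors
                BitSet newOut = new BitSet();
                for (int s : insn.successors) newOut.or(in[s]);
                // in = uses ∪ (out \ defs)
                BitSet newIn = (BitSet) newOut.clone();
                newIn.andNot(insn.defs);
                newIn.or(insn.uses);
                if (!newOut.equals(out[i]) || !newIn.equals(in[i])) {
                    out[i] = newOut;
                    in[i] = newIn;
                    changed = true;
                }
            }
        }
        return out;
    }

    /** Slots the instruction touches that are dead afterwards, i.e. could be nulled right after it. */
    static BitSet deadAfter(Insn insn, BitSet liveOut) {
        BitSet dead = (BitSet) insn.uses.clone();
        dead.or(insn.defs);
        dead.andNot(liveOut);
        return dead;
    }
}
```

For the sequence x = 1; y = x + 1; return y (x in slot 0, y in slot 1), slot 0 is live only until the second instruction, so deadAfter reports it as clearable right after y is computed.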
While checking the reason for the lower performance of some benchmarks, I noticed that if loops remain after TruffleTier (which they seem to do, as TruffleTier only peels the bytecode loop), no later tier unrolls them, even though, judging from the graphs, they should fulfill the rules for the LoopFullUnrollPhase that runs in all three of these tiers. This prevents other optimizations, such as constant folding, that unrolling the loop would enable. For example, this procedure:
Compiles to this graph after TruffleTier:
And the LoopNode survives even after low tier:

I've tried to debug the LoopFullUnrollPhase, and it seems the detection of counted loops happens only once, while the loop is still too complex to be recognized as counted. It could very well be a bug/imperfection in Truffle/Graal itself, but I don't currently have the time to investigate.

Fixing this and making sure that the loops are unrolled should provide a significant boost to the slowest benchmarks, like nbody, which loops over an array of 6 planets and where none of the virtual calls can be constantized because the loop is not unrolled.
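A minimal plain-Java illustration of what full unrolling buys in a loop like the nbody one. Body, Planet, energy and tripCount are hypothetical names, and this models the idea only, not Graal's LoopFullUnrollPhase heuristics. A loop of the form for (i = init; i < limit; i += stride) is "counted" when its trip count can be computed up front, as tripCount does; once the trip count is a small constant like 6, full unrolling replaces the loop with one copy of the body per iteration, so every array index becomes a constant.

```java
/** Hypothetical body type; calls through it are virtual, as in the benchmark. */
interface Body {
    double energy();
}

/** Hypothetical planet: kinetic energy 0.5 * m * |v|^2. */
final class Planet implements Body {
    private final double mass, vx, vy, vz;

    Planet(double mass, double vx, double vy, double vz) {
        this.mass = mass;
        this.vx = vx;
        this.vy = vy;
        this.vz = vz;
    }

    @Override
    public double energy() {
        return 0.5 * mass * (vx * vx + vy * vy + vz * vz);
    }
}

final class Unrolling {
    /** Trip count of for (i = init; i < limit; i += stride) with stride > 0 (overflow ignored). */
    static long tripCount(long init, long limit, long stride) {
        if (stride <= 0) throw new IllegalArgumentException("stride must be positive");
        return limit <= init ? 0 : (limit - init + stride - 1) / stride;
    }

    /** Counted loop with a constant trip count of 6. */
    static double looped(Body[] bodies) {
        double e = 0.0;
        for (int i = 0; i < 6; i++) {
            e += bodies[i].energy();
        }
        return e;
    }

    /** Conceptual result of full unrolling: one copy of the body per iteration, constant indices. */
    static double unrolled(Body[] bodies) {
        double e = 0.0;
        e += bodies[0].energy();
        e += bodies[1].energy();
        e += bodies[2].energy();
        e += bodies[3].energy();
        e += bodies[4].energy();
        e += bodies[5].energy();
        return e;
    }
}
```

In plain Java both methods compute the same sum; the difference only matters to the compiler. If the array contents are known at compile time, each constant-index load bodies[k] can be folded and each energy() call devirtualized, which the loop form prevents as long as the loop survives.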