Testsuite passes with the debug runtime on Linux/AMD64 for me locally. All but one of the CI jobs is failing because there's only support for AMD64.
One of the builds is failing because there's potentially an interaction between this PR and Flambda which causes the included polling insertion test to fail; I'm investigating.
Feeley’s algorithm (non-minimal)
Referenced earlier, Feeley’s algorithm for balanced polling involves placing poll points in a way that gives a maximum tolerance for the number of instructions between polls. This is useful for being able to guarantee timely delivery of interrupts but can require placing poll points between arbitrary operations.
Since poll points can result in a garbage collection which can run arbitrary OCaml code or raise asynchronous exceptions, placing poll points between operations can break existing assumptions about atomicity. Even if we ignore existing code that relies on these assumptions, it is unclear what the interface should be for users wishing to write code containing critical sections and how that might interact with the flambda optimisation passes.
Since introducing polling points at arbitrary points in the source program is indeed a problem, could Feeley's algorithm instead be used as a check, to verify that there is never more than some unit of time/code space between two safe points (or even to compute the upper bound)? The compiler could then emit a warning if there exists a path of more than some measure of length that is without safe points (that measure maybe being configurable through a command line option). Add to that a primitive to introduce polling points in source code so that users can manually resolve such warnings, and it could probably make a decent compromise (though I don't know if that would need to be in addition to or as an alternative to the already implemented safe point introduction mechanism).
Have you thought about/tried something along these lines, or did I miss something?
I looked at Feeley's paper. If I understand correctly, he mentions a "naive" way to insert safepoints, at the entry and return of each procedure (including loop bodies), and then a "clever" way to insert them by allowing a certain latency between polls. This PR implements the "naive" approach. Using the "naive" approach works around an issue with a previous Mach-level implementation, which was that safepoints would sometimes be inserted at places where GC-unsafe temporaries were alive, which is incorrect. If we assume that the optimizer has already aggressively optimized functions, then in many cases (... but not necessarily in tight loops or tight recursive functions) the naive approach should be fine.
Some questions:
Some more questions:
I looked at Feeley's paper. If I understand correctly, he mentions a "naive" way to insert safepoints, at the entry and return of each procedure (including loop bodies), and then a "clever" way to insert them by allowing a certain latency between polls.
So there are actually three different ways mentioned in Feeley's paper: the naive call-return approach that he contrasts with, the balanced polling algorithm, which limits the number of instructions between polls, and then minimal polling, which is balanced polling but where the limit is effectively unlimited.
The insertion strategy here, along with the optimisation for leaf functions, are effectively minimal polling but without the potential poll before return.
* Why are we also polling when exceptions are raised? This sounds like a return to me, and you are not polling at return points.
This was mainly to simplify the analysis for eliding polls as it meant not needing to deal with exceptional control flow. Having a ponder just now though, we may be able to avoid this if raising an exception no longer made a function a leaf.
* In the past we discussed performing this transformation at the Lambda level, rather than at the Mach level. Has someone tried to do this, to compare the results? (It sounds easier at the Lambda level to use user-placed attributes to control polling; they could still be elided at the [Closure|Flambda] or Mach level if it is easier.)
I did actually also do an implementation of safepoints at Lambda (https://github.com/sadiqj/ocaml/tree/safepoints_lambda has a work-in-progress branch) which could be brought up to speed. My initial findings on that branch though were that we ended up with a superset of the polls at Mach, with the extras coming from functions that ended up inlined.
* In your graph with horizontal bars, what does the X axis mean? When it says "1.1", which ratio is that?
Those numbers are runtime, normalised to the 4.11 branch. So 1.1 would be a 10% increase in runtime over 4.11.
- Did you measure "hard" execution counts such as total number of instructions executed? These counts don't fully correlate with execution speed, but are independent of secondary effects such as code placement, so I find them useful to get another idea about the impact of a compiler change.
These are actually what are shown in the table under the code size graph. The numbers are retired instructions from the performance counters. I'll make that clearer in the headings.
* Could we please avoid introducing another runtime entry point beyond `caml_call_gc`? Those entrypoints are a pain to code in asm. See also #8805.
I can understand they're a pain. I'll investigate. I think the tricky bit is disambiguating a poll from an allocation.
Since introducing polling points at arbitrary points in the source program is indeed a problem, could Feeley's algorithm instead be used as a check, to verify that there is never more than some unit of time/code space between two safe points (or even to compute the upper bound)? The compiler could then emit a warning if there exists a path of more than some measure of length that is without safe points (that measure maybe being configurable through a command line option). Add to that a primitive to introduce polling points in source code so that users can manually resolve such warnings, and it could probably make a decent compromise (though I don't know if that would need to be in addition to or as an alternative to the already implemented safe point introduction mechanism).
Have you thought about/tried something along these lines, or did I miss something?
I haven't considered this approach, no. So far the idea here has been to have as automatic a process as possible whilst trying to break as little existing code as we can.
It's also worth pointing out here that for the outstanding trunk issues safepoints would solve and also for multicore, the tolerable time between polls can be pretty high. Even high tens of thousands of instructions would suffice.
Tests should now pass on Flambda. I've also pulled in @dra27's GitHub Actions.
The AppVeyor failure is now real - the changes to amd64.S will need making in amd64nt.asm as well, which I expect is the problem. You want to disable the i386-static workflow as well.
This looks nice, and the limited set of possible polling positions sounds like a great idea. I'm surprised by the code size impact, though - 6-11% is much higher than I thought this would be.
I think it might be possible to remove most of these polls by using a different criterion. Consider this function:
let f x =
  M.blah x;
  x + 1
It has no allocation and is not a leaf function. Yet I claim no poll is needed here: the call to `M.blah` is a direct call to a function previously defined in another module, which has already had polling inserted if necessary.
Slightly more formally: I think the goal is to ensure that evaluation of any OCaml expression eventually either polls (via allocation or inserted polls) or terminates (normally or with an exception), so that it does not cause hangs. In other words, we must ensure that every infinite sequence of evaluation steps contains a poll.
Any such infinite sequence must contain infinitely many calls to some Mach function or iterations of some loop, so putting polls on all function and loop entrypoints would suffice (but is expensive). To optimise, let's consider the function case. We can make a partial ordering over Mach functions: in definition order within a compilation unit, and dependency order between compilation units. This order is acyclic, as OCaml does not allow circular dependencies between modules. It's also finite, so if there's an infinite sequence of function calls then something has to make a forward reference.
Also, in such an infinite sequence of function calls, at most finitely many of them can be non-tail calls. (If there are infinitely many non-tail calls, then the program soon terminates with a stack overflow).
So, every such infinite sequence must contain infinitely many forward-referencing tail calls, so polling only on those suffices. In other words, a function entrypoint needs a poll only if there is some path through the function that reaches a tail call which may be a forward reference, without allocating along the way.
"may be a forward reference" is defined as being an indirect call, or a direct call to a function defined later in the same compilation unit.
The reasoning here is a bit tricky, but it could remove a lot of polls. I've had a go at hacking up a version of this patch that uses this criterion - I strongly suspect it's buggy (I think there's something weird with liveness around polling points?), so YMMV. But on `ocamlopt.opt`, it inserts 80-90% fewer poll points than the current strategy.
One key design choice of this PR is to introduce polling points at few predictable locations, in order to affect as little as possible the current reasoning on polling points, and to avoid as much as possible breaking existing code. I find this choice, and that of having an analysis to elide polls in Mach, very pragmatic. I have critical sections in two draft PRs about masking that are a bit more subtle than the ones in the janestreet code mentioned in this PR. The result is that no adaptation is required for these either. (Some test code will be broken, but some is already broken in bytecode due to polling at function call.)
I am interested in the two points “Alternative approaches: Instrumentation at Lambda” and “Things to consider: Disabling polling (via an annotation or otherwise)”.
If the user wants to reason about polling locations, specifying the behaviour of polling point insertion in terms of the back-end is not enough. There are at least some implicit assumptions on how the polling locations in Mach relate to the source code. As it turns out, it is already tricky to guarantee that flambda respects atomicity (see #10035, though I did not understand whether it can break atomicity already or whether that is only allowed as a future possibility by flambda's model). So flambda would already benefit from an annotation with which the user could request a strict interpretation of polling locations, in addition to making things clearer from a language specification point of view.
My use-case is targeted by the comment "disabling the action [of polling locations] is sufficient", i.e. with masking. I investigate abstractions for resource-management that, among others, correctly deal with asynchronous exceptions at no added cost, motivated by multicore taking the path of adding a primitive notion of first-class resource with fibers, currently incompatible with said asynchronous exceptions. To implement such abstractions, so far I can say that one needs both masking and control on polling locations for some critical sections: masking alone is not enough. Most of this is already addressed by the limited number of polling locations you introduce. It would also make things easier if an annotation let us remove the polling location at the start of functions (simply annotating a `begin...end` block might miss this, depending on how it is specified, whereas an annotation on functions could be simpler to specify). I can also see future uses of controlling the polling before raise (but only for re-raising an exception).
Emitting potential polling locations at Lambda gives a path to bring bytecode in line with native in terms of polling behaviour. Eliding polling in Mach would still be necessary, but the hardcoded polling locations in bytecode could be removed. Together with an annotation that locally disables the emission of polling locations at Lambda (and properly understood by flambda), native and bytecode would benefit from the same atomicity guarantees.
I would also like to clarify:
@gadmm has some thoughts on this. He has made the suggestion that poll points should not result in callbacks. This would obviate the need for disabling polling in critical sections for correctness. It would not solve the potential performance aspect though and would also leave issues such as #3747 unsolved. For multicore's need this may suffice though.
To summarise, I see five separate issues to fix:
The issues are ordered by difficulty. My point is that only 5. really requires any change to OCaml's behaviour regarding asynchronous callbacks, with associated compatibility risks. For the other ones it is enough to just call `caml_check_urgent_gc` instead of `caml_process_pending_actions` inside `caml_poll` (leaving aside your point on performance). I think poll points should result in callbacks (so that 5. is fixed eventually), but I was pointing out that one can get there in several steps, by not introducing any potential breaking change (5.) before a way to control polling locations has been around long enough (4.), and by having the first steps before, like this PR (1. and 2.).
I can also see future uses of controlling the polling before raise (but only for re-raising an exception).
This sounds simple enough to fix: we could never insert polling when the `%reraise` primitive is used. (I'm assuming the distinction between `raise` and `reraise` is still present in Mach, which may not be the case.) One should use `reraise` when catching an exception that was just raised, so if raises poll it is clear that there was a poll shortly before. (Note that it is still not clear to me why raises should poll at all, if we consider that returning a value need not poll.)
Note that it is still not clear to me why raises should poll at all
There's a trade-off here. Consider this function:
let rec retry f x =
  try Some (f ())
  with Not_found -> retry f x
Every non-exceptional path through this function allocates, but it is possible to end up in a non-allocating loop with `retry (fun _ -> raise Not_found)`.
There are two options that ensure sufficient polling:

* a poll in `retry`, because it has a non-allocating loop on the exceptional path
* a poll on `raise`, so that exceptional paths always poll

@stedolan thanks for the example. When you say in `f`, I'm not sure whether you meant in the prelude of `retry`, or right before the call `f ()`. Note that it would also be possible to have it before the recursive call `retry f x`.
When you say in f, I'm not sure whether you meant in the prelude of retry, or right before the call f ().
Typo! Fixed now.
(When fixing this, I accidentally clicked "mark as ready for review", which seems to be an irrevocable mistake :( )
If we see this function as
let rec retry f x =
  match f x with
  | v -> Some v
  | exception Not_found -> retry f x
it is very natural to "push" the poll in both branches, and elide it in the value case, giving an efficient poll placement.
it is very natural to "push" the poll in both branches, and elide it in the value case, giving an efficient poll placement.
Yes, the downside of doing this though is that users can no longer easily reason about where polls might appear. With the current rules, you should assume there's always a poll in the function prologue, on the loop back edge and when raising an exception - they just might be removed. This gives us room to add or remove polls at a later time without breaking existing assumptions, e.g. both the PR's original optimisations and Stephen's one wouldn't change the atomicity behaviour of code which took those placements into consideration.
If I'm understanding correctly, to be able to do an optimisation like this match one, we would also need to extend that to polls at the start of each arm of a match.
My reasoning was as follows:

* if the `match` is pure (no computation), then pushing the poll inside the branches is not an observable change
* if the `match` is an expression that may poll itself, then polling right after the match is not wrong

The problem with this reasoning is the "may poll" part: if we know for sure that the scrutinee polls, we can elide all polls. But in this case with `f x`, the user may reason about the particular function `f` they are passing to `retry`; they may know that it polls, or that it does not, and in the latter case the move is not valid.
Yes, the downside of doing this though is that users can no longer easily reason about where polls might appear.
I can see why this is an attractive property, and I'm not proposing to do differently now, but in the long run I think we should move to a model where the user should assume that a poll may happen anywhere and then provide an explicit construct to prevent polling.
There is another advantage to inserting polls only at the start of functions: it inserts fewer polls than pushing them downward, which may sometimes duplicate polls. In the original analysis at the top of this PR, the performance overhead was mostly within measurement noise while the code size overhead was significant, so focusing on code size seems reasonable.
Since there is a belief that the delay between polling locations will be negligible, there is a feature at #8961 that could interest you. It makes it possible to set up a signal handler in such a way that all polling locations are explored, by re-recording the signal from the handler itself. This has proved invaluable in tests. It might be possible to use it to get empirical data on the delay between polling locations.
(I’ve pushed a new branch to this PR - if people have made comments on specific bits of code earlier on then they may disappear. I did a pass before the push and they were all either addressed in a comment or will be in the text below. The 4.11.2 in the branch name is inaccurate - it is now against trunk)
This is an update to the polling implementation. There are a bunch of changes from the earlier implementation.
New function prologue poll eliding algorithm
I have changed the function prologue poll eliding algorithm to the one suggested by @stedolan. To borrow his summary, we only put polls in function prologues if there is a path through the function that reaches a tail call which may be a forward reference (an indirect call, or a direct call to a function defined later in the same compilation unit), without allocating along the way.
This has a significant effect on codesize impact which is now 2.2% to 3.6% on AMD64 (down from 6-11% in the previous implementation). Benchmarks below.
(I have also ported this prologue poll eliding algorithm to multicore and @shubhamkumar13 was able to run our multicore parallel benchmarks. We are satisfied that this implementation is sufficient for multicore scalability)
Rebased to trunk
To run the widest range of benchmarks and to allow users to test on private codebases, the initial PR was against the 4.11 branch. This implementation has now been rebased against trunk, which necessitated a little bit of work in selectgen due to the removal of spacetime.
Removed extra runtime entry
I’ve removed the _caml_callpoll runtime entry that @xavierleroy requested. Instead we use the allocation debugging data to indicate a poll has occurred. This now means there are no asm changes required and simplifies new architectures significantly.
arm64 support
This implementation adds arm64 architecture support. This was slightly more complicated than amd64 due to the need for a far poll (to mirror far alloc). I think my branch relaxation support code is correct but it might be worth someone double checking the maths. Benchmarks below.
i386 support
Also included is i386 support. The paucity of registers smoked out a bug in the last implementation where polls could appear between a move to a physical register and the subsequent raise.
Windows support
Removing the _caml_callpoll runtime entry meant only small changes were needed to get the Windows build to work and this implementation now passes on the 32-bit and 64-bit Appveyor builds.
Codegen cleanups
In the previous implementation the back edge of each loop on amd64 would look like:
cmp (%r14),%r15
jg <loop head>
jmp <poll in epilogue>
jmp <loop_head>
This has now been cleaned up (for both AMD64 and arm64). The code in the epilogue now jumps directly to the loop head and the last unconditional jump is no longer emitted. This saves 2-6 bytes per loop on AMD64 and 4 bytes on arm64.
As this branch is now rebased to trunk, the selection of benchmarks from sandmark is a little more limited. I rebased the nopnop patch used in the earlier performance measurements onto trunk and also added support for arm64.
I ran benchmarks on sherwood (an AMD EPYC 7702) and thunderx (a Cavium ThunderX CN8890). For sherwood I ran 64 iterations of nopnop, for thunderx I only ran 16 iterations as just 16 took 2 days of wallclock time.
The CN8890 is a pretty old processor, we’re hoping to have some new Neoverse-based benchmarking setups available soon and I will provide benchmarks on those for the completed PR.
An interesting but not unexpected side note: the ThunderX doesn't have the frontend funkiness of the Epyc. The alignment of functions seems to make less difference on the ThunderX, though it's worth pointing out that there's about an order of magnitude performance difference between the ThunderX and the Epyc, which could probably mask a lot.
I’d like to get some feedback on the revised implementation, especially the bits that affect per-architecture changes before I go through the remaining architectures.
Specifically this would be anything that impacts the two changes to the Mach IR: the op `Ipollcall` and the new test `Ipolltest`.
The remaining architectures to be implemented are:
It's not obvious how to interpret the runtime-performance graphs. Do I understand correctly?
With this analysis in mind, there are 5 benchmarks on the ARM processor (little sensitivity to nop padding) where the PR overhead is above 5%, including one above 10%, and well above the blue range, and there is one benchmark on the AMD processor (high sensitivity to nop padding) where the overhead is above 10% and at the upper limit of the blue range. Do we know what makes those specific benchmarks more sensitive to this change than the others?
Thanks for taking a look Gabriel.
Yes, the graphs are produced in the same way the first ones in the PR were. The brown marks correspond to the single execution of the PR (averaging won't help here, subsequent runs will still have the same layout).
Unfortunately the ThunderX is a pretty old system and doesn't have a working perf subsystem, so I've not been able to gather performance counters or do profiling to work out what's going on. We have some more modern arm64 systems arriving soon.
On the AMD64 system, I had a look at qr_decomposition. The main difference between polling and trunk is that caml_apply3 gets a poll. It spends ~95% of its time in a very hot path (dot -> fold_left2 -> caml_apply3) and is pretty branch heavy (~19% of instructions on trunk, 21% on this PR) so I suspect it is very susceptible to anything that might change the alignment of those jump targets. The performance counters show the instruction cache pipeline is stalled significantly longer (~28% more) but there don't seem to be more granular performance counters on the Epyc to dig much deeper.
Essentially if your hot code paths are small tight loops then the alignment of those loops can heavily influence performance.
A question for discussion on a separate topic. Do we want to add safepoints on 32-bit platforms? For example, the lack of registers on i386 means the code size impact is significant. On arm 32-bit it looks like we might have to either carve out a temporary register or save/restore, neither of which seem ideal.
The main difference between polling and trunk is that caml_apply3 gets a poll.
Interesting; I would expect the `caml_apply<N>` and `caml_curry<N>` functions to be very hot in many examples. Would performance benefit from ruling them out of the poll-insertion logic, or would we just get a poll in another part of the same hot loops, resulting in the same performance?
Looking at `caml_apply<N>` in isolation, the generated Mach IR involves an indirect tail call, which is why there's a poll. The indirect tail call could be to a function that doesn't have a poll.
The `caml_curry<N>` family that allocate are unmodified since there's no need to poll there. The ones that don't allocate (e.g. `caml_curry_4_3`) do get a poll due to the indirect tail call.
That said - I haven't yet managed to use mutually tail recursive functions to get the compiler to produce an output where the only poll is in `caml_apply<N>`. They either involve allocation or the compiler doesn't need to resort to a `caml_apply`. That's not to say it's not possible.
That said - I haven't yet managed to use mutually tail recursive functions to get the compiler to produce an output where the only poll is in `caml_apply<N>`. They either involve allocation or the compiler doesn't need to resort to a `caml_apply`. That's not to say it's not possible.
You could try something like:
let[@inline never][@local never] app f x y = f x y
let rec f x y = app g x y
and g x y = f x y
You could try something like:
Thanks, that would do it. Inspecting the Mach that code results in, the only poll encountered would be in `caml_apply2`.
More naive comments.
In my mental model, `caml_apply<N>` are low-level constructs that are not part of the natural operational semantics of OCaml programs, but are inserted to enable a more efficient implementation of n-ary applications. As a thought experiment, it should be possible to take a program before the `caml_apply<N>` are inserted, insert polls in it at this level, and then insert the `caml_apply<N>`. This suggests that there should always exist some poll-insertion strategies that never poll inside those primitives. (I have no idea whether they are better than the current one.)
One reason to avoid polling inside them would be that they are called by many different code paths, and only a fraction of them require polling; in a sense they are a control-flow bottleneck. So polling in them inevitably inserts redundant polls in many control-flow paths.
Another, simpler approach (if we wanted to avoid polling in them) would be to have two variants of the function, one that polls and one that does not. Then the poll-insertion code can rewrite their calls to whichever variant is appropriate in a given context.
Finally: before doing any of that, the question is whether there would be any performance benefit there. I don't have any intuition about that (that was my question). But it would be possible to test by just removing any polls inside those functions, and running benchmarks. (This of course may break the regular-polling guarantees, that version would be wrong but it gives an upper bound on potential gains going in this direction.)
More naive comments.
I think there are ways we could avoid putting polls in `caml_apply<N>` and `caml_curry<N>` but I think they'd come at the cost of significantly more invasive changes (e.g. polling analyses appearing at multiple compiler layers) or costly analyses at Mach.
For the latter case, looking at @lthls's example, I can't see how we'd substitute a non-polling `caml_apply<N>` without some kind of interprocedural analysis if only operating at the Mach layer.
Finally: before doing any of that, the question is whether there would be any performance benefit there. I don't have any intuition about that (that was my question). But it would be possible to test by just removing any polls inside those functions, and running benchmarks. (This of course may break the regular-polling guarantees, that version would be wrong but it gives an upper bound on potential gains going in this direction.)
I think something worth pointing out here is that we're already looking at a performance impact that's essentially within measurement noise on AMD64. It's unclear whether introducing more complexity in the polling implementation to reduce instruction count is going to be a good trade-off.
To complement @sadiqj's answer a bit: in native mode, `caml_apply<N>` functions are performance black holes due to the indirect function calls they perform. I'm confident that adding a polling point there will not degrade performance in any measurable way.
Thanks both for your comments. I wondered about the effect of polls in caml_apply due to this part of Sadiq's explanation:
The main difference between polling and trunk is that caml_apply3 gets a poll. It spends ~95% of it's time in a very hot path (dot -> fold_left2 -> caml_apply3).
You both agree that the outlier-ness of this benchmark is not a sign that we should not poll in `caml_apply<N>`. As an aside, the benchmark itself might benefit from replacing
let dot = fold_left2 (fun acc xi yi -> acc +. xi *. yi) 0.0
with
let dot x y = fold_left2 (fun acc xi yi -> acc +. xi *. yi) 0.0 x y
In this particular case, the `caml_apply3` call comes from the code of `fold_left2`, so I don't think eta-expanding `dot` will help (though it's obviously a good idea).
I think you could remove the `caml_apply3` with flambda (with the right parameters), but otherwise your only solution is to manually specialise the `fold_left2` function to its argument.
A simple strategy that might help here:

* insert no polls inside `caml_apply<N>` itself
* treat any tail call to `caml_apply<N>` as a forward reference, i.e. one that needs a poll
This removes a poll from `caml_apply<N>` at the cost of inserting one in some callers. However, callers which call `caml_apply<N>` in non-tail position won't have to poll.
The justification for this: we make a total order over functions, and the idea is to insert polls on tail calls that are forward references in this ordering, to ensure that a cycle of tail calls includes at least one poll. In the current version, that means putting polls on recursive tail calls (which are forward references), indirect tail calls (which might be forward references), but not direct tail calls to previously-defined functions (which are backward references).
The change I'm suggesting here is to consider caml_applyN to be defined at the very end of this ordering. This has two consequences: (1) tail calls to caml_applyN are always forward references needing polls, and (2) indirect tail calls within caml_applyN are backward references not needing polls, since there's nothing ahead of them they could call.
(I am assuming here that we can't end up with a cycle of tail caml_applyN calls with no intervening user code. I can't see how to cause such a thing, but maybe I'm being insufficiently evil)
(I am assuming here that we can't end up with a cycle of tail caml_applyN calls with no intervening user code. I can't see how to cause such a thing, but maybe I'm being insufficiently evil)
I am sufficiently evil:
type 'a t = T of ('a t -> 'a -> 'a) [@@unboxed]
let f (T g) x = g (T g) x
let _ = f (T f) 0
Wait, I think that's actually wrong: `f` contains an indirect call, so it will have a poll point. Please forget about this example (but I'll keep looking though).
I took the liberty of removing the "Draft" mark from this PR, which many people interpret as "not ready for review".
This PR is high on the roadmap for OCaml 4.13 and very much in need of reviews. Let's get started!
I took the liberty of removing the "Draft" mark from this PR, which many people interpret as "not ready for review".
This PR is high on the roadmap for OCaml 4.13 and very much in need of reviews. Let's get started!
Thanks @xavierleroy
The most impactful area for review at this point is the changes to the Mach IR, as changes to those at a later stage will create a lot of work on the architectural ports. Specifically we add an `Ipollcall` op and an `Ipolltest` test.
`Ipolltest` checks the young limit and can be used in an `Iifthenelse` with an `Ipollcall` on the then arm. This is placed at the end of loops and (in concert with the linearise pass) results in the compact sequence mentioned earlier on in the PR.
`Ipollcall` can also be used on its own, and this is what is used for function prologue polls.
There are some small changes in linearize to tighten up the jumps we emit at the end of loops as well.
So I think the questions at the moment that block the rest of the ports are: is there consensus on the changes to Mach and do we want safepoints on 32-bit platforms?
Thanks @damiendoligez for taking a detailed look at this.
You're right, the `path_polls` and `requires_prologue_poll` functions have issues. They've been iterated on several times since this PR landed but need an overhaul. I would also love to have a better way of thoroughly testing them.
I'm going to go through your comments on files individually, the ARM64 comparison issue is a nice spot though.
Will also follow up with answers to your questions on poll placement for tail calls and allocating handlers.
For `Itailcall`: Did you consider using the "poll just before the edge" strategy?
No, I hadn't considered it and there is definitely an argument for treating tail calls in a similar manner to loop back edges.
Having had a ponder today - I think this works within our current placement scheme and would mean fewer executed polls in tail recursive functions that have a non-allocating base case. It complicates the rules for placement slightly but in practical terms makes no difference for reasoning about atomicity (the function you are making the call to might have a prologue poll anyway).
The downside though is when we have functions with multiple paths that result in tail calls, we end up increasing code size as we need polls on each one. I think with the performance numbers showing that the impact from polls on execution times is minimal, we should aim for strategies that minimise code size.
For `Icatch`: How about suppressing the poll when all the paths leading to the `Iexit` allocate, as well as when the handler itself allocates?
I don't think I understand this case, as if the handler unconditionally allocates at the moment the poll should be elided. Would it be possible to provide a code example so I can see where this would kick in? I'm all for eliding more polls when we safely can.
I don't think I understand this case, as if the handler unconditionally allocates at the moment the poll should be elided. Would it be possible to provide a code example so I can see where this would kick in? I'm all for eliding more polls when we safely can.
For an example, you'll need a recursive `Icatch` with two handlers. (I don't know in what circumstances the compiler generates those, so I can't give source code.)

* handler 1: `Iexit 2`
* handler 2: allocate; `Iexit 1`

You don't insert a poll before `Iexit 2` because its handler allocates unconditionally. If I'm not mistaken, the current version of the code would insert a poll before `Iexit 1` (because handler 1 doesn't allocate) although it's not needed.
At present I don't think the middle-ends ever generate recursive catches with more than one handler. As far as I recall the support was added because it was expected to be needed for Flambda, which it never was (although Flambda 2 will exploit it).
I've made a pull request on this pull request (https://github.com/sadiqj/ocaml/pull/3), currently running some benchmarks to see if it improves the number of poll insertions.
The code size increase is almost exactly the same. I'll try to remove the polls on raise to see if it makes a big difference.
Here are the results without poll-on-raise. Much better as far as code size is concerned. Brown is with poll-on-raise, blue is without. The baseline is without any polling.
Here are the results without poll-on-raise. Much better as far as code size is concerned. Brown is with poll-on-raise, blue is without. The baseline is without any polling.
Thanks for running these numbers @damiendoligez. Just to clarify, both brown and blue are with your safepoints branch but blue does not add a poll on raise?
If so, I think strictly for poll-on-raise `Iraise` in `path_approx` needs to be `Alloc`. I would be surprised if this made a massive difference to density though; it would only help in situations where you were unconditionally raising.
Thanks for running these numbers @damiendoligez. Just to clarify, both brown and blue are with your safepoints branch but blue does not add a poll on raise?
In fact brown is your branch and blue is the latest version of my branch.
If so, I think strictly for poll-on-raise `Iraise` in `path_approx` needs to be `Alloc`. I would be surprised if this made a massive difference to density though; it would only help in situations where you were unconditionally raising.
I was doing that in my first version (the one that gives almost identical results with your branch).
I am now reviewing the other parts of this patch (everything except `Polling`, although I'll look at that too in due course).
I've got a segfault in one of the test files. The problem, as far as I understand it: polls are inserted after the computation of `fun_contains_calls` in selection/selectgen.

The test function is just an empty loop, so its `fun_contains_calls` field is false and it doesn't get a stack alignment prologue. But it gets a poll inserted, which does call `caml_call_gc` and eventually crashes (when llvm, being clever, uses SSE/AVX instructions to do FP computations).
The test function is just an empty loop, so its fun_contains_calls field is false and it doesn't get a stack alignment prologue.
Hrm, that's not good.
We could set `fun_contains_calls` to true if we ended up inserting a poll in a function, but that means for any analysis between selectgen and polling, decisions could be made on that field being false.
Alternatively we could align the stack in the emitter based on `fun_contains_calls` or whether there was a poll in the IR.
Neither of those seem like great choices though.
My impression is that computing `fun_contains_calls` early, during instruction selection, is a historical accident: it looks like `fun_contains_calls` is used only in the code emitters, at the very end of the back-end, so we could compute it late, e.g. during linearization.
Update 01/02/2021: Some of this summary discusses an earlier implementation of Safepoints. Be sure to check the comment later on for changes.
Summary: This is a draft Safepoints implementation for AMD64 on the 4.11 branch to allow for testing and feedback. Performance impact is probably minimal. Could do with more eyes, especially from the statmemprof team. May be other implementation options.
This is a draft PR that proposes an implementation of safepoints polls in OCaml. The current implementation only supports AMD64 but if there is consensus for this approach we will add support for other platforms as part of the final PR. We submit this as a PR on the 4.11 branch to allow testing on existing codebases.
Why safepoint polls are necessary
At the moment OCaml uses allocations as an opportunity to deliver interrupts to the mutator. These may be signals, requested minor/major collections, etc.
If a program has code that does not allocate for a long time or at all, this can lead to these interrupts never being delivered. This creates correctness issues as seen in https://github.com/ocaml/ocaml/issues/7128 and https://github.com/ocaml/ocaml/issues/3747.
In addition, Multicore OCaml relies on this interrupt mechanism to bring domains into a stop-the-world pause for the minor collections and major collector phase changes.
How this PR implements safepoints
Safepoints are implemented by adding a new `Ipoll` operation to Mach. This poll operation expands to a comparison of the young limit to the (cached) allocation pointer in %r15 (on amd64) with a jump to caml_call_poll via the epilogue. This in turn calls caml_poll and lastly `caml_process_pending_actions`.

For this to work on architectures which cache the young limit in a register (arm64, riscv), we will need to uncache it (https://github.com/ocaml/ocaml/pull/9876).
These polls are added, at most, in three places:

* in function prologues
* at the back edge of loops
* when an exception is raised
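For illustration (my own sketches, not from the PR), source fragments that could exercise each of these sites:

(* function prologue: non-allocating tail recursion *)
let rec spin n = if n = 0 then () else spin (n - 1)

(* loop back edge: a non-allocating loop body *)
let wait flag = while not !flag do () done

(* raise: a non-allocating exceptional path *)
let check x = if x < 0 then raise Exit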
These potential poll sites are chosen to ensure that safepoint polls are run in a bounded amount of time whilst trying to minimise the impact on code size and performance.
The chosen poll sites are similar to those in the minimal polling configuration in Marc Feeley's "Polling Efficiently on Stock Hardware". They differ in that OCaml has a loop construct that must be dealt with (Feeley only considers loops via tail recursion) and that we omit a poll before function return.

The lack of a poll before function return does mean the potential time between polls is bounded only by the size of the OCaml stack (consider the case of a very deep stack of non-tail recursive functions being unwound), but this is traded off against a halving of the number of polls required.
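As an illustration of that trade-off (my own example): when a deep stack of frames from the function below unwinds, the pending additions execute without passing any poll, so the delay is bounded by the stack depth rather than by a constant.

(* non-tail recursion: prologue polls happen on the way down,
   but nothing polls while the additions run during unwinding *)
let rec sum n = if n = 0 then 0 else n + sum (n - 1)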
Optimisations
The first optimisation we do (which results in similar logic to that in Feeley's minimal polling configuration) is to elide polls in leaf functions. We define a leaf function to be a function that does not call any other function and does not contain any loops.
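For example (my own sketch, not from the PR), a function like the following is a leaf under this definition, so its prologue poll would be elided:

(* no calls and no loops: a leaf function, so no poll is inserted *)
let clamp lo hi x = if x < lo then lo else if x > hi then hi else x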
The next set of optimisations exploit the fact that allocation in OCaml also results in a check against the young limit. Due to this we can elide polls in functions that allocate on all paths and also at the back edge of loops where the loop body allocates on all paths.
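A sketch of the allocation-based elision (my own example): every path through this function allocates a tuple, and the allocation already performs the young-limit check, so no separate poll is needed.

(* both branches allocate, so the prologue poll can be elided *)
let with_sign x = if x >= 0 then (x, true) else (x, false)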
Polling at loop back edges
Some experimentation was carried out to try to avoid the extra memory access (for the young limit) required at the loop backedge test. Instead an explicit loop counter was added to each recursive handler, decremented and used to carry out the full poll check only after some number of loop iterations occurred. This was also implemented at Mach as a virtual register, in the hope that for most loops this could be kept in a register. Amongst the hottest loops in the sandmark suite, this was more often than not spilled and resulted in worse performance.
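A source-level reconstruction of the rejected counter scheme (the actual experiment used a Mach virtual register; the names and the interval here are hypothetical):

let poll_interval = 1024

let sum_array a =
  let budget = ref poll_interval in
  let acc = ref 0 in
  for i = 0 to Array.length a - 1 do
    acc := !acc + a.(i);
    (* decrement the counter; only perform the full young-limit
       check once [poll_interval] iterations have elapsed *)
    decr budget;
    if !budget = 0 then begin
      (* the full poll check would be emitted here *)
      budget := poll_interval
    end
  done;
  !acc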
Specific optimisation strategies for loops also create an awkward situation where explicit loops and tail recursion have different performance characteristics. This may be less of a problem longer term as we understand that Flambda 2 aims to bring contification in the medium term.
Alternative approaches
Feeley’s algorithm (non-minimal)
Referenced earlier, Feeley’s algorithm for balanced polling involves placing poll points in a way that gives a maximum tolerance for the number of instructions between polls. This is useful for being able to guarantee timely delivery of interrupts but can require placing poll points between arbitrary operations.
Since poll points can result in a garbage collection which can run arbitrary OCaml code or raise asynchronous exceptions, placing poll points between operations can break existing assumptions about atomicity. Even if we ignore existing code that relies on these assumptions, it is unclear what the interface should be for users wishing to write code containing critical sections and how that might interact with the flambda optimisation passes.
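As a concrete illustration (my own hypothetical example) of the kind of assumption at stake: neither field update below allocates, so today no signal handler or finaliser can run between them; a poll inserted between arbitrary operations could observe the first update without the second.

type state = { mutable count : int; mutable dirty : bool }

(* relies on the two writes being atomic with respect to callbacks *)
let bump st =
  st.count <- st.count + 1;
  st.dirty <- true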
Instead, as mentioned earlier, this PR takes a similar approach to the “Minimal Polling” configuration in Feeley’s paper but with fixed locations where polls may appear.
It seems there is little code in the ecosystem that relies on atomicity between loop iterations. The assumption holds for the commonly used nano mutex in Core (https://github.com/janestreet/core/blob/master/nano_mutex/src/nano_mutex.ml).
Instrumentation at Lambda
A change to add polls at the Lambda level was suggested in the developer meeting as this side-stepped the earlier issue that some sequences of Mach operations were unsafe to add polls between, due to there being uninitialised data in the heap or live `Addr` values. Uninitialised data in the heap can occur when a poll instruction is added between the allocation of a value and the initialisation of its fields.

The current solution does not insert polls between arbitrary Mach operations and hence avoids the above issues.
By performing poll insertion at Mach, we also avoid inserting unnecessary polls. Consider a function called in a loop, which happens to be inlined: if the polls were inserted at Lambda, then the loop may have two poll points, one from the loop and another from the inlined function (see the sketch below). The inliner doesn't need to do anything special to elide poll points because the poll points are inserted after inlining, when the code is much closer to its final state.
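A sketch of that situation (my own example; the attribute and names are illustrative):

(* with polls inserted at Lambda, [step] would carry its own poll;
   after it is inlined into the loop, the body would contain that
   poll in addition to the loop's own back-edge poll *)
let[@inline] step x = x + 1

let run n =
  let r = ref 0 in
  for _ = 1 to n do
    r := step !r
  done;
  !r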
We believe that instrumentation at Lambda would likely still require a pass at a later stage to remove unnecessary polls.
Using OS signals
An alternative approach which would avoid the overhead of adding poll points is to use OS signals to interrupt running OCaml code. This is actually an approach used by Go from version 1.14 onwards: https://github.com/golang/proposal/blob/master/design/24543-non-cooperative-preemption.md
On receipt of the signal, a Go thread checks if it is in a place safe for a GC, otherwise it resumes running and another signal is tried later. For this to actually be efficient, Go needs to maintain stack and register maps at nearly every instruction.
This might work for us as well but would require significant engineering effort across many parts of the compiler and runtime.
Performance
Fundamentally the polls come down to two instruction sequences that check the young pointer against the young limit (which in this PR has been moved to the first field of domain state in order to shorten the `cmp` we need to emit for a poll). For function prologues and at the raising of exceptions we emit a compare against the young limit followed by a conditional jump to the poll code in the epilogue.

The back edge of loops goes from being a single unconditional jump to a compare, a conditional jump back to the loop head and a jump to the poll code in the epilogue.
It would be good to simplify the second sequence so the jump in the poll epilogue code goes directly to the loop head; this would save a number of bytes. (I could do with some pointers on the cleanest way to do this.)
Why concrete performance numbers are hard
Testing the performance impact of safepoints is surprisingly tricky, as adding extra instructions shifts the alignment of later instructions and this can lead to reasonably large swings in performance (see https://people.cs.umass.edu/~emery/pubs/stabilizer-asplos13.pdf). On x86 this is often down to instruction cache and decoder quirks. Jane Street, and in particular @gretay-js, have done a fair bit of work using runtime feedback to optimise code layout in OCamlFDO.
For a slightly more robust evaluation, we instrumented a version of the trunk compiler on 4.11 to pad each function's prologue with a random number of nop bytes (between 0 and 31, using the minimum number of variable-sized nops, up to 14 bytes each): https://github.com/sadiqj/ocaml/commit/5569783df5813cbe8d62c81fa046268dd2d82a1a . We then used this compiler to repeatedly run the sandmark benchmarking suite on both an AMD Zen2 and an Intel Skylake server set up for benchmarking.
The graphs below are normalised to the performance of 4.11 trunk, with the box bars indicating the extents of the minimum and maximum runtimes seen with the alignment-fuzzing instrumented compiler and with the red markers indicating the performance of the polling implementation rebased on 4.11.
Labels give the name of each benchmark and the number in brackets is 4.11 trunk’s runtime in seconds.
AMD Zen2
Intel Skylake
Code size impact
For code size impact, we use the change in the OCaml text section size (OTSS). We define OTSS as the sum of the sizes of all the OCaml text sections in the compiled binary file ignoring the data sections, the debug symbols, the text sections associated with OCaml runtime and other statically linked C libraries. The computation for this happens in this part of sandmark.
Code size increases are between 6% to 11% (relative to the 4.11 branch) on the macro benchmarks in the Sandmark suite:
We did a perf run with some of the benchmarks showing the largest increase in codesize to look at the actual change in instructions retired (as much of the increase is likely to be cold code in the epilogue):
Eliding and leaf function optimisations
We also tested the effectiveness of the eliding optimisation and leaf function optimisations in reducing the number of polls by comparing the codesize of generated binaries. The following graph is normalised to codesize of the 4.11 branch.
`no_elide` disables the eliding of polls in functions and loop bodies that always allocate. `no_elide_no_leaf` additionally disables the optimisation that removes polls from leaf functions.

The optimisations, on average, lead to about a 30% reduction in the codesize increase from polling, which indicates they reduce the number of polls added by about the same amount. The bulk of the impact comes from not putting polls in leaf functions.
Things to consider
Disabling polling (via an annotation or otherwise)
There are two reasons one might want to disable polling when writing code.
First is when writing critical sections. There are correctness issues with callbacks like finalisers and signal handlers (users already hit them: https://github.com/ocaml/ocaml/issues/8794) occurring in a critical section. For multicore there are also potential performance issues. Consider the case where a domain holds a hot lock, then hits a poll point and decides to do a major slice.
Second is tight loops where the cost of the extra memory access in the loop significantly impacts performance. It's worth pointing out that sandmark's numerical benchmarks don't seem to show a performance impact beyond layout-related effects.
The two cases can be solved in different ways. In the first we don't actually need to remove poll points, just disabling their action is sufficient. For the second, removing the poll points themselves is necessary and it would make more sense to use an annotation here.
@gadmm has some thoughts on this. He has made the suggestion that poll points should not result in callbacks. This would obviate the need for disabling polling in critical sections for correctness. It would not solve the potential performance aspect though and would also leave issues such as https://github.com/ocaml/ocaml/issues/3747 unsolved. For multicore's need this may suffice though.
Possible future optimisations
Interprocedural optimisations
Currently we don’t assume that a call to another function will result in a poll, this is due to the leaf function optimisation. If we know which functions will definitely have polls then we may be able to elide significantly more polls. @mshinwell had a few ideas on how we might do this.
Non-noalloc calls
@stedolan has a potential way we could have a lightweight poll on calls to non-noalloc externals - this would mean we could elide polls in any function that unconditionally makes one.
How to read this PR
There are two places where polls are added: the first is in asmgen and the second in selectgen. All of the supporting analyses are in polling.