sadiqj commented 3 years ago

Update 01/02/2021 Some of this summary discusses an earlier implementation of Safepoints. Be sure to check the comment later on for changes.

Summary This is a draft Safepoints implementation for AMD64 on the 4.11 branch to allow for testing and feedback. Performance impact is probably minimal. Could do with more eyes, especially from the statmemprof team. May be other implementation options.

This is a draft PR that proposes an implementation of safepoints polls in OCaml. The current implementation only supports AMD64 but if there is consensus for this approach we will add support for other platforms as part of the final PR. We submit this as a PR on the 4.11 branch to allow testing on existing codebases.

Why safepoint polls are necessary

At the moment OCaml uses allocations as an opportunity to deliver interrupts to the mutator. These may be signals, requested minor/major collections, etc..

If a program has code that does not allocate for a long time or at all, this can lead to these interrupts never being delivered. This creates correctness issues as seen in https://github.com/ocaml/ocaml/issues/7128 and https://github.com/ocaml/ocaml/issues/3747.

In addition, Multicore OCaml relies on this interrupt mechanism to bring domains into a stop-the-world pause for the minor collections and major collector phase changes.

How this PR implements safepoints

Safepoints are implemented by adding a new Ipoll operation to Mach. This poll operation expands to a comparison of the young limit to the (cached) allocation pointer in %r15 (on amd64) with a jump to caml_call_poll via the epilogue. This in term calls caml_poll and lastly caml_process_pending_actions.

For this to work, on architectures which cache the young limit (arm64, riscv) in a register, we will need to uncache this (https://github.com/ocaml/ocaml/pull/9876).

These polls are added, at most, in three places:

At the start of functions (after the prologue)
At the back edge of loops
At the raising of an exception

These potential poll sites are chosen to ensure that safepoint polls are run in a bounded amount of time whilst trying to minimise the impact on code size and performance.

The chosen poll sites are similar to those in the minimal polling configuration in Marc Feeley's "Polling Efficiently on Stock Hardware". They differ in that OCaml has a loop construct that must be dealt with (Feeley only considers loops via tail recursion) and that we omit a poll before function return.

The lack of a poll before function return does result in a potential time between polls bounded only by the size of the OCaml stack (consider the case of a very deep stack of non-tail recursive functions being unwound) but this is traded-off by a halving of the number of polls required.

Optimisations

The first optimisation we do (which results in similar logic to that in Feeley's minimal polling configuration) is to elide polls in leaf functions. We define a leaf function to be a function that does not call any other function and does not contain any loops.

The next set of optimisations exploit the fact that allocation in OCaml also results in a check against the young limit. Due to this we can elide polls in functions that allocate on all paths and also at the back edge of loops where the loop body allocates on all paths.

Polling at loop back edges

Some experimentation was carried out to try to avoid the extra memory access (for the young limit) required at the loop backedge test. Instead an explicit loop counter was added to each recursive handler, decremented and used to carry out the full poll check only after some number of loop iterations occurred. This was also implemented at Mach as a virtual register, in the hope that for most loops this could be kept in a register. Amongst the hottest loops in the sandmark suite, this was more often than not spilled and resulted in worse performance.

Specific optimisation strategies for loops also create an awkward situation where explicit loops and tail recursion have different performance characteristics. This may be less of a problem longer term as we understand that Flambda 2 aims to bring contification in the medium term.

Alternative approaches

Feeley’s algorithm (non-mimimal)

Referenced earlier, Feeley’s algorithm for balanced polling involves placing poll points in a way that gives a maximum tolerance for the number of instructions between polls. This is useful for being able to guarantee timely delivery of interrupts but can require placing poll points between arbitrary operations.

Since poll points can result in a garbage collection which can run arbitrary OCaml code or raise asynchronous exceptions, placing poll points between operations can break existing assumptions about atomicity. Even if we ignore existing code that relies on these assumptions, it is unclear what the interface should be for users wishing to write code containing critical sections and how that might interact with the flambda optimisation passes.

Instead, as mentioned earlier, this PR takes a similar approach to the “Minimal Polling” configuration in Feeley’s paper but with fixed locations where polls may appear.

It seems there is little code in the ecosystem that relies on atomicity between loop iterations. The assumption holds for the commonly used nano mutex in Core (https://github.com/janestreet/core/blob/master/nano_mutex/src/nano_mutex.ml).

Instrumentation at Lambda

A change to add polls at the Lambda level was suggested in the developer meeting as this side-stepped the earlier issue that some sequences of Mach operations were unsafe to add polls between, due there being uninitialised data in the heap or there being live Addr values. Uninitialised data in the heap can occur when a poll instruction is added between the allocation of a value and the initialisation of its' fields.

The current solution does not insert polls between arbitrary Mach operations and hence, avoids the above issues.

By performing poll insertion at Mach, we also avoid inserting unnecessary polls. Consider a function called in a loop, which happens to be inlined. If the polls were inserted at Lambda, then the loop may have two poll points, one from the loop and another from the inlined function. The inliner doesn’t need to do anything special to elide poll points because the poll points are inserted after inlining (when the code is much closer to it's final state).

We believe that instrumentation at Lambda would likely still require a pass at a later stage to remove unnecessary polls.

Using OS signals

An alternative approach which would avoid the overhead of adding poll points is to use OS signals to interrupt running OCaml code. This is actually an approach used by Go from version 1.14 onwards: https://github.com/golang/proposal/blob/master/design/24543-non-cooperative-preemption.md

On receipt of the signal a Go thread checks if it is in a place safe for a GC otherwise it resumes running and another signal is tried later. For this to actually be efficient, Go needs to maintain stack and register maps at nearly ever instruction.

This might work for us as well but would require significant engineering effort across many parts of the compiler and runtime.

Performance

Fundamentally the polls come down to two instruction sequences that check the the young pointer against the young limit (which in this PR has been moved to the first field of domain state in order to short the cmp we need to emit for a poll). For function prologues and at the raising of exceptions we emit:

cmp (%r14),%r15
jbe <poll in epilogue>

The back edge of loops go from being an unconditional jump to:

cmp (%r14),%r15
jg <loop head>
jmp <poll in epilogue>
jmp <loop_head>

It would be good to simplify the second sequence so the jump in the poll epilogue code goes directly to the loop_head, this would save a number of bytes. (I could do with some pointers on the cleanest way to do this)

Why concrete performance numbers are hard

Testing the performance impact of safepoints is surprisingly tricky as adding extra instructions shifts the alignment of later instructions and this can lead to reasonably large swings in performance (see https://people.cs.umass.edu/~emery/pubs/stabilizer-asplos13.pdf). On x86 this is often down to instruction cache and decoder quirks. Jane Street and in particular @gretay-js have done a fair bit of using runtime feedback to optimise layout in OCamlFDO

For a slightly more robust evaluation, we instrumented a version of the trunk compiler on version 4.11 to pad each function’s prologue with a random length of nop instructions (between 0 and 31 bytes, using the minimum variable sized nops up to 14 bytes to satisfy the size): https://github.com/sadiqj/ocaml/commit/5569783df5813cbe8d62c81fa046268dd2d82a1a . We then used this compiler to repeatedly run the sandmark benchmarking suite on both AMD Zen2 and an Intel Skylake server set up for benchmarking.

The graphs below are normalised to the performance of 4.11 trunk, with the box bars indicating the extends of the minimum and maximum runtimes seen through the instrumented fuzzing alignment compiler and with the red markers indicating the performance of the polling rebased on 4.11.

Labels give the name of each benchmark and the number in brackets is 4.11 trunk’s runtime in seconds.

AMD Zen2

Intel Skylake

Code size impact

For code size impact, we use the change in the OCaml text section size (OTSS). We define OTSS as the sum of the sizes of all the OCaml text sections in the compiled binary file ignoring the data sections, the debug symbols, the text sections associated with OCaml runtime and other statically linked C libraries. The computation for this happens in this part of sandmark.

Code size increases are between 6% to 11% (relative to the 4.11 branch) on the macro benchmarks in the Sandmark suite:

We did a perf run with some of the benchmarks showing the largest increase in codesize to look at the actual change in instructions retired (as much of the increase is likely to be cold code in the epilogue):

Benchmark	4.11 instructions	4.11+polling instructions	retired instructions increase
matrix_multiplication.1024	43,664,714,976	45,780,927,780	4.85%
game_of_life.256	74,382,927,340	76,021,511,913	2.20%
qr-decomposition	17,670,808,611	18,255,799,806	3.31%
floyd_warshall.512	29,541,165,257	29,813,730,367	0.92%
quicksort.4000000	17,503,410,772	17,909,921,378	2.32%
cubicle.szymanski_at	2,655,257,735,029	2,842,644,662,492	7.06%

Eliding and leaf function optimisations

We also tested the effectiveness of the eliding optimisation and leaf function optimisations in reducing the number of polls by comparing the codesize of generated binaries. The following graph is normalised to codesize of the 4.11 branch. no_elide disables the eliding of polls in functions and loop bodies that always allocate. no_elide_no_leaf additionally disables the optimisation that removes polls from leaf functions:

The optimisations, on average, lead to about a 30% reduction in codesize increase from polling which indicates they reduce the number of polls added by about the same amount. The bulk of the impact comes from not putting polls in leaf functions.

Things to consider

Disabling polling (via an annotation or otherwise)

There are two reasons one might want to disable polling when writing code.

First is when writing critical sections. There are correctness issues with callbacks like finalisers and signal handlers (users already hit them: https://github.com/ocaml/ocaml/issues/8794) occurring in a critical section. For multicore there are also potential performance issues. Consider the case where a domain holds a hot lock, then hits a poll point and decides to do a major slice.

Second is tight loops where cost of the extra memory access in the loop negatively impacts the performance significantly. It's worth pointing out that sandmark's numerical benchmarks don't seem to show a performance impact beyond layout-related effects.

The two cases can be solved in different ways. In the first we don't actually need to remove poll points, just disabling their action is sufficient. For the second, removing the poll points themselves is necessary and it would make more sense to use an annotation here.

@gadmm has some thoughts on this. He has made the suggestion that poll points should not result in callbacks. This would obviate the need for disabling polling in critical sections for correctness. It would not solve the potential performance aspect though and would also leave issues such as https://github.com/ocaml/ocaml/issues/3747 unsolved. For multicore's need this may suffice though.

Possible future optimisations

Interprocedural optimisations

Currently we don’t assume that a call to another function will result in a poll, this is due to the leaf function optimisation. If we know which functions will definitely have polls then we may be able to elide significantly more polls. @mshinwell had a few ideas on how we might do this.

Non-noalloc calls

@stedolan has a potential way we could have a lightweight poll on calls to non-noalloc external calls - this would mean we could elide polls on any functions that unconditionally call one.

How to read this PR

There are two places where polls are added. The first in asmgen and the second in selectgen. All of the supporting analyses are in polling.

sadiqj commented 3 years ago

Testsuite passes with the debug runtime on Linux/AMD64 for me locally. All but one of the CI jobs is failing because there's only support for AMD64.

One of the builds is failing because there's potentially an interaction between this PR and Flambda which causes the included polling insertion test to fail, am investigating.

Gbury commented 3 years ago

Feeley’s algorithm (non-mimimal)

Referenced earlier, Feeley’s algorithm for balanced polling involves placing poll points in a way that gives a maximum tolerance for the number of instructions between polls. This is useful for being able to guarantee timely delivery of interrupts but can require placing poll points between arbitrary operations.

Since poll points can result in a garbage collection which can run arbitrary OCaml code or raise asynchronous exceptions, placing poll points between operations can break existing assumptions about atomicity. Even if we ignore existing code that relies on these assumptions, it is unclear what the interface should be for users wishing to write code containing critical sections and how that might interact with the flambda optimisation passes.

Since introducing polling points at arbitrary points in the source program is indeed a problem, could feeley's algorithm instead be used as a check, to verify that there is never more than some unit of time/code space between two safe points (or even to compute the uppper bound) ? The compiler could then emit a warning if there exists a path of more than some measure of length that is without safe points (that measure maybe being configurable through a command line option). Add to that a primitive to introduce polling points in source code so that user can manually resolve such warnings, and it could probably make a decent compromise (though I don't know if that would need to be in addition to or as an alternative to the already implemented safe point introduction mechanism).

Have you thought/tried something along these lines, or did I miss something ?

gasche commented 3 years ago

I looked at Feeley's paper. If I understand correctly, he mentions a "naive" way to insert safepoints, at the entry and return of each procedure (including loop bodies), and then a "clever" way to insert them by allowing a certain latency between polls. This PR implements the "naive" approach. Using the "naive" approach works around an issue with a previous Mach-level implementation, which was that safepoints would sometimes be inserted at places where GC-unsafe temporaries were alive, which is incorrect. If we assume that the optimizer has already aggressively optimized functions, then in many cases (... but not necessarily in tight loops or tight recursive functions) the naive approach should be fine.

Some questions:

Why are we also polling when exceptions are raised? This sounds like a return to me, and you are not polling at return points.
In the past we discussed placing performing this transformation at the Lambda level, rather than at the Mach level. Has someone tried to do this, to compare the results? (It sounds easier at the Lambda level to use user-placed attributes to control polling; they could still be elided at the [Closure|Flambda] or Mach level if it is easier.)

xavierleroy commented 3 years ago

Some more questions:

In your graph with horizontal bars, what does the X axis mean? When it says "1.1", which ratio is that?
Did you measure "hard" execution counts such as total number of instructions executed? These counts don't fully correlate with execution speed, but are independent of secondary effects such as code placement, so I find them useful to get another idea about the impact of a compiler change.
Could we please avoid introducing another runtime entry point beyond caml_call_gc? Those entrypoints are a pain to code in asm. See also #8805.

sadiqj commented 3 years ago

I looked at Feeley's paper. If I understand correctly, he mentions a "naive" way to insert safepoints, at the entry and return of each procedure (including loop bodies), and then a "clever" way to insert them by allowing a certain latency between polls.

So there's actually three different ways mentioned in Feeley's paper. There's the naive call-return that he contrasts with, the balanced polling algorithm which limits the number of instructions between polls and then minimal polling, which is balanced polling but where the limit is effectively unlimited.

The insertion strategy here, along with the optimisation for leaf functions, are effectively minimal polling but without the potential poll before return.

* Why are we also polling when exceptions are raised? This sounds like a return to me, and you are not polling at return points.

This was mainly to simplify the analysis for eliding polls as it meant not needing to deal with exceptional control flow. Having a ponder just now though, we may be able to avoid this if raising an exception no longer made a function a leaf.

* In the past we discussed placing performing this transformation at the Lambda level, rather than at the Mach level. Has someone tried to do this, to compare the results? (It sounds easier at the Lambda level to use user-placed attributes to control polling; they could still be elided at the [Closure|Flambda] or Mach level if it is easier.)

I did actually also do an implementation of safepoints at lambda (https://github.com/sadiqj/ocaml/tree/safepoints_lambda has a work in progress) branch which could be brought up to speed. My initial findings on that branch though were that we ended up with a superset of the polls at Mach, with the extra being from functions that ended up inlined.

* In your graph with horizontal bars, what does the X axis mean?  When it says "1.1", which ratio is that?

Those numbers are runtime, normalised to the 4.11 branch. So 1.1 would be a 10% increase in runtime over 4.11.

Did you measure "hard" execution counts such as total number of instructions executed? These counts don't fully correlate with execution speed, but are independent of secondary effects such as code placement, so I find them useful to get another idea about the impact of a compiler change.

These are actually what are shown in the table under the code size graph. The numbers are retired instructions from the performance counters. I'll make that clearer in the headings.

* Could we please avoid introducing another runtime entry point beyond `caml_call_gc`?  Those entrypoints are a pain to code in asm.  See also #8805.

I can understand they're a pain. I'll investigate. I think the tricky bit is disambiguating a poll from an allocation.

sadiqj commented 3 years ago

Since introducing polling points at arbitrary points in the source program is indeed a problem, could feeley's algorithm instead be used as a check, to verify that there is never more than some unit of time/code space between two safe points (or even to compute the uppper bound) ? The compiler could then emit a warning if there exists a path of more than some measure of length that is without safe points (that measure maybe being configurable through a command line option). Add to that a primitive to introduce polling points in source code so that user can manually resolve such warnings, and it could probably make a decent compromise (though I don't know if that would need to be in addition to or as an alternative to the already implemented safe point introduction mechanism).

Have you thought/tried something along these lines, or did I miss something ?

I haven't considered this approach, no. So far the idea here has been to have as automatic a process as possible whilst trying to break as little existing code as we can.

It's also worth pointing out here that for the outstanding trunk issues safepoints would solve and also for multicore, the tolerable time between polls can be pretty high. Even high tens of thousands of instructions would suffice.

sadiqj commented 3 years ago

Tests should now pass on Flambda, I've also pulled @dra27 's github actions.

dra27 commented 3 years ago

The AppVeyor failure is now real - the changes to amd64.S will need making the amd64nt.asm as well which I expect is the problem. You want to disable the i386-static workflow as well.

stedolan commented 3 years ago

This looks nice, and the limited set of possible polling positions sounds like a great idea. I'm surprised by the code size impact, though - 6-11% is much higher than I thought this would be.

I think it might be possible to remove most of these polls by using a different criterion. Consider this function:

let f x =
  M.blah x;
  x + 1

It has no allocation and is not a leaf function. Yet I claim no poll is needed here: the call to M.blah is a direct call to a function previously defined in another module, which has already had polling inserted if necessary.

Slightly more formally: I think the goal is to ensure that evaluation of any OCaml expression eventually either polls (via allocation or inserted polls) or terminates (normally or with an exception), so that it does not cause hangs. In other words, we must ensure that every infinite sequence of evaluation steps contains a poll.

Any such infinite sequence must contain infinitely many calls to some Mach function or iterations of some loop, so putting polls on all function and loop entrypoints would suffice (but is expensive). To optimise, let's consider the function case. We can make a partial ordering over Mach functions: in definition order within a compilation unit, and dependency order between compilation units. This order is acyclic, as OCaml does not allow circular dependencies between modules. It's also finite, so if there's an infinite sequence of function calls then something has to make a forward reference.

Also, in such an infinite sequence of function calls, at most finitely many of them can be non-tail calls. (If there are infinitely many non-tail calls, then the program soon terminates with a stack overflow).

So, every such infinite sequence must contain many forward-referencing tail calls, so polling only on those suffices. In other words, a function entrypoint needs a poll only if there is some path through the function which:

does not contain an allocation
ends in a tail call
where tail-called function may be a forward reference

"may be a forward reference" is defined as being an indirect call, or a direct call to a function defined later in the same compilation unit.

The reasoning here is a bit tricky, but it could remove a lot of polls. I've had a go at hacking up a version of this patch that uses this criterion - I strongly suspect it's buggy (I think there's something weird with liveness around polling points?), so YMMV. But on ocamlopt.opt, it inserts 80-90% fewer poll points than the current strategy.

gadmm commented 3 years ago

One key design choice of this PR is to introduce polling points at few predictable locations, in order to affect as little as possible the current reasoning on polling points, and avoid as much as possible breaking existing code. I find this choice, and that of having an analysis to elide poll in Mach very pragmatic. I have critical sections in two draft PRs about masking that are a bit more subtle than the ones in the janestreet code mentioned in this PR. The result is that no adaptation is required for these either. (Some test code will be broken, but some is already broken in bytecode due to polling at function call.)

I am interested in the two points “Alternative approaches: Instrumentation at Lambda” and “Things to consider: Disabling polling (via an annotation or otherwise)”.

If the user wants to reason about polling locations, specifying the behaviour of polling point insertion in terms of the back-end is not enough. There are at least some implicit assumptions on how the polling locations in Mach relate to the source code. As it turns out, it is already tricky to guarantee that flambda respects atomicity (see #10035, though I did not understand if it could break atomicity already or only allowed as a future possibility by flambda's model). So flambda would already benefit from an annotation with which the user could request a strict interpretation of polling locations, in addition to making things clearer from a language specification point of view.
My use-case is targeted by the comment "disabling the action [of polling locations] is sufficient", i.e. with masking. I investigate abstractions for resource-management that, among others, correctly deal with asynchronous exceptions at no added cost, motivated by multicore taking the path of adding a primitive notion of first-class resource with fibers currently incompatible with said asynchronous exceptions. To implement such abstractions so far I can say that one needs both masking and control on polling locations for some critical sections—masking alone is not enough. Most is already addressed by the limited number of polling locations you introduce. It would also make things easier if an annotation let us remove the polling location at the start of functions (simply annotating a begin...end block might miss this, depending on how it is specified, whereas an annotation on functions could be simpler to specify). I can also see future uses of controlling the polling before raise (but only for re-raising an exception).
Emitting potential polling locations at Lambda gives a path to bring bytecode in line with native in terms of polling behaviour. Eliding polling in Mach would still be necessary, but the hardcoded polling locations in bytecode could be removed. Together with an annotation that locally disables the emission of polling locations at Lambda (and properly understood by flambda), native and bytecode would benefit from the same atomicity guarantees.

I would also like to clarify:

@gadmm has some thoughts on this. He has made the suggestion that poll points should not result in callbacks. This would obviate the need for disabling polling in critical sections for correctness. It would not solve the potential performance aspect though and would also leave issues such as #3747 unsolved. For multicore's need this may suffice though.

To summarise I see five separate issues to fix:

have enough polling locations for multicore
have enough polling location to fix #7128 ("ref table grows without bounds")
bring bytecode polling behaviour in line with native
justify the transformations used in flambda (allocation points can be moved around, e.g. #10035)
have enough polling location to fix #3747 ("catch_break does not always catch break")

The issues are ordered by difficulty. My point is that only 5. really requires any change to OCaml's behaviour regarding asynchronous callbacks, with associated compatibility risks. For the other ones it is enough to just call caml_check_urgent_gc instead of caml_process_pending_actions inside caml_poll (leaving aside your point on performance). I think poll points should result in callbacks (so that 5. is fixed eventually), but I was pointing out that one can get there in several steps, by not introducing any potential breaking change (5.) before a way to control polling locations has been around long enough (4.), and by having the first steps before like this PR (1. and 2.).

gasche commented 3 years ago

I can also see future uses of controlling the polling before raise (but only for re-raising an exception).

This sounds simple enough to fix: we could never insert polling when the %reraise primitive is used. (I'm assuming the distinction between raise and reraise is still present in Mach, which may not be the case). One should use reraise when catching an exception that was just raised, so if raises poll it is clear that there was a poll shortly before. (Note that it is still not clear to me why raises should poll at all, if we consider that returning a value need not poll.)

stedolan commented 3 years ago

Note that it is still not clear to me why raises should poll at all

There's a trade-off here. Consider this function:

let rec retry f x =
   try Some (f ())
   with Not_found -> retry f x

Every non-exceptional path through this function allocates, but it is possible to end up in a non-allocating loop with retry (fun _ -> raise Not_found).

There are two options that ensure sufficient polling:

Insert a poll in the at the start of retry, because it has a non-allocating loop on the exceptional path
Insert a poll in raise, so that exceptional paths always poll

gasche commented 3 years ago

@stedolan thanks for the example. When you say in f, I'm not sure whether you meant in the prelude of retry, or right before the call f (). Note that it would also be possible to have it before the recursive call retry f x.

stedolan commented 3 years ago

When you say in f, I'm not sure whether you meant in the prelude of retry, or right before the call f ().

Typo! Fixed now.

(When fixing this, I accidentally clicked "mark as ready for review", which seems to be an irrevocable mistake :( )

gasche commented 3 years ago

If we see this function as

let rec retry f x =
  match f x with
  | v -> Some v
  | exception Not_found -> retry f x

it is very natural to "push" the poll in both branches, and elide it in the value case, giving an efficient poll placement.

sadiqj commented 3 years ago

it is very natural to "push" the poll in both branches, and elide it in the value case, giving an efficient poll placement.

Yes, the downsides of doing this though is that users can no longer easily reason about where polls might appear. With the current rules, you should assume there's always a poll in the function prologue, loop backedge and when raising an exception - they just might be removed. This gives us room to add or remove polls at a later time without breaking existing assumptions e.g both the PR's original optimisations and Stephen's one wouldn't change the atomicity behaviour of code which took those placements in to consideration.

If I'm understanding correctly, to be able to do an optimisation like this match one we would also need to extend that to polls at the start of each arm of a match.

gasche commented 3 years ago

My reasoning was as follows:

if the scrutinee of the match is pure (no computation), then pushing the poll inside the branches is not an observable change
if the scrutinee of the match is an expression that may poll itself, then polling right after the match is not wrong

The problem with this reasoning is the "may poll" part: if we know for sure that the scrutinee polls, we can elide all polls. But in this case with f x, the user may reason about the particular function f they are passing to retry and they may know that it polls, or that it does not, and in the latter case the move is not valid.

lpw25 commented 3 years ago

Yes, the downsides of doing this though is that users can no longer easily reason about where polls might appear.

I can see why this is an attractive property, and I'm not proposing to do differently now, but in the long run I think we should move to a model where the user should assume that a poll may happen anywhere and then provide an explicit construct to prevent polling.

stedolan commented 3 years ago

There is another advantage to inserting polls only at the start of functions: it inserts fewer polls than pushing them downward, which may sometimes duplicate polls. In the original analysis at the top of this PR, the performance overhead was mostly within measurement noise while the code size overhead was significant, so focusing on code size seems reasonable.

gadmm commented 3 years ago

Since there is a belief that the delay between polling location will be negligible, there is a feature at #8961 that could interest you. It makes is possible to set up a signal handler in such a way that all polling locations are explored, by re-recording the signal from the handler itself. This has proved invaluable in tests. It might be possible to use it to get empirical data on the delay between polling locations.

sadiqj commented 3 years ago

(I’ve pushed a new branch to this PR - if people have made comments on specific bits of code earlier on then they may disappear. I did a pass before the push and they were all either addressed in a comment or will be in the text below. The 4.11.2 in the branch name is inaccurate - it is now against trunk)

This is an update to the polling implementation. There are a bunch of changes from the earlier implementation.

New function prologue poll eliding algorithm

I have changed the function prologue poll eliding algorithm to the one suggested by @stedolan. To borrow his summary, we only put polls in function prologues if there is a path through the function which:

Does not contain an allocation
Ends in a tail call
- Where the tail call is indirect or to a function that has yet to be compiled or self

This has a significant effect on codesize impact which is now 2.2% to 3.6% on AMD64 (down from 6-11% in the previous implementation). Benchmarks below.

(I have also ported this prologue poll eliding algorithm to multicore and @shubhamkumar13 was able to run our multicore parallel benchmarks. We are satisfied that this implementation is sufficient for multicore scalability)

Rebased to trunk

To run the widest range of benchmarks and to allow users to test on private codebases the initial PR was against the 4.11. This implementation has now been rebased against trunk which necessitated a little bit of work in selectgen due to the removal of spacetime.

Removed extra runtime entry

I’ve removed the _caml_callpoll runtime entry that @xavierleroy requested. Instead we use the allocation debugging data to indicate a poll has occurred. This now means there are no asm changes required and simplifies new architectures significantly.

arm64 support

This implementation adds arm64 architecture support. This was slightly more complicated than amd64 due to the need for a far poll (to mirror far alloc). I think my branch relaxation support code is correct but it might be worth someone double checking the maths. Benchmarks below.

i386 support

Also included is i386 support. The paucity of registers smoked out a bug in the last implementation where polls could appear between a move to a physical register and the subsequent raise.

Windows support

Removing the _caml_callpoll runtime entry meant only small changes were needed to get the Windows build to work and this implementation now passes on the 32-bit and 64-bit Appveyor builds.

Codegen cleanups

In the previous implementation the back edge of each loop on amd64 would look like:

cmp (%r14),%r15
jg <loop head>
jmp <poll in epilogue>
jmp <loop_head>

This has now been cleaned up (for both AMD64 and arm64). The code in the epilogue now jumps directly to the loop head and the last unconditional loop is no longer emitted. This saves 2-6 bytes per loop on AMD64 and 4 bytes on arm64.

Benchmarks

As this branch is now rebased to trunk the selection of benchmarks from sandmark are a little more limited. I rebased the nopnop patch used in the earlier performance measurements onto the trunk and also added support for arm64.

I ran benchmarks on sherwood (an AMD EPYC 7702) and thunderx (a Cavium ThunderX CN8890). For sherwood I ran 64 iterations of nopnop, for thunderx I only ran 16 iterations as just 16 took 2 days of wallclock time.

The CN8890 is a pretty old processor, we’re hoping to have some new Neoverse-based benchmarking setups available soon and I will provide benchmarks on those for the completed PR.

Runtime

Interesting but not unexpected sidenote, the ThunderX doesn’t have the frontend funkiness of the Epyc. The alignment of functions seems to make less difference on the ThunderX though it’s worth pointing out that there’s about an order of magnitude performance difference between the ThunderX and the Epyc which could probably mask a lot.

Code size

Next steps

I’d like to get some feedback on the revised implementation, especially the bits that affect per-architecture changes before I go through the remaining architectures.

Specifically this would be anything that impacts the two changes to the Mach IR, the Op Ipollcall and the new test Ipolltest.

The remaining architectures to be implemented are:

arm32
ppc32
ppc64
riscv64
s390x

gasche commented 3 years ago

It's not obvious to interpret the runtime-performance graphs. Do I understand correctly?

the blue ranges correspond to some form of of random padding of "nop" in the generated code (with otherwise no change from trunk), to test the pre-existing sensibility of the code to code-layout changes
the brown marks correspond to the (average?) performance of the actual PR, with the extra polling instructions
if the brown mark is well inside the blue range, this means that the effect of the polling instructions are indistinguishable from the effects of code layout changes

With this analysis ind mind, there are 5 benchmarks on the ARM processor (little sensitivity to nop padding) where the PR overhead is above 5%, including one above 10%, and well above the blue range, and there is one benchmark on the AMD processor (high sensitivity to nop padding) where the overhead is above 10% and in the upper limit of the blue range. Do we know what makes those specific benchmarks more sensitive to this change than the others?

sadiqj commented 3 years ago

Thanks for taking a look Gabriel.

Yes, the graphs are produced in the same way the first ones in the PR were. The brown marks correspond to the single execution of the PR (averaging won't help here, subsequent runs will still have the same layout).

Unfortunately the ThunderX is a pretty old system and doesn't have a working perf subsystem, so I've not been able to gather performance counters or do profiling to work out what's going on. We have some more modern arm64 systems arriving soon.

On the AMD64 system, I had a look qr_decomposition. The main difference between polling and trunk is that caml_apply3 gets a poll. It spends ~95% of it's time in a very hot path (dot -> fold_left2 -> caml_apply3) and is pretty branch heavy (~19% of instructions on trunk, 21% on this PR) so I suspect it is very susceptible to anything that might change the alignment of those jump targets. The performance counters show the instruction cache pipeline is stalled significantly longer (~28% more) but there doesn't seem to be more granular performance counters on the Epyc to dig much deeper.

Essentially if your hot code paths are small tight loops then the alignment of those loops can heavily influence performance.

A question for discussion on a separate topic. Do we want to add safepoints on 32-bit platforms? For example, the lack of registers on i386 means the code size impact is significant. On arm 32-bit it looks like we might have to either carve out a temporary register or save/restore, neither of which seem ideal.

gasche commented 3 years ago

The main difference between polling and trunk is that caml_apply3 gets a poll.

Interesting; I would expect the caml_apply<N> and caml_curry<N> to be very hot in many examples. Would performances benefit from ruling them out of the poll-insertion logic, or would we just get a poll in another part of the same hot loops, resulting in the same performance?

sadiqj commented 3 years ago

Looking at caml_apply<N> in isolation, the generated Mach IR involves an indirect tail call which is why there's a poll. The indirect tail call could be to a function that doesn't have a poll.

The caml_curry<N> family that allocate are unmodified since there's no need to poll there. The ones that don't allocate (e.g caml_curry_4_3) do get a poll due to the indirect tail call.

That said - I haven't yet managed to use mutually tail recursive functions to get the compiler to produce an output where the only poll is in caml_apply<N>. They either involve allocation or the compiler doesn't need to resort to a caml_apply. That's not to say it's not possible.

lthls commented 3 years ago

That said - I haven't yet managed to use mutually tail recursive functions to get the compiler to produce an output where the only poll is in caml_apply<N>. They either involve allocation or the compiler doesn't need to resort to a caml_apply. That's not to say it's not possible.

You could try something like:

let[@inline never][@local never] app f x y = f x y

let rec f x y = app g x y
and g x y = f x y

sadiqj commented 3 years ago

You could try something like:

Thanks, that would do it. Inspecting the Mach that code results in, the only poll encountered would be in caml_apply2.

gasche commented 3 years ago

More naive comments.

In my mental model, caml_apply<N> are low-level constructs that are not part of the natural operational semantics of OCaml programs, but inserted to enable a more efficient implementation of n-ary applications. As a thought experiment, it should be possible to take a program before caml_apply<N> are inserted, insert polls in it at this level, and then insert the caml_apply<N>. This suggests that there should always exist some poll-insertion strategies that never poll into those primitives. (I have no idea whether they are better than the current one).

One reason to avoid polling inside them would be that they are called by many different code paths, and only a fraction of them require polling; in a sense they are a control-flow bottleneck. So polling in them inevitably inserts redundant polls in many control-flow paths.

Another, simpler approach (if we wanted to avoid polling in them) would be to have two variants of the function, one that polls and one that does not. Then the poll-insertion code can rewrite their calls to whichever variant is appropriate in a given context.

Finally: before doing any of that, the question is whether there would be any performance benefit there. I don't have any intuition about that (that was my question). But it would be possible to test by just removing any polls inside those functions, and running benchmarks. (This of course may break the regular-polling guarantees, that version would be wrong but it gives an upper bound on potential gains going in this direction.)

sadiqj commented 3 years ago

More naive comments.

I think there are ways we could avoid putting polls in caml_apply<N> and caml_curry<N> but I think they'd come at the cost of significantly more invasive changes (e.g polling analyses appearing at multiple compiler layers) or costly analyses at Mach.

For the latter case, looking at @lthls 's example, I can't see how we'd substitute a non-polling caml_apply<N> without some kind of interprocedural analysis if only operating at the Mach layer.

Finally: before doing any of that, the question is whether there would be any performance benefit there. I don't have any intuition about that (that was my question). But it would be possible to test by just removing any polls inside those functions, and running benchmarks. (This of course may break the regular-polling guarantees, that version would be wrong but it gives an upper bound on potential gains going in this direction.)

I think something worth pointing out here is that we're already looking at a performance impact that's essentially within measurement noise on AMD64. It's unclear whether introducing more complexity in the polling implementation to reduce instruction count is going to be a good-tradeoff.

lthls commented 3 years ago

To complement a bit @sadiqj's answer: in native mode, caml_apply<N> functions are performance black holes due to the indirect function calls it performs. I'm confident that adding a polling point there will not degrade performance in any measurable way.

gasche commented 3 years ago

Thanks both for your comments. I wondered about the effect of polls in caml_apply due to this part-of-explanation of Sadiq:

The main difference between polling and trunk is that caml_apply3 gets a poll. It spends ~95% of it's time in a very hot path (dot -> fold_left2 -> caml_apply3).

You both agree that the outlier-ness of this benchmark is not a sign that we should not poll in caml_apply. So, the second natural idea is: could you avoid caml_apply3 at all in this benchmark, and see whether it remains an outlier? If I understand correctly, the only necessary change to qr_decomposition.ml would be to replace

let dot = fold_left2 (fun acc xi yi -> acc +. xi *. yi) 0.0

with

let dot x y = fold_left2 (fun acc xi yi -> acc +. xi *. yi) 0.0 x y

lthls commented 3 years ago

In this particular case, the caml_apply3 call comes from the code of fold_left2, so I don't think eta-expanding dot will help (though it's obviously a good idea). I think you could remove the caml_apply3 with flambda (with the right parameters), but otherwise your only solution is to manually specialise the fold_left2 function to its argument.

stedolan commented 3 years ago

A simple strategy that might help here:

Do not insert polls in caml_applyN
Insert a poll in non-allocating functions that end in a tail call which is indirect, a forward/recursive reference, or to caml_applyN

This removes a poll from caml_applyN at the cost of inserting one in some callers. However, callers which call caml_applyN in non-tail position won't have to poll.

The justification for this: we make a total order over functions, and the idea is to insert polls on tail calls that are forward references in this ordering, to ensure that a cycle of tail calls includes at least one poll. In the current version, that means putting polls on recursive tail calls (which are forward references), indirect tail calls (which might be forward references), but not direct tail calls to previously-defined functions (which are backward references).

The change I'm suggesting here is to consider caml_applyN to be defined at the very end of this ordering. This has two consequences: (1) tail calls to caml_applyN are always forward references needing polls, and (2) indirect tail calls within caml_applyN are backward references not needing polls, since there's nothing ahead of them they could call.

(I am assuming here that we can't end up with a cycle of tail caml_applyN calls with no intervening user code. I can't see how to cause such a thing, but maybe I'm being insufficiently evil)

lthls commented 3 years ago

(I am assuming here that we can't end up with a cycle of tail caml_applyN calls with no intervening user code. I can't see how to cause such a thing, but maybe I'm being insufficiently evil)

I am sufficiently evil:

type 'a t = T of ('a t -> 'a -> 'a) [@@unboxed]
let f (T g) x = g (T g) x
let _ = f (T f) 0

lthls commented 3 years ago

Wait, I think that's actually wrong: f contains an indirect call, so will have a poll point. Please forget about this example (but I'll keep looking though).

xavierleroy commented 3 years ago

I took the liberty of removing the "Draft" mark from this PR, which many people interpret as "not ready for review".

This PR is high on the roadmap for OCaml 4.13 and very much in need of reviews. Let's get started!

sadiqj commented 3 years ago

I took the liberty of removing the "Draft" mark from this PR, which many people interpret as "not ready for review".

This PR is high on the roadmap for OCaml 4.13 and very much in need of reviews. Let's get started!

Thanks @xavierleroy

The most impactful area for review at this point are the changes to the Mach IR, as changes to those at a later stage will create a lot of work on the architectural ports. Specifically we add an Ipollcall op and an Ipolltest test.

Ipolltest checks the young limit and can be used in an Iifthenelse with an Ipollcall on the then arm. This is placed at the end of loops and (in concert with the linearise pass) results in the compact sequence mentioned earlier on in the PR.

Ipollcall can also be used on it's own and this is what is used for function prologue polls.

There are some small changes in linearize to tighten up the jumps we emit at the end of loops. as well.

So I think the questions at the moment that block the rest of the ports are: is there consensus on the changes to Mach and do we want safepoints on 32-bit platforms?

sadiqj commented 3 years ago

Thanks @damiendoligez for taking a detailed look at this.

You're right, the path_polls and requires_prologue_poll functions have issues. They've been iterated on several times since this PR landed but need an overhaul. I would also love to have a better way of thoroughly testing them.

I'm going to go through your comments on files individually, the ARM64 comparison issue is a nice spot though.

Will also follow up with answers to your questions on poll placement for tail calls and allocating handlers.

sadiqj commented 3 years ago

For Itailcall: Did you consider using the "poll just before the edge" strategy?

No, I hadn't considered it and there is definitely an argument for treating tail calls in a similar manner to loop back edges.

Having had a ponder today - I think this works within our current placement scheme and would mean fewer executed polls in tail recursive functions that have a non-allocating base case. It complicates the rules for placement slightly but in practical terms makes no difference for reasoning about atomicity (the function you are making the call to might have a prologue poll anyway).

The downside though is when we have functions with multiple paths that result in tail calls, we end up increasing code size as we need polls on each one. I think with the performance numbers showing the impact on execution times showing the impact from polls is minimal, we should aim for strategies that minimise code size.

For Icatch: How about suppressing the poll when all the paths leading to the Iexit allocate, as well as when the handler itself allocates?

I don't think I understand this case, as if the handler unconditionally allocates at the moment the poll should be elided. Would it be possible to provide a code example so I can see where this would kick it? I'm all for eliding more polls when we safely can.

damiendoligez commented 3 years ago

I don't think I understand this case, as if the handler unconditionally allocates at the moment the poll should be elided. Would it be possible to provide a code example so I can see where this would kick it? I'm all for eliding more polls when we safely can.

For an example, you'll need a recursive Icatch with two handlers. (I don't know in what circumstances the compiler generates those, so I can't give source code.)

Handler 1 just does Iexit 2
Handler 2 does allocate; Iexit 1

You don't insert a poll before Iexit 2 because its handler allocates unconditionally. If I'm not mistaken the current version of the code would insert a poll before Iexit 1 (because handler 1 doesn't allocate) although it's not needed.

mshinwell commented 3 years ago

At present I don't think the middle-ends ever generate recursive catches with more than one handler. As far as I recall the support was added because it was expected to be needed for Flambda, which it never was (although Flambda 2 will exploit it).

damiendoligez commented 3 years ago

I've made a pull request on this pull request (https://github.com/sadiqj/ocaml/pull/3), currently running some benchmarks to see if it improves the number of poll insertions.

damiendoligez commented 3 years ago

The code size increase is almost exactly the same. I'll try to remove the polls on raise to see if it makes a big difference.

damiendoligez commented 3 years ago

Here are the results without poll-on-raise. Much better as far as code size is concerned. Brown is with poll-on-raise, blue is without. The baseline is without any polling. safepoints-poll-on-raise

sadiqj commented 3 years ago

Here are the results without poll-on-raise. Much better as far as code size is concerned. Brown is with poll-on-raise, blue is without. The baseline is without any polling.

Thanks for running these numbers @damiendoligez . Just to clarify, both brown and blue are with your safepoints branch but blue does not add a poll on raise?

If so, I think strictly for poll-on-raise Iraise in path_approx needs to be Alloc. I would be surprised if this made a massive difference to density though, it would only help in situations where you were unconditionally raising.

damiendoligez commented 3 years ago

Thanks for running these numbers @damiendoligez . Just to clarify, both brown and blue are with your safepoints branch but blue does not add a poll on raise?

In fact brown is your branch and blue is the latest version of my branch.

If so, I think strictly for poll-on-raise Iraise in path_approx needs to be Alloc. I would be surprised if this made a massive difference to density though, it would only help in situations where you were unconditionally raising.

I was doing that in my first version (the one that gives almost identical results with your branch).

mshinwell commented 3 years ago

I am now reviewing the other parts of this patch (everything except Polling, although I'll look at that too in due course).

damiendoligez commented 3 years ago

I've got a segfault in one of the test files. The problem, as far as I understand it:

We compute fun_contains_calls in selection/selectgen
We insert polls way later (after comballoc, CSE, deadcode)

The test function is just an empty loop, so its fun_contains_calls field is false and it doesn't get a stack alignment prologue. But it gets a poll inserted, which does call caml_call_gc and eventually crashes (when llvm, being clever, uses SSE/AVX instructions to do FP computations).

sadiqj commented 3 years ago

The test function is just an empty loop, so its fun_contains_calls field is false and it doesn't get a stack alignment prologue.

Hrm, that's not good.

We could set fun_contains_calls to true if we ended up inserting a poll in a function but that means for any analysis between selectgen and polling, decisions could be made on that field being false.

Alternatively we could align the stack in the emitter based on fun_contains_calls or whether there was a poll in the IR.

Neither of those seem like great choices though.

xavierleroy commented 3 years ago

My impression is that computing fun_contains_calls early, during instruction selection, is a historical accident: it looks like fun_contains_calls is used only in the code emitters, at the very end of the back-end, so we could compute it late, e.g. during linearization.

ocaml / ocaml

Safepoints #10039

Why safepoint polls are necessary

How this PR implements safepoints

Optimisations

Polling at loop back edges

Alternative approaches

Feeley’s algorithm (non-mimimal)

Instrumentation at Lambda

Using OS signals

Performance

Why concrete performance numbers are hard

AMD Zen2

Intel Skylake

Code size impact

Eliding and leaf function optimisations

Things to consider

Disabling polling (via an annotation or otherwise)

Possible future optimisations

Interprocedural optimisations

Non-noalloc calls

How to read this PR

Feeley’s algorithm (non-mimimal)

Benchmarks

Runtime

Code size

Next steps