Better Safe Points - Githubissues

kayceesrk commented 6 years ago

Multicore OCaml requires that the mutator regularly polls for inter-domain interrupts. Trunk OCaml polls for interrupts already in order to check for pending signals. The runtime system notifies the presence of some pending interrupt by making the allocation fail, so that the next time the mutator performs an allocation, the allocation fails and the mutator checks for any pending interrupts. However, this alone is not sufficient to ensure timely handling of interrupts since the mutator might spin in a non-allocating loop. This can cause surprising behaviours on trunk MPR#7128 and MPR#3747, but is essential for multicore to not live-lock.

@dc-mak has implemented Feely's algorithm which implements efficient reliable polling. This needs to be upstreamed to fix MPR#7128 and MPR#3747 and incorporated into multicore branch. In order to cleanly implement @dc-mak's changes upstream, Addr value kind in cmm has to be eliminated.

stedolan commented 6 years ago

Summary of current patch

The current patch aims to insert polling points at regular intervals in the code, which will allow a guaranteed interrupt at fixed delays. Feely's paper details on where exactly to insert poll points and also, how to deal with loops, branches etc. Based on the maximum latency of interrupt required, the polling positions can be changed, to allow different performance behaviors.

Allocations/Function Calls v/s Inserted Polls

Currently the patch inserts safe-points after a fixed number of instructions. The presence of allocations in these instructions makes these insertions redundant, as each allocation point can be considered as a poll.

Also, from another issue -

except in the rare case of a loop that doesn't allocate and doesn't call any function and is compiled with ocamlopt.

Do function calls also check for pending interrupts? If yes, then can this point also be considered as a polling point?

Thus, at every Calloc block or function call, he counter for tracking number of instructions elapsed without a poll point should be reset.

Question? Are there any other points where we need to check that an allocation is happening? For e.g. for multicore, do we need to check at Cload and Cstore as well?

Prevent polling in address generation

Polling during address generation can cause a bad GC root in the live registers. Thus the way it is currently implemented is to check if the current computation results in a Addr value and then not insert a poll point there.

Question? Is this the best way to do this? Look at next point for more information.

Prevent polling in critical sections

Certain code must be written to prevent allocations. This also implies that there are no interrupts triggered in this sequence. To take care of this, we propose function annotations that declare that a particular function not be instrumented with a polling point. For safety, if the number of instructions in the function crosses the limit of latency, the polling point will be inserted after the function exits.

Can we use the same mechanism to tag the address generation instruction, instead of manually writing the case out?

Polling sequence

The current polling sequence is a zero-byte float array allocation. Can we add a construct dedicated for polling?

CMM->CMM or CMMgen

There is an open question if we should implement better safe points emission as part of the CMM-gen itself or keep it as a cmm->cmm pass. I will be updating the issue with more points and code sections.

mshinwell commented 4 years ago

A few thoughts:

Adding the ability to denote a begin/end block as non-interruptible seems reasonable. I think this is better than a per-function annotation.
I'm not sure exactly what the current code does, but it doesn't seem sufficient to check if "the current computation results in an Addr value". It needs to make sure that no such values are live at the polling point.
Adding a dedicated polling Cmm operation seems sensible.
There is an existing upstream PR to split Cmmgen into two: a library + the transformation from Clambda to Cmm. The reason is that for Flambda 2.0 we are going to generate Cmm directly from Flambda (by using this library). As such, it would be good if the code for inserting the safe points could form part of that library, perhaps? It's probably not necessary to have a separate pass for inserting them.
A migration plan should be thought about, since this could silently break code in bad ways.

anmolsahoo25 commented 4 years ago

    | Clet (id, exp1, exp2) ->
        (* Don't poll when an address is being computed, otherwise bad GC root (emit.mlp) *)
        begin match exp1 with
        | Cop(Cadda, _, _) -> (delta := !delta + 1; Clet (id, exp1, insert_poll exp2))
        | _ -> let exp1 = non_branch exp1 in Clet(id, exp1, insert_poll exp2)
        end

This is the part where it is being mentioned that a poll should not be added. I inferred this to mean that while a computation of an address is going on, a poll should not be inserted.

Any thoughts?

Gbury commented 4 years ago

About the cmm->cmm vs cmmgen part: as shown in https://github.com/ocaml/ocaml/pull/1660 (and https://github.com/ocaml/ocaml/blob/776fcc39194b84314520429fb82f0f51c35401bf/asmcomp/selectgen.ml#L599 ), the type of registers is not completely known in the cmm representation, which makes ensuring that no Addr value is live at polling points complicated. Combined with the fact that flambda2 will directly target cmm, it could be interesting to see whether inserting polling points could be done at a lower level than cmm ? While on that subject, I couldn't find any mention of how this interacts with the comballoc pass, which can move some allocations iirc, which could maybe mess with the instruction count between polling points ?

gadmm commented 4 years ago

No_poll blocks

For safety, if the number of instructions in the function crosses the limit of latency, the polling point will be inserted after the function exits.

The current predictable polling behaviour has a crucial property: the set of possible polling points is discrete inside blocks; there is no polling between blocks. This guarantees that successive non-polling blocks fuse together to form non-polling sections, and also extends the programming and reasoning about non-polling behaviour to the giving, receiving and returning of values. Inserting a polling point right at the end of no-poll blocks jeopardizes this property.

One way to recover the good behaviour is (IIUC) to restart at zero instruction count at the end of a no-poll block rather than polling right at the end of a no-poll block.

Examples: take the following variations on compare and set, currently atomic with systhreads:

let cas r x y =
  let res = !r = x in
  if res then r := y ;
  res

let cas_cont r x y g h =
  if !r = x then (r := y ; (* may poll *) g ())
  else (* may poll *) h ()

let cas_loop r x y f =
  while !r <> x do (* may poll *) f () done ;
  r := y

let cmp a b = (* ... non-polling series of comparisons ... *)

let cas_cmp r x y =
  let res = cmp !r x in
  if res then r := y ;
  res

let assign a b = (* ... non-polling series of assignment ... *)

let cas_assign r x y =
  let res = r = x in
  if res then assign r y ;
  res

With explicit no-poll blocks, you would write:

let cas r x y =
  let res = !r = x in
  if res then r := y ;
  res
  [@no_poll]

let cas_cont r x y g h =
  (* no polling between the two no-poll blocks *)
  if !r = x [@no_poll] then ((r := y [@no_poll]) ; (* may poll *) g ())
  else (* may poll *) h ()

let cas_loop r x y f =
  while !r <> x [@no_poll] do (* may poll *) f () done ;
  (r := y [@no_poll])

let cmp a b = (* ... non-polling series of comparisons ... *) [@no_poll]

let cas_cmp r x y =
  let res = cmp !r x in
  if res then r := y ;
  res
  [@no_poll]

let assign a b = (* ... non-polling series of assignment ... *) [@no_poll]

let cas_assign r x y =
  let res = r = x in
  if res then assign r y ;
  res
  [@no_poll]

This shows fusing of no_poll blocks, and non-polling when giving, receiving and returning values.

Since one has to think about backwards-compatibility, having a model which is close in spirit to the previous one is going to help transition (but there are other things to worry about regarding compatibility).

The finer granularity could also be useful for program transformations, for the programmer or if the polling points are introduced late in the compiler pipeline. Otherwise I fear that the block might act as a barrier to optimisations. For instance one might want at some point to be able to rewrite (if b then ... else ...)[@no_poll] into (if b [@no_poll] then ... [@no_poll] else ... [@no_poll]) to be able to replace b with its known value; this transformation is obviously false if you allow yourself to poll at the end of the no-poll blocks.

The nice thing is, IIUC you should have this good behaviour if you simply restart the instruction count at 0 at the end of a no_poll block.

I found further examples of critical sections when implementing something like Haskell's bracket (look for "critical section" in https://github.com/ocaml/ocaml/pull/8962/files). In particular I have critical sections of the following form (annotated):

try ...
with e -> ((set_mask () [@no_poll]) ; ...)

try ...
with e -> (unset_mask () ; raise e [@no_poll])

Here there should be no polling between the moment the exception is raised and set_mask is called, or unset_mask is called and the exception re-raised.

Polling after C calls

It is a good idea to add a polling point after every non-noalloc C call (but not after noalloc C calls). The reason is that C functions ~no longer~ do not poll implicitly on allocation.

Marshalling and polymorphic compare

Is somebody working on adding polling points inside polymorphic compare and marshalling (which currently do not poll, if I am not mistaken)? If not should we open a bug report?

Dedicated polling instruction

The current polling sequence is a zero-byte float array allocation. Can we add a construct dedicated for polling?

Ideally, polling should compile to a comparison of young_limit to the register caching the value of young_ptr. To achieve this, it is not necessary to have a dedicated construct for polling. AFAIU it is simpler to optimise the case n = 0 in the emitted asm for allocation.

Function calls

Do function calls also check for pending interrupts? If yes, then can this point also be considered as a polling point?

I did not see any indication of that and I believe not (I'd need to know if I am wrong though). Maybe the quote was referring to the fact that we do not know whether the callee itself allocates.

I will be updating the issue with more points and code sections.

Just be aware that we do not get e-mail notifications on edits, I originally missed your edit.

anmolsahoo25 commented 4 years ago

@gadmm @Gbury , many thanks for the insightful points!

Thanks for mentioning about the email notification, I will rather update a new comment, instead of editing.

avsm commented 3 years ago

Now being worked on upstream for OCaml 4.13: https://github.com/ocaml/ocaml/pull/10039

kayceesrk commented 2 years ago

Fixed by #573.

ocaml-multicore / ocaml-multicore

Better Safe Points #187