ocaml-multicore / ocaml-multicore

Multicore OCaml
Other
762 stars 68 forks source link

Better Safe Points #187

Closed kayceesrk closed 2 years ago

kayceesrk commented 6 years ago

Multicore OCaml requires that the mutator regularly polls for inter-domain interrupts. Trunk OCaml polls for interrupts already in order to check for pending signals. The runtime system notifies the presence of some pending interrupt by making the allocation fail, so that the next time the mutator performs an allocation, the allocation fails and the mutator checks for any pending interrupts. However, this alone is not sufficient to ensure timely handling of interrupts since the mutator might spin in a non-allocating loop. This can cause surprising behaviours on trunk MPR#7128 and MPR#3747, but is essential for multicore to not live-lock.

@dc-mak has implemented Feely's algorithm which implements efficient reliable polling. This needs to be upstreamed to fix MPR#7128 and MPR#3747 and incorporated into multicore branch. In order to cleanly implement @dc-mak's changes upstream, Addr value kind in cmm has to be eliminated.

stedolan commented 6 years ago

See also MPR#3747

kayceesrk commented 6 years ago

Great. Updated the original message.

kayceesrk commented 6 years ago

@stedolan There are some intricate OCaml code written such that it does not allocate in order to not trigger a GC. If we sprinkle safe points in the code, isn't there a risk of breaking such invariants? How do you guard against those?

bluddy commented 6 years ago

Perhaps a new annotation indicating an intent not to allocate? We could have a begin annotation and an end annotation. This would have the associated risk of live-locking.

dc-mak commented 4 years ago

For posterity/convenience for later Report is here Code (horribly organised) https://github.com/ocaml/ocaml/compare/trunk...dc-mak:trunk

kayceesrk commented 4 years ago

@anmolsahoo25 is now working to upstream this.

anmolsahoo25 commented 4 years ago

Some discussion points -

Summary of current patch

The current patch aims to insert polling points at regular intervals in the code, which will allow a guaranteed interrupt at fixed delays. Feely's paper details on where exactly to insert poll points and also, how to deal with loops, branches etc. Based on the maximum latency of interrupt required, the polling positions can be changed, to allow different performance behaviors.

Allocations/Function Calls v/s Inserted Polls

Currently the patch inserts safe-points after a fixed number of instructions. The presence of allocations in these instructions makes these insertions redundant, as each allocation point can be considered as a poll.

Also, from another issue -

except in the rare case of a loop that doesn't allocate and doesn't call any function and is compiled with ocamlopt.

Do function calls also check for pending interrupts? If yes, then can this point also be considered as a polling point?

Thus, at every Calloc block or function call, he counter for tracking number of instructions elapsed without a poll point should be reset.

Question? Are there any other points where we need to check that an allocation is happening? For e.g. for multicore, do we need to check at Cload and Cstore as well?

Prevent polling in address generation

Polling during address generation can cause a bad GC root in the live registers. Thus the way it is currently implemented is to check if the current computation results in a Addr value and then not insert a poll point there.

Question? Is this the best way to do this? Look at next point for more information.

Prevent polling in critical sections

Certain code must be written to prevent allocations. This also implies that there are no interrupts triggered in this sequence. To take care of this, we propose function annotations that declare that a particular function not be instrumented with a polling point. For safety, if the number of instructions in the function crosses the limit of latency, the polling point will be inserted after the function exits.

Can we use the same mechanism to tag the address generation instruction, instead of manually writing the case out?

Polling sequence

The current polling sequence is a zero-byte float array allocation. Can we add a construct dedicated for polling?

CMM->CMM or CMMgen

There is an open question if we should implement better safe points emission as part of the CMM-gen itself or keep it as a cmm->cmm pass. I will be updating the issue with more points and code sections.

mshinwell commented 4 years ago

A few thoughts:

anmolsahoo25 commented 4 years ago
    | Clet (id, exp1, exp2) ->
        (* Don't poll when an address is being computed, otherwise bad GC root (emit.mlp) *)
        begin match exp1 with
        | Cop(Cadda, _, _) -> (delta := !delta + 1; Clet (id, exp1, insert_poll exp2))
        | _ -> let exp1 = non_branch exp1 in Clet(id, exp1, insert_poll exp2)
        end

This is the part where it is being mentioned that a poll should not be added. I inferred this to mean that while a computation of an address is going on, a poll should not be inserted.

Any thoughts?

Gbury commented 4 years ago

About the cmm->cmm vs cmmgen part: as shown in https://github.com/ocaml/ocaml/pull/1660 (and https://github.com/ocaml/ocaml/blob/776fcc39194b84314520429fb82f0f51c35401bf/asmcomp/selectgen.ml#L599 ), the type of registers is not completely known in the cmm representation, which makes ensuring that no Addr value is live at polling points complicated. Combined with the fact that flambda2 will directly target cmm, it could be interesting to see whether inserting polling points could be done at a lower level than cmm ? While on that subject, I couldn't find any mention of how this interacts with the comballoc pass, which can move some allocations iirc, which could maybe mess with the instruction count between polling points ?

gadmm commented 4 years ago

No_poll blocks

For safety, if the number of instructions in the function crosses the limit of latency, the polling point will be inserted after the function exits.

The current predictable polling behaviour has a crucial property: the set of possible polling points is discrete inside blocks; there is no polling between blocks. This guarantees that successive non-polling blocks fuse together to form non-polling sections, and also extends the programming and reasoning about non-polling behaviour to the giving, receiving and returning of values. Inserting a polling point right at the end of no-poll blocks jeopardizes this property.

One way to recover the good behaviour is (IIUC) to restart at zero instruction count at the end of a no-poll block rather than polling right at the end of a no-poll block.

Examples: take the following variations on compare and set, currently atomic with systhreads:

let cas r x y =
  let res = !r = x in
  if res then r := y ;
  res
let cas_cont r x y g h =
  if !r = x then (r := y ; (* may poll *) g ())
  else (* may poll *) h ()
let cas_loop r x y f =
  while !r <> x do (* may poll *) f () done ;
  r := y
let cmp a b = (* ... non-polling series of comparisons ... *)

let cas_cmp r x y =
  let res = cmp !r x in
  if res then r := y ;
  res
let assign a b = (* ... non-polling series of assignment ... *)

let cas_assign r x y =
  let res = r = x in
  if res then assign r y ;
  res

With explicit no-poll blocks, you would write:

let cas r x y =
  let res = !r = x in
  if res then r := y ;
  res
  [@no_poll]
let cas_cont r x y g h =
  (* no polling between the two no-poll blocks *)
  if !r = x [@no_poll] then ((r := y [@no_poll]) ; (* may poll *) g ())
  else (* may poll *) h ()
let cas_loop r x y f =
  while !r <> x [@no_poll] do (* may poll *) f () done ;
  (r := y [@no_poll])
let cmp a b = (* ... non-polling series of comparisons ... *) [@no_poll]

let cas_cmp r x y =
  let res = cmp !r x in
  if res then r := y ;
  res
  [@no_poll]
let assign a b = (* ... non-polling series of assignment ... *) [@no_poll]

let cas_assign r x y =
  let res = r = x in
  if res then assign r y ;
  res
  [@no_poll]

This shows fusing of no_poll blocks, and non-polling when giving, receiving and returning values.

Since one has to think about backwards-compatibility, having a model which is close in spirit to the previous one is going to help transition (but there are other things to worry about regarding compatibility).

The finer granularity could also be useful for program transformations, for the programmer or if the polling points are introduced late in the compiler pipeline. Otherwise I fear that the block might act as a barrier to optimisations. For instance one might want at some point to be able to rewrite (if b then ... else ...)[@no_poll] into (if b [@no_poll] then ... [@no_poll] else ... [@no_poll]) to be able to replace b with its known value; this transformation is obviously false if you allow yourself to poll at the end of the no-poll blocks.

The nice thing is, IIUC you should have this good behaviour if you simply restart the instruction count at 0 at the end of a no_poll block.

I found further examples of critical sections when implementing something like Haskell's bracket (look for "critical section" in https://github.com/ocaml/ocaml/pull/8962/files). In particular I have critical sections of the following form (annotated):

try ...
with e -> ((set_mask () [@no_poll]) ; ...)
try ...
with e -> (unset_mask () ; raise e [@no_poll])

Here there should be no polling between the moment the exception is raised and set_mask is called, or unset_mask is called and the exception re-raised.

Polling after C calls

It is a good idea to add a polling point after every non-noalloc C call (but not after noalloc C calls). The reason is that C functions ~no longer~ do not poll implicitly on allocation.

Marshalling and polymorphic compare

Is somebody working on adding polling points inside polymorphic compare and marshalling (which currently do not poll, if I am not mistaken)? If not should we open a bug report?

Dedicated polling instruction

The current polling sequence is a zero-byte float array allocation. Can we add a construct dedicated for polling?

Ideally, polling should compile to a comparison of young_limit to the register caching the value of young_ptr. To achieve this, it is not necessary to have a dedicated construct for polling. AFAIU it is simpler to optimise the case n = 0 in the emitted asm for allocation.

Function calls

Do function calls also check for pending interrupts? If yes, then can this point also be considered as a polling point?

I did not see any indication of that and I believe not (I'd need to know if I am wrong though). Maybe the quote was referring to the fact that we do not know whether the callee itself allocates.

I will be updating the issue with more points and code sections.

Just be aware that we do not get e-mail notifications on edits, I originally missed your edit.

anmolsahoo25 commented 4 years ago

@gadmm @Gbury , many thanks for the insightful points!

Thanks for mentioning about the email notification, I will rather update a new comment, instead of editing.

avsm commented 3 years ago

Now being worked on upstream for OCaml 4.13: https://github.com/ocaml/ocaml/pull/10039

kayceesrk commented 2 years ago

Fixed by #573.