ocaml-multicore / eio

Effects-based direct-style IO for multicore OCaml

Keep pool of systhreads for blocking operations #658

Closed: SGrondin closed this 5 months ago

SGrondin commented 7 months ago

It works (on all backends!) and stress testing hints at it being free of race conditions. Addresses #607

This is still a work in progress, any and all suggestions and criticisms are welcome.

It has to keep a separate threadpool per domain, as described in the manual's "Interaction with systhreads" section, or else things quickly deadlock:

The systhreads created on a particular domain remain pinned to that domain. Only one systhread at a time is allowed to run OCaml code on a particular domain.
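For illustration only (a sketch of the general idea, not the PR's code), domain-local storage is one way to give each domain its own lazily created pool; Pool here is just a placeholder:

module Pool = struct
  (* Placeholder for the PR's actual pool of systhreads. *)
  type t = { mutable threads : Thread.t list }
  let create () = { threads = [] }
end

(* Each domain lazily creates and keeps its own pool, so the systhreads it
   spawns stay pinned to that domain, as the rule quoted above requires. *)
let pool_key : Pool.t Domain.DLS.key = Domain.DLS.new_key Pool.create

let current_pool () : Pool.t = Domain.DLS.get pool_key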

General thoughts:

SGrondin commented 6 months ago

I had to revert to doing reads and writes inside a single [@poll error] section because I would occasionally see extreme contention (100ms+) with retry-loop CAS code. This latest commit automatically scales down a pool that doesn't get enough usage. I had to implement Condition.timed_wait (using pthread_cond_timedwait) in order to get semaphores with timeouts. If timed semaphores stay in this PR until it's merged, then I'll open a PR to add all that to OCaml itself. As Xavier pointed out, it's a desired feature; they just need someone to take the time to do it.
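As a rough sketch of what that looks like (illustrative only; timed_wait stands in for the pthread_cond_timedwait-backed primitive, taking a relative timeout in seconds, and none of these names are the PR's actual API):

(* Assumed primitive (a C binding around pthread_cond_timedwait): wait on
   [cond] while holding [mutex] for at most [timeout] seconds, returning
   [false] if the wait timed out. The stub name is hypothetical. *)
external timed_wait : Condition.t -> Mutex.t -> float -> bool
  = "sketch_condition_timed_wait"

module Timed_semaphore = struct
  type t = { mutex : Mutex.t; cond : Condition.t; mutable count : int }

  let create count =
    { mutex = Mutex.create (); cond = Condition.create (); count }

  let release t =
    Mutex.lock t.mutex;
    t.count <- t.count + 1;
    Condition.signal t.cond;
    Mutex.unlock t.mutex

  (* Returns [true] if a unit was acquired, [false] if [timeout] elapsed. *)
  let acquire ~timeout t =
    let deadline = Unix.gettimeofday () +. timeout in
    Mutex.lock t.mutex;
    let rec loop () =
      if t.count > 0 then (t.count <- t.count - 1; true)
      else
        let remaining = deadline -. Unix.gettimeofday () in
        if remaining <= 0. then false
        else (ignore (timed_wait t.cond t.mutex remaining); loop ())
    in
    let acquired = loop () in
    Mutex.unlock t.mutex;
    acquired
end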

talex5 commented 6 months ago

I would occasionally see extreme contention (100ms+) with retry-loop CAS code.

What are you using to test this? We only use sys-threads to run system calls, which take a relatively long time, so I wouldn't expect to see much contention here.

SGrondin commented 6 months ago

@talex5 stress/stress_systhreads.ml, added in this PR.

It's meant to be a worst-case scenario: eio_posix (so no io_uring), 5 domains on a 4-core CPU, 100 fibers per domain, and every fiber simultaneously writing to the disk, with file contention as well. It also recreates the Executor_pool (and therefore its domains) every round, to try to create contention in each domain's systhread pool through the rapid scaling up and down.

All those syscalls starting nearly simultaneously caused extreme contention in the CAS loops, so I rewrote just those critical sections back to the original style and compared the two versions. The difference was visible at human scale! I even saw the CAS-loop version stall for close to half a second once. In comparison, the current code never fails to write and runs perfectly smoothly. The only downside of the unrolled loop is its visual ugliness; it's more predictable, more performant, and has less overhead.
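For context, the two styles being compared look roughly like this (an illustrative sketch, not the PR's code). The first spins on compare_and_set under contention; the second does the read-modify-write inside a single [@poll error] function, which the compiler only accepts if the body contains no poll points, so other systhreads on the same domain cannot preempt it mid-update (each pool is confined to one domain, so no cross-domain synchronization is needed):

(* Style 1: optimistic retry loop on an atomic. Every failed
   compare_and_set under contention is another trip around the loop. *)
let release_slot_cas (n_available : int Atomic.t) =
  let rec loop () =
    let n = Atomic.get n_available in
    if not (Atomic.compare_and_set n_available n (n + 1)) then loop ()
  in
  loop ()

(* Style 2: plain mutable state updated inside a [@poll error] body.
   No allocation and no polls, so the update is atomic with respect to
   the other systhreads of this domain. *)
type counters = { mutable n_available : int }

let[@poll error] release_slot (t : counters) =
  t.n_available <- t.n_available + 1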

As a side note, this PR led to my first PR to the ocaml repository! https://github.com/ocaml/ocaml/pull/12867 Please take a look if you're familiar with the POSIX, time, and timers area of the codebase.

polytypic commented 5 months ago

I'm going to do some benchmarking to try to get to the bottom of this.

👍 I'd also love to understand this phenomenon better.

My assumption is that when you use more domains than there are hardware threads, the OS has to switch between domains. What I assume then happens is that the mechanism OCaml uses internally to switch between systhreads doesn't notice this and switches between systhreads prematurely, before a systhread has really had a chance to run much at all, because its domain wasn't running (some other domain was running instead). This could indeed cause very long delays. However, this is just my speculation; I have not yet verified whether this is actually what the runtime would effectively do.

If my speculation is roughly correct, then, first of all, one really should not spawn more domains than there are hardware threads. Second, it might be useful to improve the internal ticking mechanism that OCaml uses to switch between systhreads on a domain so that it takes into account the possibility that the domain might not actually get CPU time during the tick interval.

SGrondin commented 5 months ago

What I assume then happens is that the mechanism that OCaml uses internally to switch between systhreads doesn't notice this and basically switches between systhreads prematurely before a systhread has really had a chance to run much at all, because the domain wasn't running (some other domain was running instead).

This is my theory as well.

talex5 commented 5 months ago

I started writing a benchmark and quickly discovered a strange delay joining domains with systhreads that was skewing the results: https://github.com/ocaml/ocaml/issues/12948

[chart: domain-join timings]

SGrondin commented 5 months ago

@talex5 I just pushed a commit that replaces both available: bool array; and n_available: int with a bitfield as per @art-w's suggestion in the dev meeting. It allowed me to remove that horrible unrolled loop. Bitfields introduce some bit twiddling complexity, but I think it's worth it overall.
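For illustration, the bitfield technique looks roughly like this (a sketch, not necessarily this commit's exact code): bit i set means slot i is free, x land (- x) isolates the lowest free bit so the critical section needs no loop, and turning that bit back into an index happens outside the [@poll error] section:

(* Sketch: free slots tracked as the bits of an int (bit [i] set = slot [i] free). *)
type pool = { mutable free : int }

let create ~slots = { free = (1 lsl slots) - 1 }

(* Claim the lowest free slot, returning its one-bit mask (0 if none).
   [x land (- x)] isolates the lowest set bit, so the body is loop- and
   allocation-free and satisfies [@poll error]. *)
let[@poll error] claim t =
  let bit = t.free land (- t.free) in
  t.free <- t.free lxor bit;
  bit

(* Return a previously claimed slot (given as its one-bit mask). *)
let[@poll error] release t bit = t.free <- t.free lor bit

(* Convert the claimed bit into a slot index, outside the critical section. *)
let index_of_bit bit =
  let rec go i b = if b land 1 = 1 then i else go (i + 1) (b lsr 1) in
  if bit = 0 then None else Some (go 0 bit)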

polytypic commented 5 months ago

Hmm... Note that within a [@poll error] function you can do quite a bit; it's only allocation that's off limits. So, if you just move the allocations outside of the [@poll error] function, you can, for example, manipulate a dynamic list of nodes. In other words, you can implement a wait-free, systhread-safe bounded stack/cache as a linked list:

module Wait_free_systhread_safe_bounded_stack : sig
  type 'a t

  val create : capacity:int -> 'a t
  val try_push : 'a t -> 'a -> bool
  val pop_opt : 'a t -> 'a option
end = struct
  type ('a, _) node =
    | Null : ('a, [> `Null ]) node
    | Node : { mutable next : 'a link; value : 'a } -> ('a, [> `Node ]) node

  and 'a link = Link : ('a, [< `Null | `Node ]) node -> 'a link [@@unboxed]

  type 'a t = { capacity : int; mutable count : int; mutable nodes : 'a link }

  let create ~capacity = { capacity; count = 0; nodes = Link Null }

  let[@poll error] try_push t (Node node_r as node : (_, [< `Node ]) node) =
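    (* Nothing in this body allocates (the node was allocated by the caller),
       so the compiler accepts [@poll error] and the update cannot be preempted
       by another systhread on this domain. *)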
    t.count < t.capacity
    && begin
         t.count <- t.count + 1;
         node_r.next <- t.nodes;
         t.nodes <- Link node;
         true
       end

  let try_push t value = try_push t (Node { next = Link Null; value })

  let[@poll error] pop_opt t =
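    (* Also allocation-free: the raw node is returned as-is, and the wrapper
       below builds the [option] outside the [@poll error] section. *)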
    match t.nodes with
    | Link Null -> Null
    | Link (Node node_r as node) ->
        t.count <- t.count - 1;
        t.nodes <- node_r.next;
        node

  let pop_opt t =
    match pop_opt t with Null -> None | Node node_r -> Some node_r.value
end

You should also be able to implement, e.g., a wait-free systhread-safe doubly-linked list this way (just move the allocations outside of the [@poll error] functions). This way a thread could remove its own mailbox. So, instead of a fixed-size array with an allocation bitfield, you could also use a dynamic data structure. But perhaps it would actually make sense to have a job queue instead of a set of mailboxes.
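A quick usage sketch of the stack module above (illustrative only):

let () =
  let open Wait_free_systhread_safe_bounded_stack in
  let stack = create ~capacity:2 in
  assert (try_push stack 1);
  assert (try_push stack 2);
  assert (not (try_push stack 3));  (* at capacity *)
  assert (pop_opt stack = Some 2);
  assert (pop_opt stack = Some 1);
  assert (pop_opt stack = None)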

talex5 commented 5 months ago

I rebased on the benchmark commit and collected some measurements in #679:

[chart: systhread-pool benchmark results]

The commits are (from left to right):

  1. 964ed27 is before the thread-pool
  2. 5b39011 (with the branch icon) is "Keep pool of systhreads for blocking operations"
  3. 0807692 is "Threadpool: Use safe-points instead of atomics"
  4. 006afcb is "Minor thread pool cleanups": the tip of this PR
  5. f7a6c85 is identical, to check if 4 was a fluke

So from this it looks to me like:

This seems very reasonable to me, but if you can modify the benchmark to show the latency problem you saw, or capture a trace with eio-trace record, that would be useful. Otherwise, I'll probably go back to the simpler version.

SGrondin commented 5 months ago

Closing in favor of #681

talex5 commented 5 months ago

BTW, I found out why this PR wouldn't build on my machine: https://github.com/ocaml/dune/issues/9979