Avoid per-iteration read barriers in worker loop

typelevel / cats-effect

The pure asynchronous runtime for Scala

https://typelevel.org/cats-effect/

Apache License 2.0

2k stars 513 forks source link

Avoid per-iteration read barriers in worker loop #3544

Open armanbilge opened 1 year ago

armanbilge commented 1 year ago

Spinning out of https://github.com/typelevel/cats-effect/pull/3499#discussion_r1170622173. These are the ones I'm aware of:

https://github.com/typelevel/cats-effect/blob/ab5d56c6a6cb2ef7484e5ecb881ecf6b85ff61dc/core/jvm/src/main/scala/cats/effect/unsafe/WorkerThread.scala#L419

https://github.com/typelevel/cats-effect/blob/ab5d56c6a6cb2ef7484e5ecb881ecf6b85ff61dc/core/jvm/src/main/scala/cats/effect/unsafe/WorkerThread.scala#L689-L694

djspiewak commented 1 year ago

Just to preserve some context…

There are two things here that are both independently challenging. The first thing that is challenging is the done.get() call, which is hard to fake around. We use this to detect when the pool is shutting down, so we should murderize the worker threads. In this case, the best we can do might end up being something a bit like what we do for canceled in the fiber.

The second thing here is the sleepers polling. This has to be a memory barrier in the present encoding because we allow theft of timers by other workers, meaning that we need to be atomic with respect to that theft. This is a little different than how we handle it in the work queue, because the work queue is stolen from one end while we dequeue from the other, which in turn means that we can play games with the memory barriers. When we poll and steal from the same side, there's just not a lot that can be done: we have to cross a barrier to ensure atomicity.

armanbilge commented 1 year ago

If I may suggest the dumb thing: why don't we just check both at a lesser granularity? e.g. every 64 iterations, instead of every iteration.

djspiewak commented 1 year ago

If I may suggest the dumb thing: why don't we just check both at a lesser granularity? e.g. every 64 iterations, instead of every iteration.

This is definitely where we should start. It has some slightly subtle implications for done.get() but nothing that should be problematic, I think.

armanbilge commented 1 year ago

When we poll and steal from the same side, there's just not a lot that can be done: we have to cross a barrier to ensure atomicity.

Here's an idea: we don't have to cross a barrier to be sure if no timers have expired. So long as the timers queue is only written by its current thread, then the time of the next expiration can be treated as a thread local. If there was possibly an expiration, then we can cross the read barrier to find out. On second thought, in practice this might not be useful, if this thread local doesn't get updated for canceled timers (and also stolen timers, but I assume that would be more rare), so it would end up crossing the read barrier all the time anyway ... 🤔

Btw, this also raises the question: do we really want to do System.nanoTime() on every iteration? I thought this was a non-trivial op as well.

djspiewak commented 1 year ago

Here's an idea: we don't have to cross a barrier to be sure if no timers have expired. So long as the timers queue is only written by its current thread, then the time of the next expiration can be treated as a thread local.

This is super smart. Pair this with another assumption (which holds!): only the local thread can decrease the next expiration. When timers are stolen, it just means that the expiration increases, meaning that we would hit the threshold, spuriously check the queue (crossing a memory barrier), discover that we were wrong to wake up and then update the local threshold again.

armanbilge commented 1 year ago

In https://github.com/typelevel/cats-effect/issues/3445#issuecomment-1586367793 I describe how we can avoid a read-barrier related to cedeBypass.