IOUringEventLoop nextWakeupNanos

pveentjer commented 2 years ago

In the IOUringEventLoop the nextWakeupNanos is updated twice:

nextWakeupNanos.set(curDeadlineNanos);
...
if (nextWakeupNanos.get() == AWAKE || nextWakeupNanos.getAndSet(AWAKE) == AWAKE){  
   pendingWakeup = true;
}

The nextWakeupNanos is an AtomicLong and the set and getAndSet are not very cheap. On the X86 the trigger a memory fence which effectively causes the CPU to stop issuing loads till the store buffer has been drained. And this will not benefit performance.

Based on the code it seems that the nextWakeupNanos is not used by any other thread. Perhaps I'm missing something since I didn't study the code in-depth, but I think the nextWakeupNanos could be converted to a simple plain long.

THe IOUringSubmissionQueue already has the appropriate fences.

franz1981 commented 2 years ago

Hi @pveentjer : nextWakeupNanos is used by https://github.com/netty/netty-incubator-transport-io_uring/blob/98f7b6d716a2043367a2d6415629b13e88b06a73/transport-classes-io_uring/src/main/java/io/netty/incubator/channel/uring/IOUringEventLoop.java#L361 from outside the event loop (ie from another thread) if inEventLoop == false

I believe this could be changed in order to just use a plain get somewhere, but I believe using eventFd to wake up the EL is such costy that we rather prefer to pay the full barrier cost.

on https://www.scylladb.com/2018/02/15/memory-barriers-seastar-linux/ I see that there are ways to save a(n additional) full barrier there, so happy to review any PR to improve this

pveentjer commented 2 years ago

Ok. In that case we need something more powerful than a plain long. But I think we can get rid of the fences. So a get/set opaque should be sufficient since there are no memory ordering effects required.

pveentjer commented 2 years ago

So the following should be sufficient for the first set:

nextWakeupNanos.lazySet(curDeadlineNanos);

It will provide a [StoreStore][LoadStore] which on the X86 you get for free since all stores are release stores. So it is just a compiler restriction.

setOpaque would be more appropriate, but not available on Java 8.

Need to think about the getAndSet.

pveentjer commented 2 years ago

With Scylla the mem barrier is removed for task exchange between threads. Memory ordering matters in that case. But afaik for the nextWakeupNanos memory ordering is irrelevant. So I don't see a good reason why a price needs to be paid for memory ordering if it is irrelevant.

normanmaurer commented 2 years ago

@pveentjer PRs welcome

pveentjer commented 2 years ago

I would be more than happy to make a PR :) But before I do, I want to make sure my understanding is correct.

Assumption: the memory ordering effects of nextWakeupNanos are irrelevant. We just need to make sure that the compiler doesn't optimize out the load or store.

So is this assumption correct?

njhill commented 2 years ago

Hey thanks @pveentjer. The IOUringEventLoop logic was derived from EpollEventLoop, and in an earlier iteration of that I did actually used lazySet here following similar reasoning to yours.

However I changed it to set after realizing that ordering is in fact important relative to the task queue submission. Otherwise I think a race / missed wake-up could be possible.

But it's possible I was wrong about this, I'm sure your understanding of the memory order modes is better than mine!

franz1981 commented 2 years ago

@njhill Oooops I didn't notice you pinged me on https://github.com/netty/netty/pull/9561#issuecomment-532926597 :( I will use this chance to help @pveentjer to find why/if it hasn't worked

njhill commented 2 years ago

@franz1981 np I think it was just for info and it was quite a long time ago now!

@pveentjer you also probably noticed we guard the getAndSet with a volatile read to avoid its cost in most cases. If you think an AtomicInteger would be significantly cheaper than AtomicLong we may be able to do find a way to use that instead.

netty / netty-incubator-transport-io_uring

IOUringEventLoop nextWakeupNanos #148