mpoeter / xenium

A C++ library providing various concurrent data structures and reclamation schemes.
https://mpoeter.github.io/xenium/
MIT License

curious atomics question / potential small inefficiency in ramalhete_queue #14

Closed · goldslope closed this issue 4 years ago

goldslope commented 4 years ago
    auto value = h->entries[idx].value.load(std::memory_order_relaxed);
    if constexpr(pop_retries > 0) {
      unsigned cnt = 0;
      ramalhete_queue::backoff retry_backoff;
      while (value == nullptr && ++cnt <= pop_retries) {
        value = h->entries[idx].value.load(std::memory_order_relaxed);
        retry_backoff(); // TODO - use a backoff type that can be configured separately
      }
    }

    if (value != nullptr) {
      // (14) - this acquire-load synchronizes-with the release-CAS (8)
      h->entries[idx].value.load(std::memory_order_acquire);
      traits::store(result, value.get());
      return true;
    } 

^ I noticed this pattern in the ramalhete_queue. I'm curious why you don't use std::memory_order_acquire in the loop, and instead reload the value with acquire after the loop is over.

Is there a potential system-wide cost (or maybe a cache-line cost) to spamming load(std::memory_order_acquire)? That would make sense if this were the case.

I was under the impression that std::memory_order_acquire only affects the current thread (i.e. it prevents speculative reads), and in a tight spin loop that may not even be a bad thing: it may throttle the thread in a beneficial manner, which can help with hyperthreading or power consumption.

The other, more obvious benefit is that you wouldn't have to load the value twice in your fast path. It's probably not a big deal since L1 cache on most systems is super fast, but still, I'm not sure I see the point of the extra load.
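
To make that concrete, here's roughly what I had in mind (just a sketch reusing your names, not a patch): do the acquire-load inside the loop, so the successful path only touches the entry once.

    auto value = h->entries[idx].value.load(std::memory_order_acquire);
    if constexpr (pop_retries > 0) {
      unsigned cnt = 0;
      ramalhete_queue::backoff retry_backoff;
      while (value == nullptr && ++cnt <= pop_retries) {
        // acquire directly in the loop - no second load needed on success
        value = h->entries[idx].value.load(std::memory_order_acquire);
        retry_backoff();
      }
    }

    if (value != nullptr) {
      traits::store(result, value.get());
      return true;
    }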

Forgive me if this sounds nit-picky; I'm asking this question more to learn than to hyper-optimize your queue. I am working on my own queue, so that's why I am curious. Thanks for your help!

mpoeter commented 4 years ago

This is actually a really good question, so no need to ask for forgiveness! :-)

The x86 memory model is very strong - an aligned mov instruction is not only atomic, but provides (stronger than) memory_order_acquire semantics. However, on architectures with weaker memory models like ARM and POWER, an acquire-load is more expensive than a relaxed one.
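
Concretely, with typical codegen (worth verifying on a compiler explorer), the difference only shows up on the weaker architectures:

    // x86-64:  both loads compile to a plain mov
    // AArch64: relaxed compiles to ldr, acquire to the more expensive ldar
    auto a = x.load(std::memory_order_relaxed);
    auto b = x.load(std::memory_order_acquire);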

So for x86 the additional load would be unnecessary, while on other architectures it might make sense to pay the price for the acquire-load just once. But I did not want to introduce such architecture-specific micro-optimizations.

Unfortunately I do not have access to ARM or POWER machines to measure the performance difference between a relaxed-load and an acquire-load, so I am not sure if this "optimization" really gains anything, but I figured (just like you) that an additional load that hits the L1 cache should be fine.
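
For completeness: a standalone acquire fence is another way to pay the acquire price only once after a relaxed spin loop. Just a sketch of the general pattern, with a made-up spin_pop helper, not the queue's actual code:

    #include <atomic>

    // hypothetical helper, not part of xenium: spin with cheap relaxed
    // loads, then upgrade the final load with a single acquire fence
    template <class T>
    T* spin_pop(std::atomic<T*>& slot) {
      T* value = slot.load(std::memory_order_relaxed);
      while (value == nullptr)
        value = slot.load(std::memory_order_relaxed);
      // the fence makes the last relaxed load synchronize-with the
      // matching release operation, just like an acquire-load would
      std::atomic_thread_fence(std::memory_order_acquire);
      return value;
    }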

goldslope commented 4 years ago

Awesome! Thank you for your in-depth response. This library looks super good btw. I especially love the numbered synchronization comments for each barrier. I'm probably going to shamelessly copy that commenting style for the queue that I'm writing. Anyway, I'll go ahead and close the issue. Thanks again!

mpoeter commented 4 years ago

Thank you for the nice feedback! I came up with this numbering scheme because in my thesis I needed to provide more formalized correctness arguments, and for that I had to reference the different operations that participate in the happens-before relations. It proved to be quite helpful in general, so I stuck with it. :-)
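
In case anyone stumbles over this later, the idea in a nutshell (a made-up minimal example, not code from the library): each operation participating in a happens-before edge gets a number, and each comment names its counterpart.

    #include <atomic>

    std::atomic<int*> slot{nullptr};

    void producer(int* p) {
      // (1) - this release-store synchronizes-with the acquire-load (2)
      slot.store(p, std::memory_order_release);
    }

    int* consumer() {
      // (2) - this acquire-load synchronizes-with the release-store (1)
      return slot.load(std::memory_order_acquire);
    }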

Let me know once your queue is finished; I could take a look at it if you like.

goldslope commented 4 years ago

Great! I might take you up on that. In case you are curious, it's here. It is algorithmically complete and currently works in a single thread (I haven't tested anything else yet). I still have a few enhancements to make, but all the happens-before and RMW operations are in place. I would, however, wait until I annotate the comments like you did, to save you time reading through it. Thanks so much.