atomic_queue

C++14 multiple-producer-multiple-consumer lock-free queues based on a circular buffer and std::atomic. Designed with the goal of minimizing the latency between one thread pushing an element into a queue and another thread popping it from the queue.

It has been developed, tested and benchmarked on Linux, but should support any C++14 platform that implements std::atomic. It is reported to be compatible with Windows, but the continuous integration hosted by GitHub is currently set up only for the x86_64 platform on Ubuntu-20.04 and Ubuntu-22.04. Pull requests to extend the continuous integration to other architectures and/or platforms are welcome.

Design Principles

When minimizing latency, a good design is not one where there is nothing left to add, but one where there is nothing left to remove, as these queues exemplify.

The main design principle these queues follow is minimalism, which results in such design choices as:

The impact of each of these small design choices on its own is barely measurable, but their total impact is much greater than the simple sum of the individual impacts, a super-scalar compounding, or synergy, effect. The synergy that emerges from combining these small design choices is what allows CPUs to perform at their peak capacity, least impeded.

These design choices are also limitations:

Ultra-low-latency applications need just that and nothing more. The minimalism pays off; see the throughput and latency benchmarks.

Available containers are:

These containers have corresponding AtomicQueueB, OptimistAtomicQueueB, AtomicQueueB2, OptimistAtomicQueueB2 versions where the buffer size is specified as an argument to the constructor.
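
For example, a minimal sketch of a runtime-capacity queue (hedged: the atomic_queue namespace and the exact push/pop member-function signatures are assumptions here; the capacity-as-constructor-argument behaviour is described above):

#include <atomic_queue/atomic_queue.h>

int main() {
    // The capacity is a constructor argument rather than a template argument.
    atomic_queue::AtomicQueueB<unsigned> q(65536); // hypothetical capacity of 65536 elements
    q.push(1u);                                    // 0 is reserved as the NIL marker, so push non-zero values
    return q.pop() == 1u ? 0 : 1;
}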

Totally ordered mode is supported. In this mode consumers receive messages in the same FIFO order the messages were posted. This mode is supported for the push and pop functions, but not for the try_ versions. On Intel x86 the totally ordered mode has zero cost, as of 2019.

Single-producer-single-consumer mode is supported. In this mode, no expensive atomic read-modify-write CPU instructions are necessary, only the cheapest atomic loads and stores. That improves queue throughput significantly.

Move-only queue element types are fully supported. For example, a queue of std::unique_ptr<T> elements would be AtomicQueue2B<std::unique_ptr<T>> or AtomicQueue2<std::unique_ptr<T>, CAPACITY>.
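
For instance, a minimal sketch of such a queue (hedged: the atomic_queue namespace and the exact push/pop signatures are assumptions):

#include <atomic_queue/atomic_queue.h>
#include <memory>

int main() {
    // A fixed-capacity queue of move-only elements; the capacity is a template argument.
    atomic_queue::AtomicQueue2<std::unique_ptr<int>, 1024> q;
    q.push(std::make_unique<int>(42)); // the element is moved into the queue
    std::unique_ptr<int> p = q.pop();  // and moved back out
    return *p == 42 ? 0 : 1;
}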

Role Models

Several other well-established and popular thread-safe containers are used for reference in the benchmarks:

Using the library

The containers provided are header-only class templates; no building or installing is necessary.

Install from GitHub

  1. Clone the project:
    git clone https://github.com/max0x7ba/atomic_queue.git
  2. Add atomic_queue/include directory (use full path) to the include paths of your build system.
  3. #include <atomic_queue/atomic_queue.h> in your C++ source.
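
After step 3, a minimal producer/consumer sketch of what the header provides (hedged: the atomic_queue namespace, the exact template parameters and the blocking behaviour of push/pop are assumptions based on the API section below):

#include <atomic_queue/atomic_queue.h>
#include <cstdio>
#include <thread>

int main() {
    // A fixed-capacity queue of atomic elements; 0 is reserved as the NIL marker.
    atomic_queue::AtomicQueue<unsigned, 1024> q;
    unsigned sum = 0;
    std::thread consumer([&] {
        for(unsigned i = 0; i != 100; ++i)
            sum += q.pop();   // busy-waits while the queue is empty (assumption)
    });
    for(unsigned i = 1; i <= 100; ++i)
        q.push(i);            // busy-waits while the queue is full (assumption)
    consumer.join();
    std::printf("sum: %u\n", sum); // expect 5050
}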

Install using vcpkg

vcpkg install atomic-queue

Install using conan

Follow the official tutorial on how to consume Conan packages. Details specific to this library are available in ConanCenter.

Benchmark build and run instructions

The containers provided are header-only class templates that require only #include <atomic_queue/atomic_queue.h>; no building or installing is necessary.

Building is necessary to run the tests and benchmarks.

git clone https://github.com/cameron314/concurrentqueue.git
git clone https://github.com/cameron314/readerwriterqueue.git
git clone https://github.com/mpoeter/xenium.git
git clone https://github.com/max0x7ba/atomic_queue.git
cd atomic_queue
make -r -j4 run_benchmarks

The benchmark also requires the Intel TBB library to be available. The Makefile assumes it is installed in /usr/local/include and /usr/local/lib; if it is installed elsewhere, you may need to modify cppflags.tbb and ldlibs.tbb in the Makefile.

API

The queue class templates provide the following member functions:

Atomic elements are those for which std::atomic<T>{T{}}.is_lock_free() returns true and, when C++17 features are available, std::atomic<T>::is_always_lock_free evaluates to true at compile time. In other words, the CPU can natively load, store and compare-and-exchange such elements atomically. On x86-64, such elements are all the standard C++ arithmetic and pointer types.

The queues for atomic elements reserve one value to serve as the empty-element marker NIL; its default value is 0. The NIL value must not be pushed into a queue, and there is an assert statement in the push functions to guard against that in debug-mode builds. Pushing a NIL element into a queue in release-mode builds results in undefined behaviour, such as deadlocks and/or lost queue elements.
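
For example, with the default NIL of 0 (hedged, illustrative sketch only):

atomic_queue::AtomicQueue<unsigned, 1024> q;
q.push(1u);    // fine: 1 is not the NIL marker
// q.push(0u); // wrong: 0 is the NIL marker; asserts in debug builds,
//             // undefined behaviour in release builds

If 0 is a legitimate value for your elements, a non-default NIL can presumably be chosen, or a queue for non-atomic elements (e.g. AtomicQueue2) can be used instead, since only the queues for atomic elements reserve a NIL value.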

Note that optimism is a choice of control flow for a queue modification operation, rather than a queue type. An optimist push is fastest when the queue is not full most of the time; an optimist pop is fastest when the queue is not empty most of the time. Optimist and non-optimist operations can be mixed with no restrictions. The OptimistAtomicQueues in the benchmarks use only optimist push and pop.
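
A hedged sketch of mixing the two control flows on the same queue (the bool-returning try_ signatures are an assumption):

atomic_queue::AtomicQueue<unsigned, 1024> q;

q.push(42u);              // optimist push: proceeds assuming the queue has room

unsigned value;
if(q.try_pop(value)) {    // non-blocking pop: checks for an element first
    // got an element
} else {
    // the queue was empty; do other work and retry later
}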

See example.cc for a usage example.

TODO: full API reference.

Memory order of non-atomic loads and stores

push and try_push operations synchronize with (as defined in std::memory_order) any subsequent pop or try_pop operation on the same queue object. This means that:
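
In particular, non-atomic data written before a push is visible to the thread that pops the corresponding element. A hedged sketch (illustrative only):

#include <atomic_queue/atomic_queue.h>
#include <cassert>
#include <thread>

struct Order {
    double price;
    unsigned quantity;
};

int main() {
    atomic_queue::AtomicQueue<Order*, 16> q; // pointers are atomic elements; the null pointer serves as NIL

    std::thread producer([&] {
        Order* o = new Order;
        o->price = 100.5;   // plain, non-atomic stores...
        o->quantity = 7;
        q.push(o);          // ...made visible to the consumer by this push
    });
    std::thread consumer([&] {
        Order* o = q.pop(); // synchronizes with the push above,
        assert(o->price == 100.5 && o->quantity == 7); // so these reads see the producer's stores
        delete o;
    });
    producer.join();
    consumer.join();
}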

Implementation Notes

Ring-buffer capacity

The available queues here use a ring-buffer array for storing elements. The capacity of the queue is fixed at compile time or construction time.

In a production multiple-producer-multiple-consumer scenario, the ring-buffer capacity should be set to the maximum expected queue size. When the ring-buffer gets full, it means that the consumers cannot consume elements fast enough. A fix for that is any of:

Using a power-of-2 ring-buffer array size allows a couple of important optimizations:

The containers use the unsigned type for the size and internal indexes. On the x86-64 platform unsigned is 32 bits wide, whereas size_t is 64 bits wide. 64-bit instructions require an extra instruction prefix byte, resulting in slightly more pressure on the CPU instruction cache and the front-end. Hence, 32-bit unsigned indexes are used to maximise performance. That limits the queue size to 4,294,967,295 elements, which seems to be a reasonable hard limit for many applications.
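
For illustration, a sketch of the index arithmetic that a power-of-2 capacity enables (generic code, not taken from the library):

constexpr unsigned SIZE = 1024; // power of 2
static_assert((SIZE & (SIZE - 1)) == 0, "SIZE must be a power of 2");

unsigned slot_of(unsigned index) {
    // The remainder becomes a single AND instruction instead of a division, and
    // 32-bit unsigned wrap-around is harmless because
    // (index mod 2^32) mod SIZE == index mod SIZE whenever SIZE divides 2^32.
    return index & (SIZE - 1);
}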

While the atomic queues can be used with any moveable element types (including std::unique_ptr), for best throughput and latency the queue elements should be cheap to copy and lock-free (e.g. unsigned or T*), so that push and pop operations complete fastest.

Lock-free guarantees

Conceptually, a push or pop operation does two atomic steps:

  1. Atomically and exclusively claims the queue slot index to store/load an element to/from. That is, producers increment the head index and consumers increment the tail index. Each slot is accessed by only one producer thread and one consumer thread.
  2. Atomically stores/loads the element into/from the slot. A producer storing into a slot changes its state to non-NIL; a consumer loading from a slot changes its state back to NIL. The slot acts as a spinlock for its one producer thread and one consumer thread.

These queues anticipate that a thread doing push or pop may complete step 1 and then be preempted before completing step 2.
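
A conceptual sketch of these two steps for a queue of atomic elements (illustrative only, not the library's actual implementation):

#include <atomic>

constexpr unsigned SIZE = 1024; // capacity
constexpr unsigned NIL = 0;     // empty-slot marker
std::atomic<unsigned> head{0}, tail{0};
std::atomic<unsigned> slots[SIZE]; // zero-initialized: all slots start as NIL

void push(unsigned value) {
    unsigned slot = head.fetch_add(1, std::memory_order_acq_rel) % SIZE;  // step 1: claim a slot
    unsigned expected = NIL;
    while(!slots[slot].compare_exchange_weak(expected, value,             // step 2: wait until the slot
                                             std::memory_order_release,   // is NIL, then store into it
                                             std::memory_order_relaxed))
        expected = NIL;
}

unsigned pop() {
    unsigned slot = tail.fetch_add(1, std::memory_order_acq_rel) % SIZE;  // step 1: claim a slot
    for(;;) {
        unsigned value = slots[slot].exchange(NIL, std::memory_order_acquire); // step 2: wait until the slot
        if(value != NIL)                                                        // is non-NIL, then empty it
            return value;
    }
}

If the thread that claimed a slot is preempted between the two steps, only the one peer thread that claimed the same slot spins on it; all other producers and consumers keep making progress on other slots.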

An algorithm is lock-free if there is guaranteed system-wide progress. These queues guarantee system-wide progress by the following properties:

Preemption

Thread preemption by the Linux task scheduler is something no user-space process should be able to affect or escape; otherwise, any and every malicious application would exploit it.

Still, there are a few things one can do to minimize preemption of one's mission-critical application threads:

People often propose limiting busy-waiting with a subsequent call to std::this_thread::yield()/sched_yield/pthread_yield. However, sched_yield is the wrong tool for locking because it doesn't communicate to the OS kernel what the thread is waiting for, so the OS thread scheduler can never schedule the calling thread to resume at the right time when the shared state has changed (unless there are no other threads that can run on this CPU core, in which case the caller resumes immediately). See the notes section of man sched_yield and a Linux kernel thread about sched_yield and spinlocks for more details.

On Linux, there is the mutex type PTHREAD_MUTEX_ADAPTIVE_NP, which busy-waits on a locked mutex for a number of iterations and then makes a blocking syscall into the kernel to deschedule the waiting thread. In the benchmarks it was the worst performer, and I couldn't find a way to make it perform better, which is why it is not included in the benchmarks.

On Intel CPUs one could use the 4 debug control registers to monitor the spinlock memory region for write access and wait on it using select (and its friends) or sigwait (see perf_event_open and uapi/linux/hw_breakpoint.h for more details). A spinlock waiter could suspend itself with select or sigwait until the spinlock state has been updated. But there are only 4 of these registers, so such a solution wouldn't scale.

Benchmarks

View throughput and latency benchmarks charts.

Methodology

There are a few OS behaviours that complicate benchmarking:

I only have access to a few x86-64 machines. If you have access to different hardware, feel free to submit the output file of scripts/run-benchmarks.sh and I will include your results in the benchmarks page.

Huge pages

When huge pages are available, the benchmarks use 1x1GB or 16x2MB huge pages for the queues to minimise TLB misses. To enable huge pages, do one of:

sudo hugeadm --pool-pages-min 1GB:1
sudo hugeadm --pool-pages-min 2MB:16

Alternatively, you may like to enable transparent hugepages in your system and use a hugepage-aware allocator, such as tcmalloc.

Real-time thread throttling

By default, the Linux scheduler throttles real-time threads from consuming 100% of CPU time, which is detrimental to benchmarking. Full details can be found in Real-Time group scheduling. To disable real-time thread throttling, do:

echo -1 | sudo tee /proc/sys/kernel/sched_rt_runtime_us >/dev/null

Throughput and scalability benchmark

N producer threads push 4-byte integers into one shared queue, and N consumer threads pop the integers from the queue. All producers post 1,000,000 messages in total. The total time to send and receive all the messages is measured. The benchmark is run from 1 producer and 1 consumer up to (total-number-of-cpus / 2) producers and consumers to measure the scalability of the different queues.

Ping-pong benchmark

One thread posts an integer to another thread through one queue and waits for a reply from another queue (2 queues in total). The benchmark measures the total time of 100,000 ping-pongs, best of 10 runs. Contention is minimal here (1 producer, 1 consumer, 1 element in the queue) in order to achieve and measure the lowest latency. The average round-trip time is reported.
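
A hedged sketch of the ping-pong scheme (illustrative only; the actual benchmark code lives in this repository):

#include <atomic_queue/atomic_queue.h>
#include <chrono>
#include <cstdio>
#include <thread>

int main() {
    constexpr unsigned N = 100000;
    atomic_queue::AtomicQueue<unsigned, 8> ping, pong; // 2 queues, at most 1 element in flight

    std::thread echo([&] {
        for(unsigned i = 0; i != N; ++i)
            pong.push(ping.pop()); // bounce each message straight back
    });

    auto start = std::chrono::steady_clock::now();
    for(unsigned i = 1; i <= N; ++i) {
        ping.push(i); // values start at 1 because 0 is the NIL marker
        pong.pop();
    }
    auto elapsed = std::chrono::steady_clock::now() - start;
    echo.join();

    std::printf("average round-trip: %.1f ns\n",
                std::chrono::duration<double, std::nano>(elapsed).count() / N);
}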

Contributing

Contributions are more than welcome. .editorconfig and .clang-format can be used to automatically match code formatting.

Reading material

Some books on the subject of multi-threaded programming I found instructive:


Copyright (c) 2019 Maxim Egorushkin. MIT License. See the full licence in file LICENSE.