Closed: tstenner closed this 2 years ago
I'm pretty new to dealing with alignment issues.
My understanding of the original problem is as follows: `consumer_queue` wants certain members kept in different cachelines for more efficient access, e.g. the read index and write index are accessed in different threads. `alignas` was used for this, but heap allocations of `consumer_queue` don't respect that alignment.

This PR gets rid of the `alignas` and replaces it with manual padding. This keeps `read_idx_`, `write_idx_`, AND `done_sync_` in different cachelines (probably?), but it doesn't do any actual alignment. They might be far from a boundary. Or, in the worst case, they might span a boundary and `read_idx_`'s later bits might share a 64-byte block with the early bits of `write_idx_`. Is this actually possible? I admit complete ignorance here.

If it's true that they can span a boundary, then is it likely that this change will be slower than the original?
An alternative might be to create the `consumer_queue` as a global or on the stack instead of the heap, because globals and stack allocations respect alignment. I think this is impossible: for example, `consumer_queue` is used as `data_receiver`'s member `consumer_queue sample_queue_`, `data_receiver` is a member of `stream_inlet_impl`, and that is part of the API. We can't dictate that users never allocate an inlet on the heap!
Another alternative is to override `new` and `delete` to do memory-aligned allocation and deallocation. This seems to be what is suggested most often on SO. @tstenner, what do you think about that option?
> This PR gets rid of the `alignas` and replaces it with manual padding. This keeps `read_idx_`, `write_idx_`, AND `done_sync_` in different cachelines (probably?), but it doesn't do any actual alignment.
Correct. They are still aligned to the types' natural alignment (4 or 8 bytes), but other than that no further alignment is necessary. The `consumer_queue` objects themselves are (most probably) aligned to 16 bytes, so they might start at one of four positions within a cache line. CPUs fetch/invalidate the entire cache line regardless of where a variable sits within it, so it doesn't matter where in a single cache line it is. When the offset between two of the three variables is larger than 64 bytes, they might end up more than one cache line apart, but all in all the `consumer_queue` object fits in fewer cache lines.
> Or, in the worst case, they might span a boundary and `read_idx_`'s later bits might share a 64-byte block with the early bits of `write_idx_`. Is this actually possible? I admit complete ignorance here.
They are still aligned to 4 / 8 bytes (otherwise we'd need something like `#pragma pack(1)`), so they can't be split across cache line boundaries.
> [..] consumer_queue as a global or on the stack instead of the heap [..]. I think this is impossible.
Almost 100% correct. It's possible to reserve `sizeof(consumer_queue) + CACHELINE_BYTES - 1` bytes, placement-new the object at an aligned offset, and store that offset so a custom `operator delete` can free the memory region originally allocated. C++17 also makes it easier, but for VS 2015 it's hard to impossible.
> Another alternative is to override `new` and `delete` to do memory-aligned allocation and deallocation. This seems to be what is suggested most often on SO.
Without something like `aligned_alloc`, `_mm_malloc`, `posix_memalign` or `_aligned_malloc` it requires a workaround as described above. The C11 `aligned_alloc` isn't even supported by VS 2019, and `std::align_val_t` is C++17.
The main reason aligned memory is important to some people is to fit an entire object into a cache line or for SIMD operations, but we need neither of those.
> They are still aligned to 4 / 8 bytes (otherwise we'd need something like `#pragma pack(1)`), so they can't be split across cache line boundaries.
OK, that's the key thing I didn't know.

With this method we might use up to (64 - {whatever is in the last cacheline}) bytes more memory than if we did more careful alignment, but I think that's small enough to ignore.
The only question I have left is whether `padding` should use `n * sizeof(void *)` or `n * sizeof(size_t)` (or maybe even `sizeof(size_) + sizeof(wrap_at_) + ...` in case the types ever change). I think it's dangerous to use `sizeof(void *)` if that is larger than `sizeof(size_t)`, because that might mean we don't pad enough, right?
> With this method we might use up to (64 - {whatever is in the last cacheline}) bytes more memory than if we did more careful alignment, but I think that's small enough to ignore.
With `alignas` the padding would be inserted by the compiler, so it would insert at least 56 bytes between each element.
> I think it's dangerous to use `sizeof(void *)` if that is larger than `sizeof(size_t)` because that might mean that we don't pad enough, right?
Dangerous only in the sense that it would cost a handful of cycles to synchronize the cache lines. I have fixed the types in the latest commit and adjusted the padding between the existing members.
OK. All approved by me. Please merge when you are ready.
Let's make that when MSVC is ready…
`alignas(64)` wasn't supported at all for dynamically allocated memory (<C++17), leading to compiler warnings (#142) and, according to the Clang UBSan, undefined behavior.
Instead, reorder the fields and insert additional padding (if needed) between fields that shouldn't share a cacheline.