sccn / liblsl

C++ lsl library for multi-modal time-synched data transmission over the local network

Rework alignment in consumer_queue #159

Closed tstenner closed 2 years ago

tstenner commented 2 years ago

alignas(64) wasn't supported at all for dynamically allocated memory (<C++17), leading to compiler warnings (#142) and undefined behavior according to the Clang UBSan.

Instead, reorder the fields and insert additional padding (if needed) between fields that shouldn't share a cacheline.
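A minimal sketch of the idea (field names follow this discussion; the actual liblsl layout may differ): explicit `char` padding between the hot fields keeps them at least one cache line apart without relying on `alignas(64)`.

```cpp
#include <atomic>
#include <cstddef>

// Illustrative sketch, not the actual liblsl struct: explicit padding keeps
// the producer- and consumer-side indices on separate cache lines.
constexpr std::size_t CACHELINE_BYTES = 64;

struct queue_sketch {
	std::atomic<std::size_t> read_idx_;
	char pad0_[CACHELINE_BYTES - sizeof(std::atomic<std::size_t>)];
	std::atomic<std::size_t> write_idx_;
	char pad1_[CACHELINE_BYTES - sizeof(std::atomic<std::size_t>)];
	std::atomic<bool> done_sync_;
};

// The members are at least one full cache line apart, no matter where the
// object itself is allocated.
static_assert(offsetof(queue_sketch, write_idx_) >= CACHELINE_BYTES,
	"read_idx_ and write_idx_ must not share a cache line");
```

The point is that the *distance* between the fields is guaranteed by the padding; the absolute addresses don't have to be multiples of 64 for this to prevent false sharing.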

cboulay commented 2 years ago

I'm pretty new to dealing with alignment issues.

My understanding of the original problem is as follows:

This PR gets rid of the alignas and replaces it with manual padding. This keeps read_idx_, write_idx_, AND done_sync_ in different cachelines (probably?), but it doesn't do any actual alignment. They might be far from a boundary. Or, in the worst case, they might span a boundary and read_idx_'s later bits might share a 64-byte block with the early bits of write_idx_. Is this actually possible? I admit complete ignorance here. If it's true that they can span a boundary, then is it likely that this change will be slower than the original?

An alternative might be to create the consumer_queue as a global or on the stack instead of on the heap, because global and stack allocations respect alignment. I think this is impossible, though. For example, data_receiver has a member consumer_queue sample_queue_, data_receiver is a member of stream_inlet_impl, and stream_inlet_impl is part of the API. We can't dictate that users never allocate an inlet on the heap!

Another alternative is to override new and delete to do memory-aligned allocation and deallocation. This seems to be what is suggested most often on SO. @tstenner , what do you think about that option?

tstenner commented 2 years ago

This PR gets rid of the alignas and replaces it with manual padding. This keeps read_idx_, write_idx_, AND done_sync_ in different cachelines (probably?), but it doesn't do any actual alignment.

Correct. They are still aligned to the types' natural alignment (4 or 8 bytes), but beyond that no further alignment is necessary. The consumer_queue objects themselves are (most probably) aligned to 16 bytes, so a member might start at one of four positions within a cache line. CPUs fetch/invalidate the entire cache line regardless of where the variable sits within it, so its position inside a single cache line doesn't matter.

When the offset between two of the three variables is larger than 64 bytes, they might end up more than one cache line apart, but all in all the consumer_queue object fits in fewer cache lines.

Or, in the worst case, they might span a boundary and read_idx_'s later bits might share a 64-byte block with the early bits of write_idx_. Is this actually possible? I admit complete ignorance here.

They are still aligned to 4 / 8 bytes (otherwise we'd need something like #pragma pack(1)), so they can't be split across cache line boundaries.
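A quick arithmetic check of this claim (illustrative helper, not liblsl code): a value whose address is a multiple of its own size cannot cross a 64-byte boundary, because 64 is itself a multiple of that 4- or 8-byte alignment.

```cpp
#include <cstddef>
#include <cstdint>

// Returns true iff a value of the given size at the given address crosses a
// cache-line boundary, i.e. its first and last byte land in different lines.
inline bool straddles_line(std::uintptr_t addr, std::size_t size,
                           std::size_t line = 64) {
	return addr / line != (addr + size - 1) / line;
}
```

Only an address that is *not* naturally aligned (e.g. under `#pragma pack(1)`) can make this return true for a 4- or 8-byte value.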

[..] consumer_queue as a global or on the stack instead of the heap [..]. I think this is impossible.

Almost 100% correct. It's possible to reserve sizeof(consumer_queue) + CACHELINE_BYTES - 1 bytes, placement-new the object at an aligned offset within that block, and store the original pointer so a custom operator delete can free the memory region that was actually allocated. C++17 also makes it easier, but for VS 2015 it ranges from hard to impossible.
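The workaround could be sketched like this (hypothetical helper names, not liblsl code): over-allocate, round the pointer up to the next cache-line boundary, and stash the pointer that malloc returned just before the object so it can be freed later.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <new>

constexpr std::size_t CACHELINE_BYTES = 64;

// Construct a T at a 64-byte-aligned address inside an over-allocated block.
template <typename T, typename... Args>
T *aligned_create(Args &&...args) {
	// Worst-case alignment shift plus room for the saved original pointer.
	void *raw = std::malloc(sizeof(T) + CACHELINE_BYTES - 1 + sizeof(void *));
	if (!raw) throw std::bad_alloc();
	auto addr = reinterpret_cast<std::uintptr_t>(raw) + sizeof(void *);
	addr = (addr + CACHELINE_BYTES - 1) & ~(std::uintptr_t)(CACHELINE_BYTES - 1);
	void *aligned = reinterpret_cast<void *>(addr);
	static_cast<void **>(aligned)[-1] = raw; // remember what malloc returned
	return new (aligned) T(static_cast<Args &&>(args)...);
}

// Destroy the object and free the block malloc originally handed out.
template <typename T>
void aligned_destroy(T *obj) {
	obj->~T();
	std::free(static_cast<void **>(static_cast<void *>(obj))[-1]);
}
```

This compiles even on pre-C++17 toolchains, which is exactly why it was the traditional answer before aligned `operator new` existed.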

Another alternative is to override new and delete to do memory-aligned allocation and deallocation. This seems to be what is suggested most often on SO.

Without something like aligned_alloc, _mm_malloc, posix_memalign or _aligned_malloc it requires a workaround as described above. The C11 aligned_alloc isn't even supported by VS 2019 and std::align_val_t is C++17.
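For reference, the C++17 version of this is trivial (a sketch, assuming a compiler running in C++17 mode): a class-level alignas makes plain `new` route through the `std::align_val_t`-aware `operator new` automatically, which is exactly the part that's missing in VS 2015.

```cpp
#include <cstddef>
#include <cstdint>
#include <new>

// C++17 only: the over-aligned new-expression below calls
// operator new(std::size_t, std::align_val_t) under the hood,
// so heap allocations honor the requested 64-byte alignment.
struct alignas(64) cacheline_aligned_example {
	std::size_t read_idx_ = 0;
	std::size_t write_idx_ = 0;
};

static_assert(alignof(cacheline_aligned_example) == 64, "over-aligned type");
```

With this, `new cacheline_aligned_example` is guaranteed to return a 64-byte-aligned pointer, no custom allocation code required.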

The main reason aligned memory is important to some people is to fit an entire object into a cache line or for SIMD operations, but we need neither of those.

cboulay commented 2 years ago

They are still aligned to 4 / 8 bytes (otherwise we'd need something like #pragma pack(1)), so they can't be split across cache line boundaries.

OK that's the key thing I didn't know.

With this method we might use up to (64 - {whatever is in the last cacheline}) bytes more memory than if we did more careful alignment, but I think that's small enough to ignore.

The only question I have left is if padding should use n * sizeof(void *) or n * sizeof(size_t) (or maybe even sizeof(size_) + sizeof(wrap_at_) + ... in case the types ever change). I think it's dangerous to use sizeof(void *) if that is larger than sizeof(size_t) because that might mean that we don't pad enough, right?

tstenner commented 2 years ago

With this method we might use up to (64 - {whatever is in the last cacheline}) bytes more memory using this method than if we did more careful alignment, but I think that's small enough to ignore.

With alignas the padding would be inserted by the compiler, so it would insert at least 56 bytes between each element.
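That compiler-inserted padding can be made visible with a small illustration (hypothetical layout, not the actual liblsl struct): each alignas(64) member starts on its own 64-byte boundary, so at least 56 bytes of padding follow every 8-byte counter.

```cpp
#include <cstddef>

// Each member gets its own cache line; the compiler pads between them.
struct alignas_layout {
	alignas(64) std::size_t read_idx_;
	alignas(64) std::size_t write_idx_;
	alignas(64) std::size_t done_sync_;
};

static_assert(offsetof(alignas_layout, write_idx_) == 64, "second line");
static_assert(offsetof(alignas_layout, done_sync_) == 128, "third line");
static_assert(sizeof(alignas_layout) == 192, "three full cache lines");
```

So the alignas version always pays 64 bytes per member, whereas the hand-padded version only pads between the members it actually needs to separate.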

I think it's dangerous to use sizeof(void *) if that is larger than sizeof(size_t) because that might mean that we don't pad enough, right?

Dangerous only in the sense that insufficient padding would cost a handful of cycles to synchronize the cache lines, not in terms of correctness. I have fixed the types in the latest commit and adjusted for the padding between the existing members.

cboulay commented 2 years ago

OK. All approved by me. Please merge when you are ready.

tstenner commented 2 years ago

Let's make that when MSVC is ready…