Alignment set but not verified on Long

AustinSchuh commented 3 years ago

If I have a field foo_ulong:uint64, I see that results in the following code:

  void add_foo_ulong(uint64_t foo_ulong) {
    fbb_.AddElement<uint64_t>(Configuration::VT_FOO_ULONG, foo_ulong, 0);
  }

That calls PushElement, which ends up calling Align(sizeof(T));. In this case, that sets the min alignment to 8.

If I look in the Verify method, I see that VerifyField<uint64_t>(verifier, VT_FOO_ULONG) && is the line responsible for doing the verification. This boils down to return !field_offset || verifier.Verify<T>(data_, field_offset);

Nowhere in this path is alignment checked. I'm able to add 4 bytes to the beginning of my flatbuffer without the alignment check triggering.

Either the alignment shouldn't be set to 8, or the alignment should be checked for the field.

aardappel commented 3 years ago

Yes, I don't think the verifier currently checks alignment at all.

FlatBuffers has alignment because it was originally assumed that some CPUs would access aligned data faster or even would need it to not trap. That is not necessarily true anymore in many cases. Still, FlatBuffers is specified as having a certain alignment, so we should continue to output it as such.

Remains the question wether we should add alignment checks to the verifier. An accidentally unaligned buffer is fine on x86, but can cause trouble on ARM when accessed thru a typed pointer that is assumed to be aligned, and could trap. Since the verifier's purpose is to allow "safe access", we should check for this, at least on ARM.

The alternative that could be considered, is to ensure EndianScalar is implemented in terms of memcpy, and that it doesn't regress performance on any compiler/CPU combinations. memcpy would ensure "any alignment" is allowed, and thus would make a misaligned FlatBuffer not cause access problems. From my testing, Visual Studio Debug mode was the only compiler that would emit an actual memcpy call, which is expensive, all other compilers/modes would use a scalar load (and an alignment-safe one on ARM).

Once you have that, while a misaligned buffer is still not desirable, the verifier would not need to care.

It's interesting to fantasize what it would mean to allow fully unaligned (tightly packed) FlatBuffers. They would be backwards but not forwards compatible.

@vglavnyy @mikkelfj who may have opinions.

AustinSchuh commented 3 years ago

Gotch, that is super helpful. It checks alignment on offsets (those are required to be 4 byte aligned) today.

I just finished writing some code to copy a flatbuffer by finding the min and max memory address, computing the alignment, then placing it in a FlatBufferBuilder and returning the resulting offset. I'll do my best to preserve the alignment.

mikkelfj commented 3 years ago

Hi @AustinSchuh you can try to run your buffer through flatcc's verifier. It checks alignment. https://github.com/dvidelabs/flatcc#verifying-a-buffer

The only problem is that it checks alignment too well: the buffer must start on an aligned boundary to be verifiable. Ideally there would be a flag to check that the buffer would be aligned when placed into aligned memory. Several users have had problems with buffer alignment because they create a char buf[4000]; like array and finalizes a buffer into it. However, that buffer is rarely aligned.

@gwvo I believe there may be an increased need for aligned memory, I just don't recall where. But in one case this may be in WebAssembly, that at least last I checked, require alignment.

There are many places in flatcc that uses unaligned reads even when the buffer is aligned. I had a branch where I used memcpy instead of the union conversion trick, but I gave it up again. It got messy and I wasn't sure about performance on all compilers. This isn't so much about actually unaligned reads as it is about strict type conversion rules in modern C/C++.

Flatcc's portable sub-library has functions for unaligned_read in different sizes. This mostly checks if the platform is x86/x64, and if so, uses a pointer cast, and if not, goes the slow path (could probably be improved, but memcpy is the way to go, and it can be fast on some compilers). Actually this is very useful in many things like hash functions because you can just use it an optimise later if needed. https://github.com/dvidelabs/flatcc/blob/master/include/flatcc/portable/punaligned.h Notably, some of these should use memcpy and bswap.

mikkelfj commented 3 years ago

On memcpy optimizations, there are lengthy, and quite good, online discussions on this. The problem is that even if memcpy is optimized into native instructions, it still sometimes effect loop optimizations and similar, so the overall result can vary from excellent to rather poor.

vglavnyy commented 3 years ago

Alignment does matter.

On ARM, MIPS, Sparc, and some DSP unaligned memory access might case a hard fault.

Performance (x86/amd64):

Cache split lock. If a value located in two cache lines write access leads to excessive LOCK transactions on CPU bus to keep coherency of cache memory for all CPU cores (well known Split Lock issue in Linux Kernel).
Memory split. Access to a scalar located in two memory regions is like the split lock on the cache. A value crossed by internal memory boundary might require two write transactions on the memory bus, two access to DRAM pages, and two access to MMU TLB with extra miss-hits.

Both points are about performance in HPC or about high-performance databases. For example, slit lock might significantly degrate system performance (https://rigtorp.se/split-locks):

Running the above on my computer I get 1194 nanoseconds per operation. That’s two hundred times slower, two orders of magnitude!

Language-specific problems (C++): A program that access to unaligned value is ill-formed program.

6.6.5 Alignment [basic.align] Object types have alignment requirements (6.7.1, 6.7.2) which place restrictions on the addresses at which an object of that type may be allocated. An alignment is an implementation-defined integer value representing the number of bytes between successive addresses at which a given object can be allocated. An object type imposes an alignment requirement on every object of that type; stricter alignment can be requested using the alignment specifier (10.6.2).

Usualy alignof(T) == sizeof(T) is true for modern compilers, but it is not guaranteed to be true.

The result of the alignof operator reﬂects the alignment requirement of the type in the complete-object case.
A fundamental alignment is represented by an alignment less than or equal to the greatest alignment supported by the implementation in all contexts, which is equal to alignof(std::max_align_t).
Valid alignments include only those values returned by an alignof expression for the fundamental types plus an additional implementation-deﬁned set of values, which may be empty. Every alignment value shall be a non-negative integral power of two.

There is an interesting example of unaligned access with SSE optimization: A bug story: data alignment on x86.

As for issue, I think it would be better to add a special VerifyAlign method to test alignment in serialized messages.

mikkelfj commented 3 years ago

@vglavnyy while I do not disagree, there are a number of benchmarks where unaligned access experimentally has been show to be very close to aligned access under micro benchmark conditions, and some where it works less well on older ARM.

As to split locks, I'd say it is unlikely that you would want to lock memory in a read only buffer.

As to cache alignment, this is where the benchmark shows that the penalty is often small (except SIMD). However, cache line alignment is a different issue. Here unaligned memory can more easily span two cache lines, but it can also help pack data more densely and avoid extra cache line access.

As to language standard. This is a pain, but it is already a pain for aligned FlatBuffers because it is not only alignment but also type mismatch that cause trouble when convrerting raw buffers to meaningful data. So you already need to deal with this behind the scenes, although to an extend you can rely on char * casts being valid.

The C builder implementation is much more complicated and error prone compared to writing unaligned buffers.

Still, I believe alignment matters in a number of cases.

vglavnyy commented 3 years ago

As to split locks, I'd say it is unlikely that you would want to lock memory in a read only buffer.

The split lock was an example of unaligned access. Unaligned data increase the probability of Split Lock. With flatbuffers it is impossible to create LOCK race because it requires atomic access to a variable in memory.

However, a case with SSE or NEON optimization is possible. A compiler might generate the following code (pseudo-code):

flatbuffers::Array<uint32_t> fbs_array[N];
static_assert(alignof(uint32_t) % sizeof(uint32_t)  == 0);
n_align16 = (fbs_array % 16) / sizeof(uint32_t);
// number of words in aligned block
K = (N - n_align16) / (16 / sizeof(uint32_t));

procces_unaligned(fbs_array, n_align16);

// The first (the best) case with possible FAULT if (fbs_array % sizeof(uint32_t) != 0
process_aligned(fbs_array + n_align16, K);

// Another (the worst) case with an invalid result and possible FAULT
// aligned addr = (fbs_array & 0xFFFFFFFFFFFFFFF0) + 0x10;
procces_aligned((fbs_array / 16) * 16 + 16, K);

process_unaligned(fbs_array + n_align16 + K, N - n_align16 - K);

mikkelfj commented 3 years ago

However, a case with SSE or NEON optimization is possible.

Yes, this why types cannot be cast arbitrarily. For example uint16_t word[20]; uint64_t *p = (uint64_t*)word; goes wrong, not only in compiler errors, but also in possible optimizations that compiler might otherwise do, for example it might do SIMD in loop unrolling assuming 8 byte alignment on the data, when it is not.

However, if you type cast correctly, the compiler should not do such optimizations behind your back. This is what unaligned_read_uint64(p) is for.

aardappel commented 3 years ago

@mikkelfj no, Wasm doesn't need aligment, you can specify the alignment for every load instruction, which is allowed to be 1 byte even for bigger scalars. LLVM will emit this kind of load when it is the result of a memcpy being lowered. Source: I actually work on the LLVM Wasm backend :P

@vglavnyy I think those aligned advantages are present only when you're processing massive arrays natively. When reading most FlatBuffer data, especially when going over an indirection anyway, I'm pretty sure it makes no difference.

Unaligned access in C++ is undefined when using a T * directly to do the load, but not with memcpy.

mikkelfj commented 3 years ago

@aardappel thought you were. My info on Wasm is several years old.

But I do work on time sensitive massive arrays of data in FlatBuffers FWIW.

aardappel commented 3 years ago

@mikkelfj if you made a by default de-aligned/unpadded FlatBuffers, adding an option (or even default) to still align vectors would make sense.

mikkelfj commented 3 years ago

There are also structs that currently map 1:1 to C/C++.

mikkelfj commented 3 years ago

BTW: https://lemire.me/blog/2012/05/31/data-alignment-for-speed-myth-or-reality/

Now I think I remember where alignment is important: billions of IoT devices. Including the widely popular ESP8266 and ESP32 which FlatCC is being used for.

aardappel commented 3 years ago

That link seems to support my point perfectly.

Is what these IOT devices do limited by array processing speed?

I am not advocating to force alignment off, I am advocating that the ideal serialization system should have it by default off, optionally on. If you have special use cases, turn it on.

AustinSchuh commented 3 years ago

Thanks everyone! I'm playing with checking alignment now based on the links from @mikkelfj above and adding alignment checks into flatbuffers as well to check.

mikkelfj commented 3 years ago

@aardappel

Is what these IOT devices do limited by array processing speed?

I'm not sure what they are used for exactly. I'm guessing that in most cases they will happily trade some low frequency CPU cycles for reduced memory consumption. But it probably varies a lot. In many case it is probably either configuration or network messages.

Anyway, if alignment goes the way of the DoDo, I'd rather not have it as an option, because as I mentioned, it is a significant maintenance burden and source of errors.

The Arrow format might be better for dealing with large data volumes, and I think it might be aligned on array boundaries, but I never got around to dealing with it - gave one of the authors some input on how they use FlatBuffers for their schema.

@AustinSchuh Glad you can use it.

github-actions[bot] commented 3 years ago

This issue is stale because it has been open 6 months with no activity. Please comment or this will be closed in 14 days.

google / flatbuffers

Alignment set but not verified on Long #6230