ludocode / mpack

MPack - A C encoder/decoder for the MessagePack serialization format (msgpack.org)

Possible encoding problems with differing double/int64 endianness #15

Closed: rikvdh closed this issue 9 years ago

rikvdh commented 9 years ago

Hi Nicholas,

I'm working on an ARM platform (little-endian) with your library. I was using doubles and floats, but this probably also applies to other types in mpack. The items in the packet are endian-swapped and not sent big-endian as described in the spec: https://github.com/msgpack/msgpack/blob/master/spec.md#formats-float

When I send them to a server (e.g. x86), the values are also wrong. Using a union to map between uint64 and double is valid on big-endian systems, but invalid on little-endian ones: https://github.com/ludocode/mpack/blob/8c5bd668b515ded9e8e8e5a2f48033d2c3f6bd75/src/mpack/mpack-writer.c#L377-L384

Perhaps it is a good idea to check for little-endian and swap the bytes when necessary?

ludocode commented 9 years ago

Huh, now that is interesting and surprising. It seems that you're working on a platform that has different endianness for floats versus ints.

The way endianness is handled currently works fine for transferring floats between x86, amd64, and ARMv6 and up, since I'm able to communicate floats between iPhone, Android and desktop PC without endianness problems. But these platforms have matching endianness; the ARM phones use big-endian for everything and x86 uses little-endian for everything. I think. I had read about platforms that have different endianness for different types though, and that seems to be what's happening here. I'll have to test for swapped endianness, and swapped words possibly with swapped endianness (since apparently there are also platforms that store doubles in two separate 32-bit words in the opposite order of their endianness...)

Can you give a bit more info on the platform you're working on? Do you know if there's a way I can emulate a similar platform here, perhaps with QEMU, to test it?

rikvdh commented 9 years ago

Sorry, I checked again and I was wrong. I saw some weird values in my double, but on double-checking, the values on the server seem to unpack correctly. :)

Sorry for the inconvenience.

ludocode commented 9 years ago

Hm. This does raise an interesting problem, though: whether platforms with differing endianness between ints and floats actually work. I'm reading a bit more, and apparently I'm wrong about ARM; modern ARM processors can be configured for either big- or little-endian, and Android devices typically run in little-endian mode.

The Raspberry Pi does have an option for big-endian mode, though. I'll have to see if I can boot it with a big-endian kernel and run the unit tests on it. I'll need to do a bunch more investigative work to see whether the library handles this correctly, and to fix any platforms that aren't well supported.

rikvdh commented 9 years ago

I don't think there is a problem (at least in the writer), because you store everything with bit-shifts. When you map a double to an int64 it might be endian-swapped, but that doesn't matter, because the int64 itself is endian-swapped in exactly the same way. So no problems there. And when you do your bit-shifts here: https://github.com/ludocode/mpack/blob/master/src/mpack/mpack-writer.c#L287-L316, (val >> 56) still evaluates to the most significant 8 bits of the int64_t, and (val & 0xFF) is still the least significant byte, which is put at u[7].
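
In other words, the store side looks roughly like this (a sketch of the pattern, not the exact mpack code):

#include <stdint.h>

/* Sketch: store a uint64_t into the buffer in big-endian order using
 * only shifts and masks, so the host's byte order never matters. */
static void store_u64_be(char* u, uint64_t val) {
    u[0] = (char)((val >> 56) & 0xFF);  /* most significant byte first */
    u[1] = (char)((val >> 48) & 0xFF);
    u[2] = (char)((val >> 40) & 0xFF);
    u[3] = (char)((val >> 32) & 0xFF);
    u[4] = (char)((val >> 24) & 0xFF);
    u[5] = (char)((val >> 16) & 0xFF);
    u[6] = (char)((val >>  8) & 0xFF);
    u[7] = (char)( val        & 0xFF);  /* least significant byte at u[7] */
}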

You could (perhaps) improve performance there by using htonl() and memcpy(). You would only need to build an endian swap for int64 yourself, because a 64-bit htonl() doesn't exist. I'm also not sure whether htonl() is part of C99; otherwise a custom implementation could suffice, for example (wrapped here as a hypothetical htonll()):

static uint64_t htonll(uint64_t value) {
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
    uint32_t high_part = htonl((uint32_t)(value >> 32));
    uint32_t low_part = htonl((uint32_t)(value & 0xFFFFFFFFULL));
    return (((uint64_t)low_part) << 32) | high_part;
#else
    return value; /* already big-endian */
#endif
}
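
Writing a double big-endian would then just be a type-pun plus a swap, something like this sketch (buffer here stands in for wherever the writer's output goes, and memcpy needs <string.h>):

double d = 1.25;
uint64_t bits;
memcpy(&bits, &d, sizeof(bits));      /* reinterpret the double's bits */
bits = htonll(bits);                  /* swap to big-endian if needed */
memcpy(buffer, &bits, sizeof(bits));  /* emit the 8 payload bytes */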

EDIT: for the reader you do the same thing, so no problems there either (besides the possible speed improvement): https://github.com/ludocode/mpack/blob/master/src/mpack/mpack-common.h#L277-L302

ludocode commented 9 years ago

Yeah, the bit-shifting definitely handles endianness for ints correctly. I'm just worried about architectures where floats and ints have opposite endianness. For example, if an architecture stores ints in little-endian and floats in big-endian, the bytes will get swapped around correctly when reading into an int, but will then be in the wrong order when we type-pun it to a float. (Edit: the bytes will maybe be in the wrong order. The CPU might correctly swap the byte order when moving data (without conversion) from an int register to a float register, so maybe this happens automatically. I don't actually know, but I'd like to test it to make sure.)
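
To make the worry concrete, the read path for a double is essentially this (a sketch, not the exact mpack code):

#include <stdint.h>
#include <string.h>

/* Sketch: the shifts produce a correct uint64_t regardless of int
 * endianness, but the bits are then reinterpreted as a double unchanged.
 * If doubles used a different byte order than ints, the pun would be wrong. */
static double load_double_be(const char* p) {
    uint64_t bits = 0;
    for (int i = 0; i < 8; ++i)
        bits = (bits << 8) | (uint8_t)p[i];  /* big-endian integer load */
    double d;
    memcpy(&d, &bits, sizeof(d));            /* type-pun, no extra swap */
    return d;
}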

I'm not sure whether htonl() would actually yield performance gains; I'll check, though. The best gains would probably come from using compiler-specific intrinsics for swapping bytes. GCC supports __builtin_bswap64(), for example:

https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html

rikvdh commented 9 years ago

Is that even possible, having different endianness for different types of variables? I know some architectures can switch endianness, but that isn't dynamic behaviour, since it would require moving all the memory around.

__builtin_bswap64() always swaps, while htonl() only swaps when necessary: on a big-endian platform, htonl() does nothing. __builtin_bswap64() is also not very portable.
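
If portability is the concern, a wrapper could dispatch on the compiler and fall back to shifts; a sketch (my_bswap64 is just a made-up name):

#include <stdint.h>
#ifdef _MSC_VER
#include <stdlib.h>  /* _byteswap_uint64 */
#endif

static uint64_t my_bswap64(uint64_t v) {
#if defined(__GNUC__) || defined(__clang__)
    return __builtin_bswap64(v);  /* GCC/Clang intrinsic */
#elif defined(_MSC_VER)
    return _byteswap_uint64(v);   /* MSVC intrinsic */
#else
    /* Portable fallback: swap bytes, then 16-bit pairs, then 32-bit halves. */
    v = ((v & 0x00FF00FF00FF00FFULL) <<  8) | ((v >>  8) & 0x00FF00FF00FF00FFULL);
    v = ((v & 0x0000FFFF0000FFFFULL) << 16) | ((v >> 16) & 0x0000FFFF0000FFFFULL);
    return (v << 32) | (v >> 32);
#endif
}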

ludocode commented 9 years ago

I tried using __builtin_bswap64() to see if it would be faster. The below obviously only works on a little-endian machine:

MPACK_ALWAYS_INLINE uint64_t mpack_load_native_u64(const char* p) {
    uint64_t val;
    memcpy(&val, p, sizeof(val));
    return __builtin_bswap64(val);
}

I also tried removing the memcpy() and just doing an unaligned type-aliased read. The below is even less portable:

MPACK_ALWAYS_INLINE uint64_t mpack_load_native_u64(const char* p) {
    return __builtin_bswap64(*(const uint64_t*)p);
}
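
For reference, the portable bit-shifting version being compared against is essentially this shape (paraphrased, not the verbatim mpack source):

MPACK_ALWAYS_INLINE uint64_t mpack_load_native_u64(const char* p) {
    return (((uint64_t)(uint8_t)p[0]) << 56) |
           (((uint64_t)(uint8_t)p[1]) << 48) |
           (((uint64_t)(uint8_t)p[2]) << 40) |
           (((uint64_t)(uint8_t)p[3]) << 32) |
           (((uint64_t)(uint8_t)p[4]) << 24) |
           (((uint64_t)(uint8_t)p[5]) << 16) |
           (((uint64_t)(uint8_t)p[6]) <<  8) |
            ((uint64_t)(uint8_t)p[7]);
}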

It turns out all three of these (including the current bit-shifting code) compile to exactly the same thing on x86_64 with GCC 5.2 under -O3 (with the function marked noinline so the generated code can be inspected):

mpack_load_native_u64:
        movq    (%rdi), %rax
        bswap   %rax
        ret

GCC 5.2.0 is smart enough to figure out that the portable bit-shifting code can be implemented with an unaligned read and a bswap instruction. That's kind of amazing.

On the other hand, on a Raspberry Pi B2 512MB (000e), the version with memcpy() and __builtin_bswap64() does actually seem to run slightly faster than the bit-shifting version (t-test result 0.0175). More interestingly (and perhaps relatedly), the resulting executable of the benchmark reader is about 800 bytes smaller as well. So it seems GCC's bswap detection is platform-specific: the lack of a bswap instruction makes it avoid rewriting the bit-shifting code as an unaligned read and byte swap.

Here's the memcpy()/__builtin_bswap64() version on Raspberry Pi:

mpack_load_native_u64:
        @ args = 0, pretend = 0, frame = 8
        @ frame_needed = 0, uses_anonymous_args = 0
        @ link register save eliminated.
        mov     r2, r0
        sub     sp, sp, #8
        ldr     r0, [r0]        @ unaligned
        ldr     r1, [r2, #4]    @ unaligned
        mov     r3, sp
        stmia   r3!, {r0, r1}
        mov     r1, r0
        ldr     r0, [sp, #4]
        rev     r1, r1
        rev     r0, r0
        add     sp, sp, #8
        @ sp needed
        bx      lr

And here's the current bit-shifting version (as of 90a4a14328b1e95b3417551cc957a7ead72eab31):

mpack_load_native_u64:
        @ args = 0, pretend = 0, frame = 0
        @ frame_needed = 0, uses_anonymous_args = 0
        @ link register save eliminated.
        stmfd   sp!, {r4, r5, r6, r7, r8, r9, r10, fp}
        mov     r10, #0
        ldrb    r4, [r0, #1]    @ zero_extendqisi2
        ldrb    r2, [r0]        @ zero_extendqisi2
        ldrb    r6, [r0, #2]    @ zero_extendqisi2
        mov     r5, #0
        mov     r3, r4, asl #16
        ldrb    r4, [r0, #7]    @ zero_extendqisi2
        mov     fp, r2, asl #24
        orr     r2, r10, r10
        orr     r2, r2, r4
        ldrb    r4, [r0, #4]    @ zero_extendqisi2
        orr     r3, r3, fp
        orr     r2, r2, r10
        ldrb    r9, [r0, #3]    @ zero_extendqisi2
        mov     r7, r4, lsr #8
        orr     r3, r3, r5
        orr     r2, r2, r10
        mov     r5, r6, asl #8
        mov     r6, r4, asl #24
        ldrb    r4, [r0, #5]    @ zero_extendqisi2
        orr     r2, r2, r6
        ldrb    r6, [r0, #6]    @ zero_extendqisi2
        orr     r3, r3, r5
        orr     r3, r3, r9
        orr     r3, r3, r7
        mov     r9, r4, lsr #16
        mov     r8, r4, asl #16
        mov     r1, r6, lsr #24
        orr     r2, r2, r8
        orr     r3, r3, r9
        mov     r0, r6, asl #8
        orr     r0, r0, r2
        orr     r1, r1, r3
        ldmfd   sp!, {r4, r5, r6, r7, r8, r9, r10, fp}
        bx      lr

Neat. So it looks like there will be some advantages in both speed and code size from detecting endianness and using compiler intrinsics for byte swapping.
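
The detection itself can mostly ride on the compiler's predefined macros; roughly (the macro name here is illustrative, not an actual mpack option):

/* Sketch: compile-time endianness detection. Modern GCC and Clang
 * predefine __BYTE_ORDER__; other compilers need their own checks
 * or an explicit configuration option. */
#if defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
    #define MY_HOST_BIG_ENDIAN 1  /* loads/stores can be a plain memcpy */
#elif defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
    #define MY_HOST_BIG_ENDIAN 0  /* memcpy plus a byte-swap intrinsic */
#else
    /* unknown byte order: keep the portable bit-shifting code */
#endif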

I still haven't tried any platforms with weird int/float endianness to see whether the float conversion works, which is the actual purpose of this issue. I'm glad it's working on your platform, though, and it works on all the common architectures and devices I've seen so far, so I'm less concerned about it.

ludocode commented 9 years ago

After doing more research, I really don't think I need to worry about platforms with esoteric float layouts or endianness. It seems only ancient machines used differing int/float endianness or the weird half-endianness for doubles, and even there a modern compiler could probably do the conversion correctly on its own. Everything works on all the platforms I've tried (including iOS, Android and Raspberry Pi), so unless we find a platform where it's broken, there's not much more I can do.

The performance improvements we could get from byte-swapping intrinsics aren't really related to this issue, so I'll close it. I'll be looking into more performance improvements like this soon.