Improvement suggestion: packed divider structs

kimwalisch commented 7 years ago

My primecount project uses a huge std::vector of branchfree u64 dividers.

struct libdivide_u64_branchfree_t {
    int64_t magic;
    uint8_t more;
};

Unfortunately GCC (macOS, GCC 6) padds the libdivide_u64_branchfree_t struct with 7 additional bytes to a total of 16 bytes. This increases the memory usage of my algorithm by a very significant amount.

Using attribute__((packed)) prevents GCC from padding the libdivide_u64_branchfree_t struct and reduces the memory usage of my algorithm by 30 percent and improves performance by 10% on x86-64 because the inner loop is more memory/cache efficient.

I will use attribute__((packed)) in primecount's libdivide.h and I wonder if we should use it the official ridiculousfish/libdivide.h as well?

Potential issue:

All modern CPU architectures (x86, x64, PPC64, AArch64) support fast unaligned memory access but some old CPU architectures e.g. IA-64 & sparc64 will run slower using attribute__((packed)) if an array or vector of libdivide divisors is used.

Here is an article from 2006 comparing the assembly code generation for packed and unpacked structs.

Personally, I would enable packed divider structs for all CPU architectures as I guess the number of users that will use an array or vector of libdivide divisors on ancient CPU architectures is very small.

Implementation considerations

attribute__((packed)) is supported by GCC & Clang but it is non-standard C/C++. MSVC supports #pragma pack which is also supported by GCC (since version 4.0 at least) and most likely Clang as well.

So I am in favour of using #pragma pack.

What's your opinion on packed divider structs fish? Can I go ahead and implement it (and then create a pull request)?

kimwalisch commented 7 years ago

This is what I have implemented in primecount's libdivide.h:

#pragma pack(push, 1)

struct libdivide_u32_t {
    uint32_t magic;
    uint8_t more;
};

struct libdivide_s32_t {
    int32_t magic;
    uint8_t more;
};

struct libdivide_u64_t {
    uint64_t magic;
    uint8_t more;
};

struct libdivide_s64_t {
    int64_t magic;
    uint8_t more;
};

struct libdivide_u32_branchfree_t {
    uint32_t magic;
    uint8_t more;
};

struct libdivide_s32_branchfree_t {
    int32_t magic;
    uint8_t more;
};

struct libdivide_u64_branchfree_t {
    uint64_t magic;
    uint8_t more;
};

struct libdivide_s64_branchfree_t {
    int64_t magic;
    uint8_t more;
};

#pragma pack(pop)

The above code compiles and works as expected with GCC, Clang and MSVC.

ridiculousfish commented 7 years ago

Makes sense to me, go for it. No need for a PR.

kimwalisch commented 7 years ago

Implemented packed divider structs using:

#pragma pack(push, 1)
...
#pragma pack(pop)

#pragma pack has been supported by all major C++ compilers for nearly a decade now: GCC 2005, Clang 2008, MSVC 2005 (likely even earlier).

ridiculousfish / libdivide

Improvement suggestion: packed divider structs #33

Potential issue:

Implementation considerations