Closed kimwalisch closed 7 years ago
This is what I have implemented in primecount's libdivide.h
:
#pragma pack(push, 1)
struct libdivide_u32_t {
uint32_t magic;
uint8_t more;
};
struct libdivide_s32_t {
int32_t magic;
uint8_t more;
};
struct libdivide_u64_t {
uint64_t magic;
uint8_t more;
};
struct libdivide_s64_t {
int64_t magic;
uint8_t more;
};
struct libdivide_u32_branchfree_t {
uint32_t magic;
uint8_t more;
};
struct libdivide_s32_branchfree_t {
int32_t magic;
uint8_t more;
};
struct libdivide_u64_branchfree_t {
uint64_t magic;
uint8_t more;
};
struct libdivide_s64_branchfree_t {
int64_t magic;
uint8_t more;
};
#pragma pack(pop)
The above code compiles and works as expected with GCC, Clang and MSVC.
Makes sense to me, go for it. No need for a PR.
Implemented packed divider structs using:
#pragma pack(push, 1)
...
#pragma pack(pop)
#pragma pack
has been supported by all major C++ compilers for nearly a decade now: GCC 2005, Clang 2008, MSVC 2005 (likely even earlier).
My primecount project uses a huge
std::vector
of branchfree u64 dividers.Unfortunately GCC (macOS, GCC 6) padds the
libdivide_u64_branchfree_t
struct with 7 additional bytes to a total of 16 bytes. This increases the memory usage of my algorithm by a very significant amount.Using
attribute__((packed))
prevents GCC from padding thelibdivide_u64_branchfree_t
struct and reduces the memory usage of my algorithm by 30 percent and improves performance by 10% on x86-64 because the inner loop is more memory/cache efficient.I will use
attribute__((packed))
in primecount'slibdivide.h
and I wonder if we should use it the official ridiculousfish/libdivide.h as well?Potential issue:
All modern CPU architectures (x86, x64, PPC64, AArch64) support fast unaligned memory access but some old CPU architectures e.g. IA-64 & sparc64 will run slower using
attribute__((packed))
if an array or vector of libdivide divisors is used.Here is an article from 2006 comparing the assembly code generation for packed and unpacked structs.
Personally, I would enable packed divider structs for all CPU architectures as I guess the number of users that will use an array or vector of libdivide divisors on ancient CPU architectures is very small.
Implementation considerations
attribute__((packed))
is supported by GCC & Clang but it is non-standard C/C++. MSVC supports#pragma pack
which is also supported by GCC (since version 4.0 at least) and most likely Clang as well.So I am in favour of using
#pragma pack
.What's your opinion on packed divider structs fish? Can I go ahead and implement it (and then create a pull request)?