Closed WojciechMula closed 5 years ago
For improved portability, I recommend the following functions:
// portable version of posix_memalign
static inline void *aligned_malloc(size_t alignment, size_t size) {
  void *p;
#ifdef _MSC_VER
  p = _aligned_malloc(size, alignment);
#elif defined(__MINGW32__) || defined(__MINGW64__)
  p = __mingw_aligned_malloc(size, alignment);
#else
  if (posix_memalign(&p, alignment, size) != 0) { return nullptr; }
#endif
  return p;
}

static inline void aligned_free(void *memblock) {
  if (memblock == nullptr) { return; }
#ifdef _MSC_VER
  _aligned_free(memblock);
#elif defined(__MINGW32__) || defined(__MINGW64__)
  __mingw_aligned_free(memblock);
#else
  free(memblock);
#endif
}
That is what we use in simdjson.
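As a rough usage sketch, the two helpers can be wrapped in a `std::unique_ptr` so that a 64-byte-aligned `uint16_t` buffer is released automatically. The names `AlignedDeleter`, `AlignedWords` and `make_aligned_words` are hypothetical and do not come from simdjson or from this PR. The helpers are repeated only so the snippet compiles on its own.

```cpp
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <memory>
#if defined(_MSC_VER) || defined(__MINGW32__) || defined(__MINGW64__)
#include <malloc.h>
#endif

// Helpers from the comment above, repeated so this snippet is self-contained.
static inline void *aligned_malloc(size_t alignment, size_t size) {
  void *p;
#ifdef _MSC_VER
  p = _aligned_malloc(size, alignment);
#elif defined(__MINGW32__) || defined(__MINGW64__)
  p = __mingw_aligned_malloc(size, alignment);
#else
  if (posix_memalign(&p, alignment, size) != 0) { return nullptr; }
#endif
  return p;
}

static inline void aligned_free(void *memblock) {
  if (memblock == nullptr) { return; }
#ifdef _MSC_VER
  _aligned_free(memblock);
#elif defined(__MINGW32__) || defined(__MINGW64__)
  __mingw_aligned_free(memblock);
#else
  free(memblock);
#endif
}

// Hypothetical deleter: routes unique_ptr cleanup through aligned_free.
struct AlignedDeleter {
  void operator()(uint16_t *p) const { aligned_free(p); }
};
using AlignedWords = std::unique_ptr<uint16_t[], AlignedDeleter>;

// Hypothetical helper: allocate n 16-bit words starting on a 64-byte boundary.
// Returns an empty pointer if the allocation fails.
static inline AlignedWords make_aligned_words(size_t n) {
  void *p = aligned_malloc(64, n * sizeof(uint16_t));
  return AlignedWords(static_cast<uint16_t *>(p));
}
```

A benchmark could then call `make_aligned_words(n)` once per input and pass `buf.get()` to the function under test.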
@WojciechMula Great work! Noticeable differences, that's for sure. I wrongly assumed we did this already! It's been implemented in the non-instrumented benchmark since forever.
@WojciechMula Could you add the code suggested by @lemire? Then I will merge this PR.
> I wrongly assumed we did this already! It's been implemented in the non-instrumented benchmark since forever.
I don't think we want to assume that the uint16_t * arrays are aligned on cache lines without qualification. If our results depend on that, we need to disclose it... because actual applications typically won't meet this requirement. It is an engineering constraint. It is not a trivial constraint.
Though @WojciechMula is not entirely explicit, I don't think that's what he has in mind when he writes the following...
"I think we might try to align pointers. If the address is even, it's possible -- we have to process 31 - (address & 0x3f) / 2 input words with a scalar code. Or round down the address and do masked load (making the code address-sanitizer-unfriendly)."
...
I think that he means that our functions should be fixed so that they skip the first few words... up to the point where they are aligned on a cache line... and then proceed from there. That is, he is hinting that our software should be better written.
That's almost entirely trivial software-wise if we have 16-bit alignment... but 16-bit alignment is far more reasonable an assumption.
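To make the "skip the first few words" idea concrete, here is a minimal sketch of how the scalar prefix length could be computed for a 2-byte-aligned pointer. The function name `words_until_cache_line` is hypothetical. The arithmetic is my own reading: it counts the 16-bit words between the pointer and the next 64-byte boundary, giving 0 for an already aligned address and 31 for an address two bytes past a boundary, so it may differ by one from the expression quoted above.

```cpp
#include <stddef.h>
#include <stdint.h>

// Hypothetical helper: number of leading 16-bit words to handle with scalar
// code so that the remainder of the array starts on a 64-byte boundary.
// The result is clamped to len. If the address is odd, alignment can never
// be reached by stepping in 2-byte words, so everything is left to scalar code.
static inline size_t words_until_cache_line(const uint16_t *data, size_t len) {
  const uintptr_t addr = reinterpret_cast<uintptr_t>(data);
  if (addr & 1) { return len; }
  const size_t misalign = addr & 0x3f;                 // byte offset inside the line
  const size_t head = ((64 - misalign) & 0x3f) / 2;    // bytes to boundary, in words
  return head < len ? head : len;
}
```

After processing that many words with scalar code, `data + head` is cache-line aligned and the vectorized loop can use aligned loads.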
> I don't think we want to assume that the uint16_t * arrays are aligned on cache lines without qualification. If our results depend on that, we need to disclose it... because actual applications won't meet this requirement typically. It is an engineering constraint. It is not a trivial constraint.
Results are reported using unaligned memory (instrumented_benchmark).
> I think that he means that our functions should be fixed so that they skip the first few words... up to the point where they are aligned on a cache line... and then proceed from there.
This is interesting. 10% is attractive if generalizable.
> 10% is attractive if generalizable.
I suspect it is. We are just missing a little bit of engineering effort.
@mklarqvist I included the procedure Daniel proposed and handled the #ifdefs a bit better.
I have an idea for wrapping our existing SIMD procedures in a generic macro that would take care of alignment and pass the right pointers and lengths to the main SIMD function and the scalar fallbacks. I will open a new MR for this.
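One way such a wrapper might look, written as a function template rather than a macro, is sketched below. This is only an illustration of the idea, not the code from the follow-up MR. The names `dispatch_aligned` and `count_bits_scalar` are hypothetical, as is the kernel signature `(const uint16_t *, size_t, uint32_t *)`, and the per-bit counting kernel is just a stand-in workload.

```cpp
#include <stddef.h>
#include <stdint.h>

// Stand-in scalar kernel (hypothetical): for each of the 16 bit positions,
// add the number of input words that have that bit set.
static inline void count_bits_scalar(const uint16_t *data, size_t len,
                                     uint32_t *counts) {
  for (size_t i = 0; i < len; ++i) {
    for (int b = 0; b < 16; ++b) {
      counts[b] += (data[i] >> b) & 1u;
    }
  }
}

// Hypothetical wrapper: splits [data, data + len) into
//   head: scalar words up to the next 64-byte boundary,
//   body: a whole number of 64-byte blocks, starting aligned,
//   tail: leftover words shorter than one block,
// and forwards each part to the matching kernel. Both kernels must
// accumulate into out rather than overwrite it.
template <typename ScalarFn, typename SimdFn>
static inline void dispatch_aligned(const uint16_t *data, size_t len,
                                    uint32_t *out, ScalarFn scalar,
                                    SimdFn simd) {
  const size_t kBlockWords = 32;  // 64 bytes / sizeof(uint16_t)
  const uintptr_t addr = reinterpret_cast<uintptr_t>(data);
  if (addr & 1) {  // odd address: a 64-byte boundary is unreachable
    scalar(data, len, out);
    return;
  }
  size_t head = ((64 - (addr & 0x3f)) & 0x3f) / 2;
  if (head > len) { head = len; }
  const size_t body = ((len - head) / kBlockWords) * kBlockWords;
  const size_t tail = len - head - body;
  if (head != 0) { scalar(data, head, out); }
  if (body != 0) { simd(data + head, body, out); }
  if (tail != 0) { scalar(data + head + body, tail, out); }
}
```

With this shape, the SIMD kernel can assume both an aligned pointer and a length that is a multiple of 32 words, which keeps the alignment logic in one place instead of repeating it in every procedure.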
When input data is aligned to the cache-line boundary (64 bytes), we can observe some improvements for the AVX512 code. See the collations below.
I think we might try to align pointers. If the address is even, it's possible -- we have to process 31 - (address & 0x3f) / 2 input words with a scalar code. Or round down the address and do masked load (making the code address-sanitizer-unfriendly).