mklarqvist / positional-popcount

Fast C functions for computing the positional popcount (pospopcnt).
Apache License 2.0

Optionally use aligned input data #21

Closed WojciechMula closed 5 years ago

WojciechMula commented 5 years ago

When the input data is aligned to a 64-byte (cache-line) boundary, we can observe some improvements for the AVX512 code. See the comparison tables below.

I think we might try to align the pointers. If the address is even, it's possible -- we have to process at most 31 input words ((64 - (address & 0x3f)) / 2) with scalar code. Or round the address down and do a masked load (making the code address-sanitizer-unfriendly).
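
A rough sketch of the masked-load variant (assumes AVX-512BW; the helper name is hypothetical, the zeroed lanes of the first 64-byte block contribute nothing to the counts, and reading below `data` is what makes this approach address-sanitizer-unfriendly):

#include <immintrin.h>
#include <stdint.h>

/* Hypothetical prologue: round the pointer down to the previous 64-byte
   boundary and load that first block with a zero-masked load, so the
   out-of-range leading words are forced to zero. */
static inline __m512i pospopcnt_first_block(const uint16_t *data) {
    const uint16_t *base = (const uint16_t *)((uintptr_t)data & ~(uintptr_t)0x3f);
    unsigned skip = (unsigned)(data - base);       /* leading words to discard */
    __mmask32 keep = (__mmask32)(~0u << skip);     /* keep lanes at index >= skip */
    return _mm512_maskz_loadu_epi16(keep, base);
}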

Skylake-X                                                               default   align 64   speedup   improvement
--------------------------------------------------------------------------------------------------------------------
avx512popcnt                                cycles per 16-bit word:  0.091      0.063 1.444     +30.8%
pospopcnt_u16                               cycles per 16-bit word:  0.119      0.107 1.112     +10.1%
pospopcnt_u16_scalar_naive                  cycles per 16-bit word:  3.018      3.005 1.004      +0.4%
pospopcnt_u16_scalar_naive_nosimd           cycles per 16-bit word: 17.694     17.664 1.002      +0.2%
pospopcnt_u16_scalar_partition              cycles per 16-bit word:  3.046      3.025 1.007      +0.7%
pospopcnt_u16_scalar_hist1x4                cycles per 16-bit word:  2.929      2.923 1.002      +0.2%
pospopcnt_u16_scalar_umul128                cycles per 16-bit word:  2.340      2.342 0.999      -0.1%
pospopcnt_u16_scalar_umul128_unroll2        cycles per 16-bit word:  1.902      1.902 1.000      +0.0%
pospopcnt_u16_sse_single                    cycles per 16-bit word:  4.024      3.986 1.010      +0.9%
pospopcnt_u16_sse_blend_popcnt              cycles per 16-bit word:  1.622      1.615 1.004      +0.4%
pospopcnt_u16_sse_blend_popcnt_unroll4      cycles per 16-bit word:  1.396      1.397 0.999      -0.1%
pospopcnt_u16_sse_blend_popcnt_unroll8      cycles per 16-bit word:  1.347      1.346 1.001      +0.1%
pospopcnt_u16_sse_blend_popcnt_unroll16     cycles per 16-bit word:  1.411      1.412 0.999      -0.1%
pospopcnt_u16_sse2_sad                      cycles per 16-bit word:  0.999      1.000 0.999      -0.1%
pospopcnt_u16_sse2_harvey_seal              cycles per 16-bit word:  0.397      0.402 0.988      -1.3%
pospopcnt_u16_avx2_popcnt                   cycles per 16-bit word:  2.371      2.193 1.081      +7.5%
pospopcnt_u16_avx2                          cycles per 16-bit word:  3.021      3.012 1.003      +0.3%
pospopcnt_u16_avx2_naive_counter            cycles per 16-bit word:  3.021      3.012 1.003      +0.3%
pospopcnt_u16_avx2_single                   cycles per 16-bit word:  2.956      2.901 1.019      +1.9%
pospopcnt_u16_avx2_lemire                   cycles per 16-bit word:  1.914      1.916 0.999      -0.1%
pospopcnt_u16_avx2_lemire2                  cycles per 16-bit word:  1.133      1.128 1.004      +0.4%
pospopcnt_u16_avx2_blend_popcnt             cycles per 16-bit word:  1.009      0.933 1.081      +7.5%
pospopcnt_u16_avx2_blend_popcnt_unroll4     cycles per 16-bit word:  0.772      0.710 1.087      +8.0%
pospopcnt_u16_avx2_blend_popcnt_unroll8     cycles per 16-bit word:  0.705      0.681 1.035      +3.4%
pospopcnt_u16_avx2_blend_popcnt_unroll16    cycles per 16-bit word:  0.729      0.723 1.008      +0.8%
pospopcnt_u16_avx2_adder_forest             cycles per 16-bit word:  0.397      0.388 1.023      +2.3%
pospopcnt_u16_avx2_harvey_seal              cycles per 16-bit word:  0.226      0.215 1.051      +4.9%
pospopcnt_u16_avx512                        cycles per 16-bit word:  1.600      1.503 1.065      +6.1%
pospopcnt_u16_avx512bw_popcnt32_mask        cycles per 16-bit word:  1.357      0.857 1.583     +36.8%
pospopcnt_u16_avx512bw_popcnt64_mask        cycles per 16-bit word:  0.968      0.841 1.151     +13.1%
pospopcnt_u16_avx512_masked_ops             cycles per 16-bit word:  2.001      2.001 1.000      +0.0%
pospopcnt_u16_avx512_popcnt                 cycles per 16-bit word:  1.737      1.664 1.044      +4.2%
pospopcnt_u16_avx512bw_blend_popcnt         cycles per 16-bit word:  0.762      0.593 1.285     +22.2%
pospopcnt_u16_avx512bw_blend_popcnt_unroll4 cycles per 16-bit word:  0.578      0.558 1.036      +3.5%
pospopcnt_u16_avx512bw_blend_popcnt_unroll8 cycles per 16-bit word:  0.582      0.556 1.047      +4.5%
pospopcnt_u16_avx512_mula2                  cycles per 16-bit word:  0.508      0.506 1.004      +0.4%
pospopcnt_u16_avx512bw_adder_forest         cycles per 16-bit word:  0.290      0.276 1.051      +4.8%
pospopcnt_u16_avx512bw_harvey_seal          cycles per 16-bit word:  0.118      0.107 1.103      +9.3%
CannonLake                                                              default   align 64   speedup   improvement
--------------------------------------------------------------------------------------------------------------------
avx512popcnt                                    cycles per 16-bit word:  0.107      0.088 1.216     +17.8%
pospopcnt_u16                                   cycles per 16-bit word:  0.133      0.118 1.127     +11.3%
pospopcnt_u16_scalar_naive                      cycles per 16-bit word:  2.059      2.052 1.003      +0.3%
pospopcnt_u16_scalar_naive_nosimd               cycles per 16-bit word: 17.521     17.546 0.999      -0.1%
pospopcnt_u16_scalar_partition                  cycles per 16-bit word:  3.055      3.070 0.995      -0.5%
pospopcnt_u16_scalar_hist1x4                    cycles per 16-bit word:  2.831      2.828 1.001      +0.1%
pospopcnt_u16_scalar_umul128                    cycles per 16-bit word:  2.562      2.562 1.000      +0.0%
pospopcnt_u16_scalar_umul128_unroll2            cycles per 16-bit word:  2.057      2.056 1.000      +0.0%
pospopcnt_u16_sse_single                        cycles per 16-bit word:  3.861      3.855 1.002      +0.2%
pospopcnt_u16_sse_blend_popcnt                  cycles per 16-bit word:  2.077      2.076 1.000      +0.0%
pospopcnt_u16_sse_blend_popcnt_unroll4          cycles per 16-bit word:  1.574      1.574 1.000      +0.0%
pospopcnt_u16_sse_blend_popcnt_unroll8          cycles per 16-bit word:  1.433      1.433 1.000      +0.0%
pospopcnt_u16_sse_blend_popcnt_unroll16         cycles per 16-bit word:  1.385      1.383 1.001      +0.1%
pospopcnt_u16_sse2_sad                          cycles per 16-bit word:  1.012      1.013 0.999      -0.1%
pospopcnt_u16_sse2_harvey_seal                  cycles per 16-bit word:  0.362      0.364 0.995      -0.6%
pospopcnt_u16_avx2_popcnt                       cycles per 16-bit word:  2.360      2.349 1.005      +0.5%
pospopcnt_u16_avx2                              cycles per 16-bit word:  2.034      2.027 1.003      +0.3%
pospopcnt_u16_avx2_naive_counter                cycles per 16-bit word:  2.033      2.026 1.003      +0.3%
pospopcnt_u16_avx2_single                       cycles per 16-bit word:  2.483      2.476 1.003      +0.3%
pospopcnt_u16_avx2_lemire                       cycles per 16-bit word:  2.861      2.860 1.000      +0.0%
pospopcnt_u16_avx2_lemire2                      cycles per 16-bit word:  1.689      1.689 1.000      +0.0%
pospopcnt_u16_avx2_blend_popcnt                 cycles per 16-bit word:  1.102      1.100 1.002      +0.2%
pospopcnt_u16_avx2_blend_popcnt_unroll4         cycles per 16-bit word:  0.840      0.841 0.999      -0.1%
pospopcnt_u16_avx2_blend_popcnt_unroll8         cycles per 16-bit word:  0.750      0.750 1.000      +0.0%
pospopcnt_u16_avx2_blend_popcnt_unroll16        cycles per 16-bit word:  0.729      0.730 0.999      -0.1%
pospopcnt_u16_avx2_adder_forest                 cycles per 16-bit word:  0.428      0.418 1.024      +2.3%
pospopcnt_u16_avx2_harvey_seal                  cycles per 16-bit word:  0.221      0.212 1.042      +4.1%
pospopcnt_u16_avx512                            cycles per 16-bit word:  1.562      1.506 1.037      +3.6%
pospopcnt_u16_avx512bw_popcnt32_mask            cycles per 16-bit word:  0.911      0.819 1.112     +10.1%
pospopcnt_u16_avx512bw_popcnt64_mask            cycles per 16-bit word:  0.887      0.832 1.066      +6.2%
pospopcnt_u16_avx512_masked_ops                 cycles per 16-bit word:  1.836      1.834 1.001      +0.1%
pospopcnt_u16_avx512_popcnt                     cycles per 16-bit word:  1.674      1.667 1.004      +0.4%
pospopcnt_u16_avx512bw_blend_popcnt             cycles per 16-bit word:  0.841      0.758 1.109      +9.9%
pospopcnt_u16_avx512bw_blend_popcnt_unroll4     cycles per 16-bit word:  0.651      0.627 1.038      +3.7%
pospopcnt_u16_avx512bw_blend_popcnt_unroll8     cycles per 16-bit word:  0.562      0.558 1.007      +0.7%
pospopcnt_u16_avx512_mula2                      cycles per 16-bit word:  0.519      0.517 1.004      +0.4%
pospopcnt_u16_avx512bw_adder_forest             cycles per 16-bit word:  0.294      0.281 1.046      +4.4%
pospopcnt_u16_avx512bw_harvey_seal              cycles per 16-bit word:  0.136      0.123 1.106      +9.6%
pospopcnt_u16_avx512vbmi_harvey_seal            cycles per 16-bit word:  0.139      0.125 1.112     +10.1%
lemire commented 5 years ago

For improved portability, I recommend the following functions:

#include <stdlib.h>   // posix_memalign, free
#if defined(_MSC_VER) || defined(__MINGW32__) || defined(__MINGW64__)
#include <malloc.h>   // _aligned_malloc/_aligned_free, __mingw_aligned_malloc/free
#endif

// portable version of posix_memalign
static inline void *aligned_malloc(size_t alignment, size_t size) {
    void *p;
#ifdef _MSC_VER
    p = _aligned_malloc(size, alignment);
#elif defined(__MINGW32__) || defined(__MINGW64__)
    p = __mingw_aligned_malloc(size, alignment);
#else
    if (posix_memalign(&p, alignment, size) != 0) { return nullptr; }
#endif
    return p;
}

static inline void aligned_free(void *memblock) {
    if(memblock == nullptr) { return; }
#ifdef _MSC_VER
    _aligned_free(memblock);
#elif defined(__MINGW32__) || defined(__MINGW64__)
    __mingw_aligned_free(memblock);
#else
    free(memblock);
#endif
}

That is what we use in simdjson.
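
For the benchmark buffers, usage would look roughly like this (a sketch only; the buffer size is arbitrary, and the commented pospopcnt_u16 call stands in for whichever kernel is being measured):

#include <stdint.h>
#include <string.h>

/* Hypothetical setup: a 64-byte-aligned buffer of n 16-bit words and a
   16-entry counter array, one counter per bit position. */
int run_example(uint32_t n) {
    uint32_t flags[16];
    memset(flags, 0, sizeof(flags));
    uint16_t *data = (uint16_t *)aligned_malloc(64, n * sizeof(uint16_t));
    if (data == NULL) return -1;
    memset(data, 0, n * sizeof(uint16_t));
    /* ... fill data, then e.g. pospopcnt_u16(data, n, flags); ... */
    aligned_free(data);
    return 0;
}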

mklarqvist commented 5 years ago

@WojciechMula Great work! Noticeable differences, that's for sure. I wrongly assumed we did this already! It's been implemented in the non-instrumented benchmark since forever.

mklarqvist commented 5 years ago

@WojciechMula Could you add the code suggested by @lemire, and I will merge this PR.

lemire commented 5 years ago

I wrongly assumed we did this already! It's been implemented in the non-instrumented benchmark since forever.

I don't think we want to assume that the uint16_t * arrays are aligned on cache lines without qualification. If our results depend on that, we need to disclose it... because actual applications typically won't meet this requirement. It is an engineering constraint. It is not a trivial constraint.

Though @WojciechMula is not entirely explicit, I don't think that's what he has in mind when he writes the following...

"I think we might try to align pointers. If the address is even, it's possible -- we have to process 31 - (address & 0x3f) / 2 input words with a scalar code. Or round down the address and do masked load (making the code address-sanitizer-unfriendly)."

...

I think that he means that our functions should be fixed so that they skip the first few words... up to the point where they are aligned on a cache line... and then proceed from there. That is, he is hinting that our software should be better written.

That's almost entirely trivial software-wise if we have 16-bit alignment... and 16-bit alignment is a far more reasonable assumption.
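
A minimal sketch of that fix-up, assuming the input is at least 2-byte aligned and the repository's (data, len, flags) calling convention; the wrapper name and the choice of scalar/SIMD kernels are placeholders:

#include <stdint.h>

/* Sketch: peel off the unaligned head with the scalar kernel until the
   pointer reaches a 64-byte boundary, then hand the aligned remainder to
   the vectorized kernel. Assumes both kernels accumulate into flags. */
static void pospopcnt_u16_align64(const uint16_t *data, uint32_t len, uint32_t *flags) {
    uint32_t head = (uint32_t)(((64 - ((uintptr_t)data & 0x3f)) & 0x3f) / 2);
    if (head > len) head = len;
    pospopcnt_u16_scalar_naive(data, head, flags);                       /* unaligned prefix */
    pospopcnt_u16_avx512bw_harvey_seal(data + head, len - head, flags);  /* aligned body */
}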

mklarqvist commented 5 years ago

I don't think we want to assume that the uint16_t * arrays are aligned on cache lines without qualification. If our results depend on that, we need to disclose it... because actual applications typically won't meet this requirement. It is an engineering constraint. It is not a trivial constraint.

Results are reported using unaligned memory (instrumented_benchmark).

I think that he means that our functions should be fixed so that they skip the first few words... up to the point where they are aligned on a cache line... and then proceed from there.

This is interesting. 10% is attractive if generalizable.

lemire commented 5 years ago

10% is attractive if generalizable.

I suspect it is. We are just missing a little bit of engineering effort.

WojciechMula commented 5 years ago

@mklarqvist I included the procedures Daniel proposed and handled the #ifdefs a bit better.

I have an idea for wrapping our existing SIMD procedures in a generic macro that would take care of alignment and properly pass the pointers & lengths to the main SIMD function and the scalar fallback. I will open a new MR for this.
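
One hypothetical shape such a wrapper could take (it just packages the head-peeling prologue sketched earlier as a macro so any SIMD/scalar pair can be combined; all names are placeholders):

/* Emit a wrapper that peels the unaligned head with SCALAR and runs SIMD on
   the 64-byte-aligned remainder; both are assumed to accumulate into flags. */
#define POSPOPCNT_WRAP_ALIGNED(NAME, SIMD, SCALAR)                                 \
    static void NAME(const uint16_t *data, uint32_t len, uint32_t *flags) {        \
        uint32_t head = (uint32_t)(((64 - ((uintptr_t)data & 0x3f)) & 0x3f) / 2);  \
        if (head > len) head = len;                                                 \
        SCALAR(data, head, flags);                                                  \
        SIMD(data + head, len - head, flags);                                       \
    }

/* e.g. POSPOPCNT_WRAP_ALIGNED(pospopcnt_u16_hs_aligned,
                                pospopcnt_u16_avx512bw_harvey_seal,
                                pospopcnt_u16_scalar_naive) */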