Closed rhpvorderman closed 1 year ago
Nice! I noticed this "#ifdef __SSE2__ ... #endif while ..." pattern in the other PR and found it quite nice.
Yes, sometimes even without vectors you want an unrolled loop that does multiple operations per iteration, followed by one that does a single operation. This pattern helps a lot with that. Getting rid of a loop control variable also sometimes means faster execution. So it is a big win all around, even without vectorization.
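For readers who have not seen the other PR, here is a minimal sketch of what that pattern can look like. It is not the code from this PR: the function name `is_ascii` and the ASCII check are hypothetical, chosen only to keep the example short. The vector loop runs while at least 16 bytes remain, and the plain `while` loop after the `#endif` handles the tail, or the whole buffer when SSE2 is not available.

```c
#include <stddef.h>
#include <stdint.h>
#ifdef __SSE2__
#include <emmintrin.h>
#endif

/* Hypothetical example: return 1 if every byte is below 0x80, else 0. */
static int is_ascii(const uint8_t *data, size_t length)
{
    const uint8_t *cursor = data;
    const uint8_t *end_ptr = data + length;
#ifdef __SSE2__
    /* Vector loop: 16 bytes per iteration. The pointer difference is a
       signed ptrdiff_t, so this also behaves for buffers shorter than 16. */
    while (end_ptr - cursor >= 16) {
        __m128i chunk = _mm_loadu_si128((const __m128i *)cursor);
        /* movemask collects the high bit of each byte; nonzero means
           at least one non-ASCII byte in this chunk. */
        if (_mm_movemask_epi8(chunk)) {
            return 0;
        }
        cursor += 16;
    }
#endif
    /* Scalar loop: the tail, or everything when SSE2 is not compiled in. */
    while (cursor < end_ptr) {
        if (*cursor & 0x80) {
            return 0;
        }
        cursor++;
    }
    return 1;
}
```

Only the cursor pointer moves, so there is no separate index variable to maintain, and both loops share the same end condition.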
Unaligned loads perform well on x86_64
Recently I have been writing quite a lot of vectorized code, so I decided to update my very first attempt at it. This version is certainly much simpler. I did a quick check and pointer-sized integers are signed by default (at least on my platform, intptr_t is a long, not an unsigned long), and the difference of two pointers is a signed ptrdiff_t. So subtracting from end_ptr as in this code will simply work.
Daniel Lemire ran a test and found no difference between unaligned and aligned loads: https://lemire.me/blog/2012/05/31/data-alignment-for-speed-myth-or-reality/. That was quite some time ago. I also did some reading lately and found it confirmed that AMD and Intel specifically altered their architectures to make sure unaligned loads are just as fast. Data alignment is simply not a speed issue anymore; the difference is not measurable. If anything, unaligned loads come out faster, because you can start using vector instructions right away rather than paying for an alignment loop first.
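To make the "no alignment loop" point concrete, here is a small sketch, again with a hypothetical function (`count_byte`) rather than the code in this PR. It uses `_mm_loadu_si128`, the unaligned SSE2 load, from the very first byte, so the caller can pass any address, including a deliberately misaligned one such as `buffer + 1`.

```c
#include <stddef.h>
#include <stdint.h>
#ifdef __SSE2__
#include <emmintrin.h>
#endif

/* Hypothetical example: count how many bytes equal `needle`. */
static size_t count_byte(const uint8_t *data, size_t length, uint8_t needle)
{
    const uint8_t *cursor = data;
    const uint8_t *end_ptr = data + length;
    size_t count = 0;
#ifdef __SSE2__
    const __m128i needles = _mm_set1_epi8((char)needle);
    /* No alignment prologue: the unaligned load accepts any address. */
    while (end_ptr - cursor >= 16) {
        __m128i chunk = _mm_loadu_si128((const __m128i *)cursor);
        unsigned int mask =
            (unsigned int)_mm_movemask_epi8(_mm_cmpeq_epi8(chunk, needles));
        /* Count the set bits in the mask, one matching byte per bit. */
        while (mask) {
            mask &= mask - 1;
            count++;
        }
        cursor += 16;
    }
#endif
    /* Scalar loop for the remaining bytes. */
    while (cursor < end_ptr) {
        count += (*cursor == needle);
        cursor++;
    }
    return count;
}
```

With aligned loads (`_mm_load_si128`) the function would need an extra scalar loop up front to reach a 16-byte boundary; here that loop simply does not exist.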
I did some quick testing and found no speed difference between this code and the old code, so this change saves quite a few lines at no cost.