Open sharpobject opened 10 months ago
On my Haswell server, these seem to be the implementations actually affected by this patch (speed before -> after):
avx2_despace_branchless(buffer, N) : base frequency 3.91 GHz speed: 10.80 GB/s -> 11.08 GB/s
avx2_despace_branchless(buffer, N) : base frequency 3.91 GHz speed: 10.83 GB/s -> 11.07 GB/s
avx2_despace_branchless(buffer, N) : base frequency 3.91 GHz speed: 10.88 GB/s -> 11.03 GB/s
sse4_despace_branchless_u2(buffer, N) : base frequency 3.91 GHz speed: 8.71 GB/s -> 8.50 GB/s
sse4_despace_branchless_u2(buffer, N) : base frequency 3.91 GHz speed: 8.67 GB/s -> 8.48 GB/s
sse4_despace_branchless_u2(buffer, N) : base frequency 3.91 GHz speed: 8.67 GB/s -> 8.48 GB/s
sse4_despace_branchless_u4(buffer, N) : base frequency 3.91 GHz speed: 8.77 GB/s -> 8.47 GB/s
sse4_despace_branchless_u4(buffer, N) : base frequency 3.91 GHz speed: 8.77 GB/s -> 8.50 GB/s
sse4_despace_branchless_u4(buffer, N) : base frequency 3.91 GHz speed: 8.74 GB/s -> 8.36 GB/s
sse4_despace_skinny_u4(buffer, N) : base frequency 3.91 GHz speed: 7.56 GB/s -> 7.72 GB/s
sse4_despace_skinny_u4(buffer, N) : base frequency 3.91 GHz speed: 7.56 GB/s -> 7.80 GB/s
sse4_despace_skinny_u4(buffer, N) : base frequency 3.91 GHz speed: 7.56 GB/s -> 7.69 GB/s
sse42_despace_branchless(buffer, N) : base frequency 3.91 GHz speed: 7.82 GB/s -> 7.85 GB/s
sse42_despace_branchless(buffer, N) : base frequency 3.91 GHz speed: 7.82 GB/s -> 7.85 GB/s
sse42_despace_branchless(buffer, N) : base frequency 3.91 GHz speed: 7.82 GB/s -> 7.85 GB/s
sse42_despace_branchless_lookup(buffer, N) : base frequency 3.91 GHz speed: 7.09 GB/s -> 7.85 GB/s
sse42_despace_branchless_lookup(buffer, N) : base frequency 3.91 GHz speed: 7.09 GB/s -> 7.85 GB/s
sse42_despace_branchless_lookup(buffer, N) : base frequency 3.91 GHz speed: 7.09 GB/s -> 7.85 GB/s
Though it's disappointing that I've made a couple of them slower...
Sorry, I think this needs more work to avoid doing any harm. I'll try to come back to this in a couple of days.
Use unsigned types to store the results of popcnt and movemask. Otherwise the compiler emits a movsx to sign-extend these values when we subsequently use them as indices into an array (or similar), which is unnecessary in almost all cases and incorrect if it ever changes the value.
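For illustration, here is a minimal sketch of the pattern, not the actual diff. The helper names (`space_mask16`, `count_spaces16`, `count_spaces16_lookup`) and the nibble table are hypothetical and do not come from the repository. It assumes an x86-64 target with SSE2 and a GCC/Clang-style `__builtin_popcount`. The point is only that the `int` returned by `_mm_movemask_epi8` and by the popcount goes straight into an unsigned variable, so a later use as an array index needs zero-extension rather than sign-extension (on x86-64, a write to a 32-bit register already clears the upper half).

```c
#include <assert.h>
#include <emmintrin.h> /* SSE2: _mm_loadu_si128, _mm_cmpeq_epi8, _mm_movemask_epi8 */
#include <stdint.h>

/* Hypothetical helper: bit i of the result is set when block[i] is a space. */
static inline uint32_t space_mask16(const uint8_t *block) {
    const __m128i spaces = _mm_set1_epi8(' ');
    const __m128i x = _mm_loadu_si128((const __m128i *)block);
    /* _mm_movemask_epi8 returns int; store it in an unsigned type right away. */
    uint32_t mask = (uint32_t)_mm_movemask_epi8(_mm_cmpeq_epi8(x, spaces));
    assert(mask <= 0xFFFFu); /* only the low 16 bits can ever be set */
    return mask;
}

/* Hypothetical helper: number of spaces in a 16-byte block. */
static inline uint32_t count_spaces16(const uint8_t *block) {
    /* __builtin_popcount (GCC/Clang) also returns int; keep the result unsigned. */
    return (uint32_t)__builtin_popcount(space_mask16(block));
}

/* Number of set bits in each 4-bit value; illustration only. */
static const uint8_t nibble_bits[16] = {0, 1, 1, 2, 1, 2, 2, 3,
                                        1, 2, 2, 3, 2, 3, 3, 4};

/* Same count, but using the mask as a table index. Because the index is
 * unsigned, the compiler has no reason to sign-extend it before the load. */
static inline uint32_t count_spaces16_lookup(const uint8_t *block) {
    uint32_t mask = space_mask16(block);
    return (uint32_t)nibble_bits[mask & 0xFu] +
           (uint32_t)nibble_bits[(mask >> 4) & 0xFu] +
           (uint32_t)nibble_bits[(mask >> 8) & 0xFu] +
           (uint32_t)nibble_bits[(mask >> 12) & 0xFu];
}
```

Whether this helps or hurts a given kernel is still something to measure. The numbers above suggest the effect is not uniform across implementations.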