lemire / despacer

C library to remove white space from strings as fast as possible
BSD 3-Clause "New" or "Revised" License

Avoid unnecessary sign-extending instructions #20

Open sharpobject opened 10 months ago

sharpobject commented 10 months ago

Use unsigned types to store the results of popcnt and movemask. Otherwise, when we later use these values as indices into an array (or widen them in some other way), the compiler has to emit a movsx to sign-extend them. That instruction is unnecessary in almost all cases, and in the rare case where it actually changes the value, the result is incorrect.
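A minimal, portable sketch of the idea, assuming a scalar stand-in for the SIMD compare + movemask (the real kernels call intrinsics such as _mm_movemask_epi8, which returns int). The helper names (is_space, space_mask16, despace_block16, widen_from_int, widen_from_u32) are hypothetical and not part of despacer, and the set of white-space characters is an assumption for illustration. The last two helpers show why the signed version can be wrong, not just slower: with a 32-bit AVX2 mask, bit 31 makes an int negative, and widening it to a 64-bit size_t sign-extends it into a huge index.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical predicate: treat ' ', '\n' and '\r' as white space. */
static int is_space(uint8_t c) { return c == ' ' || c == '\n' || c == '\r'; }

/* Scalar stand-in for a SIMD compare + movemask over a 16-byte block.
   The mask is kept in an unsigned type (the intrinsic itself returns int). */
static uint32_t space_mask16(const uint8_t *block) {
    uint32_t mask = 0;
    for (unsigned i = 0; i < 16; i++)
        mask |= (uint32_t)is_space(block[i]) << i;
    return mask;
}

/* Despace one 16-byte block into out and return the number of bytes kept.
   mask and spaces are unsigned, so widening them to form an address or an
   offset is a zero-extension (free on x86-64) rather than a movsx. */
static size_t despace_block16(const uint8_t *block, uint8_t *out) {
    uint32_t mask = space_mask16(block);
    unsigned spaces = (unsigned)__builtin_popcount(mask); /* GCC/Clang builtin */
    size_t w = 0;
    for (unsigned i = 0; i < 16; i++)
        if (!(mask & (1u << i))) out[w++] = block[i];
    /* In the SIMD code, mask would index a shuffle table at this point. */
    return 16u - spaces;
}

/* Widening a mask stored in a signed int: sign-extends (movsx on x86-64),
   which yields a huge value on 64-bit targets whenever bit 31 is set. */
static size_t widen_from_int(int m) { return (size_t)m; }

/* Widening the same bits stored as uint32_t: always the intended value. */
static size_t widen_from_u32(uint32_t m) { return (size_t)m; }
```

For example, the block "a b\nc\rd efghijkl" yields the mask 0xAA and despaces to the 12 bytes "abcdefghijkl". Only the codegen for the widening step changes between the signed and unsigned versions; the loop is there just to make the sketch self-contained.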

sharpobject commented 10 months ago

The implementations actually affected by this patch appear to be the following, measured on my Haswell server (before -> after, three runs per implementation):

avx2_despace_branchless(buffer, N)                :  base frequency  3.91 GHz speed:  10.80 GB/s -> 11.08 GB/s
avx2_despace_branchless(buffer, N)                :  base frequency  3.91 GHz speed:  10.83 GB/s -> 11.07 GB/s
avx2_despace_branchless(buffer, N)                :  base frequency  3.91 GHz speed:  10.88 GB/s -> 11.03 GB/s
sse4_despace_branchless_u2(buffer, N)             :  base frequency  3.91 GHz speed:  8.71 GB/s -> 8.50 GB/s
sse4_despace_branchless_u2(buffer, N)             :  base frequency  3.91 GHz speed:  8.67 GB/s -> 8.48 GB/s
sse4_despace_branchless_u2(buffer, N)             :  base frequency  3.91 GHz speed:  8.67 GB/s -> 8.48 GB/s
sse4_despace_branchless_u4(buffer, N)             :  base frequency  3.91 GHz speed:  8.77 GB/s -> 8.47 GB/s
sse4_despace_branchless_u4(buffer, N)             :  base frequency  3.91 GHz speed:  8.77 GB/s -> 8.50 GB/s
sse4_despace_branchless_u4(buffer, N)             :  base frequency  3.91 GHz speed:  8.74 GB/s -> 8.36 GB/s
sse4_despace_skinny_u4(buffer, N)                 :  base frequency  3.91 GHz speed:  7.56 GB/s -> 7.72 GB/s
sse4_despace_skinny_u4(buffer, N)                 :  base frequency  3.91 GHz speed:  7.56 GB/s -> 7.80 GB/s
sse4_despace_skinny_u4(buffer, N)                 :  base frequency  3.91 GHz speed:  7.56 GB/s -> 7.69 GB/s
sse42_despace_branchless(buffer, N)               :  base frequency  3.91 GHz speed:  7.82 GB/s -> 7.85 GB/s
sse42_despace_branchless(buffer, N)               :  base frequency  3.91 GHz speed:  7.82 GB/s -> 7.85 GB/s
sse42_despace_branchless(buffer, N)               :  base frequency  3.91 GHz speed:  7.82 GB/s -> 7.85 GB/s
sse42_despace_branchless_lookup(buffer, N)        :  base frequency  3.91 GHz speed:  7.09 GB/s -> 7.85 GB/s
sse42_despace_branchless_lookup(buffer, N)        :  base frequency  3.91 GHz speed:  7.09 GB/s -> 7.85 GB/s
sse42_despace_branchless_lookup(buffer, N)        :  base frequency  3.91 GHz speed:  7.09 GB/s -> 7.85 GB/s
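To put the numbers above in relative terms, the change per implementation is (after - before) / before. This small sketch applies that to the first run of each function, with the values copied from the table; the struct and function names are illustrative only.

```c
#include <stdio.h>

/* Relative throughput change: (after - before) / before. */
static double relative_change(double before, double after) {
    return (after - before) / before;
}

struct row { const char *name; double before, after; };

/* First run of each implementation, copied from the table (GB/s). */
static const struct row first_runs[] = {
    {"avx2_despace_branchless",         10.80, 11.08},
    {"sse4_despace_branchless_u2",       8.71,  8.50},
    {"sse4_despace_branchless_u4",       8.77,  8.47},
    {"sse4_despace_skinny_u4",           7.56,  7.72},
    {"sse42_despace_branchless",         7.82,  7.85},
    {"sse42_despace_branchless_lookup",  7.09,  7.85},
};

/* Print each implementation's change as a signed percentage. */
static void print_changes(void) {
    for (size_t i = 0; i < sizeof first_runs / sizeof first_runs[0]; i++)
        printf("%-32s %+5.1f%%\n", first_runs[i].name,
               100.0 * relative_change(first_runs[i].before, first_runs[i].after));
}
```

On these first runs the change ranges from about -3.4% (sse4_despace_branchless_u4) and -2.4% (sse4_despace_branchless_u2) to about +10.7% (sse42_despace_branchless_lookup).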

It's disappointing, though, that I've made a couple of them slower (sse4_despace_branchless_u2 and sse4_despace_branchless_u4)...

sharpobject commented 10 months ago

Sorry, I think this needs more work to make sure it doesn't do any harm. I'll try to come back to it in a couple of days.