Mistakes in sse4_despace_branchless_mask8

lemire / despacer

C library to remove white space from strings as fast as possible

BSD 3-Clause "New" or "Revised" License

151 stars 14 forks source link

Mistakes in sse4_despace_branchless_mask8 #6

Closed avodaniel closed 7 years ago

avodaniel commented 7 years ago

I found a few mistakes in sse4_despace_branchless_mask8(.), they probably don't influence benchamerk results but prevent it from generating correct string:

_mm_or_si128(m1,m2) should be replaced with _mm_and_si128(m1,m2), because 0xFF & x == x, 0xFF | x = 0xFF .
Tables are probably wrong too (on Intel CPUs). First line of despace_mask8_1 should look:

0xff,0xff,0xff,0xff,0xff,0xff,0xff,0xff,0x8,0x9,0xA,0xB,0xC,0xD,0xE,0xF,

instead of:

0xff,0xff,0xff,0xff,0xff,0xff,0xff,0xff,0x7,0x6,0x5,0x4,0x3,0x2,0x1,0x0,

First line of despace_mask8_2 should look:

0x0,0x1,0x2,0x3,0x4,0x5,0x6,0x7,0xff,0xff,0xff,0xff,0xff,0xff,0xff,0xff,

instead of:

0xf,0xe,0xd,0xc,0xb,0xa,0x9,0x8,0xff,0xff,0xff,0xff,0xff,0xff,0xff,0xff,

lemire commented 7 years ago

Thanks. I will correct these things and add unit tests soon.

lemire commented 7 years ago

I have added unit tests, and removed sse4_despace_branchless_mask8 since, as you point out, it is buggy. I don't have time to fix it right now. I am closing this issue, if you have time to fix sse4_despace_branchless_mask8, please let me know.

aqrit commented 7 years ago

in my fork,

all possible space/not_space combinations are checked. see the fillwithtext() and test_time() functions.

despace_ssse3_lut_1kb should be comparable to sse4_despace_branchless_mask8 for speed but with a much smaller table.

the popcnt instruction gives a noticeable but small benefit, however it is not really a sse4.2 instruction... IMO it is not worth it over plain ssse3.

using avx2's vpermd one can get good speed using only in-register luts.

           ssse3_lut_1mb:   66581270
            avx2_lut_1mb:   65218578
             avx2_vpermd:   66947818

lemire commented 7 years ago

@aqrit Wow. I merged your code which now benefits from unit testing (though not of your better, albeit slower, benchmark).

I am very impressed with avx2_vpermd. This is remarkable work.