avodaniel closed 7 years ago
Thanks. I will correct these things and add unit tests soon.
I have added unit tests and removed sse4_despace_branchless_mask8 since, as you point out, it is buggy. I don't have time to fix it right now. I am closing this issue; if you have time to fix sse4_despace_branchless_mask8, please let me know.
in my fork, all possible space/not_space combinations are checked. see the fillwithtext() and test_time() functions.
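For readers without the fork at hand, here is a minimal sketch of what an exhaustive check over every space/not_space pattern of a 16-byte block could look like. It does not reproduce fillwithtext() or test_time(); the names despace_reference, despace_branchless, fill_block and count_mismatches are hypothetical, and only ' ' is treated as a space for simplicity.

```c
#include <stddef.h>
#include <string.h>

/* Reference: copy every byte that is not a space; returns the output length. */
size_t despace_reference(char *dst, const char *src, size_t len) {
    size_t out = 0;
    for (size_t i = 0; i < len; i++) {
        if (src[i] != ' ') dst[out++] = src[i];
    }
    return out;
}

/* Candidate under test (hypothetical): always write, advance only on non-space. */
size_t despace_branchless(char *dst, const char *src, size_t len) {
    size_t out = 0;
    for (size_t i = 0; i < len; i++) {
        dst[out] = src[i];
        out += (src[i] != ' ');
    }
    return out;
}

/* Build a 16-byte block: bit i of mask set means byte i is a space. */
void fill_block(char block[16], unsigned mask) {
    for (int i = 0; i < 16; i++) {
        block[i] = ((mask >> i) & 1u) ? ' ' : (char)('a' + i);
    }
}

/* Try all 2^16 patterns; returns how many give a different result. */
unsigned count_mismatches(void) {
    unsigned bad = 0;
    for (unsigned mask = 0; mask < 65536u; mask++) {
        char in[16], a[16], b[16];
        fill_block(in, mask);
        size_t la = despace_reference(a, in, 16);
        size_t lb = despace_branchless(b, in, 16);
        if (la != lb || memcmp(a, b, la) != 0) bad++;
    }
    return bad;
}
```

A SIMD kernel would take the place of despace_branchless here; because a 16-byte block has only 65536 possible patterns, the whole space can be covered rather than sampled.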
despace_ssse3_lut_1kb should be comparable to sse4_despace_branchless_mask8 for speed but with a much smaller table.
the popcnt instruction gives a noticeable but small benefit, however it is not really a sse4.2 instruction... IMO it is not worth it over plain ssse3.
using avx2's vpermd one can get good speed using only in-register luts.
ssse3_lut_1mb: 66581270
avx2_lut_1mb: 65218578
avx2_vpermd: 66947818
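As background for the popcnt remark, a rough sketch of where a bit count typically enters this kind of kernel: after a 16-byte block is compacted, the output pointer advances by the number of bytes kept, which is 16 minus the number of set bits in the space mask. The functions popcount16 and kept_bytes below are hypothetical, and the loop-based count is only a portable stand-in for the hardware instruction, not the code from the fork.

```c
#include <stdint.h>

/* Portable bit count for a 16-bit mask (stand-in for the popcnt instruction). */
unsigned popcount16(uint16_t m) {
    unsigned n = 0;
    while (m) {
        m = (uint16_t)(m & (m - 1)); /* clear the lowest set bit */
        n++;
    }
    return n;
}

/* Bytes kept from a 16-byte block when bit i of space_mask marks byte i as a space. */
unsigned kept_bytes(uint16_t space_mask) {
    return 16u - popcount16(space_mask);
}
```

Without popcnt, the same count can come from a small lookup table indexed by the mask, which is one reason the gain from the instruction may be modest.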
@aqrit Wow. I merged your code which now benefits from unit testing (though not of your better, albeit slower, benchmark).
I am very impressed with avx2_vpermd. This is remarkable work.
I found a few mistakes in sse4_despace_branchless_mask8. They probably don't influence the benchmark results, but they prevent it from generating the correct string:
_mm_or_si128(m1,m2) should be replaced with _mm_and_si128(m1,m2), because 0xFF & x == x, whereas 0xFF | x == 0xFF.
Tables are probably wrong too (on Intel CPUs). The first line of despace_mask8_1 should look like: 0xff,0xff,0xff,0xff,0xff,0xff,0xff,0xff,0x8,0x9,0xA,0xB,0xC,0xD,0xE,0xF, instead of:
The first line of despace_mask8_2 should look like: instead of:
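The AND-versus-OR point can be checked with a scalar model of the two intrinsics. This is a sketch under assumptions: example_m1 reuses the despace_mask8_1 line quoted above, while example_m2 is a hypothetical complementary mask (indices in the low half, 0xFF in the high half) and is not the actual despace_mask8_2 table.

```c
#include <stdint.h>

/* Scalar stand-in for _mm_and_si128: bitwise AND over 16 byte lanes. */
void combine_and(uint8_t out[16], const uint8_t m1[16], const uint8_t m2[16]) {
    for (int i = 0; i < 16; i++) out[i] = (uint8_t)(m1[i] & m2[i]);
}

/* Scalar stand-in for _mm_or_si128: bitwise OR over 16 byte lanes. */
void combine_or(uint8_t out[16], const uint8_t m1[16], const uint8_t m2[16]) {
    for (int i = 0; i < 16; i++) out[i] = (uint8_t)(m1[i] | m2[i]);
}

/* Mask line quoted in the thread: 0xFF in the low half, indices 8..15 in the high half. */
const uint8_t example_m1[16] = {0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
                                0x8,  0x9,  0xA,  0xB,  0xC,  0xD,  0xE,  0xF};

/* Hypothetical counterpart: indices 0..7 in the low half, 0xFF in the high half. */
const uint8_t example_m2[16] = {0x0,  0x1,  0x2,  0x3,  0x4,  0x5,  0x6,  0x7,
                                0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff};
```

With these inputs, combine_and yields the indices 0 through 15, so each half passes through the 0xFF lanes of the other mask unchanged. combine_or yields 0xFF in every lane, and since _mm_shuffle_epi8 zeroes any output byte whose index has its high bit set, a shuffle driven by that result would produce all zeros.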