Use SIMD for encoding - Githubissues

The code below handles the simplified case of ASCII words separated by exactly one space. The version that uses _mm512_mask_compressstoreu_epi8 runs at barely 190 MB/s. Using separate calls to _mm512_mask_compress_epi8 and _mm512_storeu_epi8 instead brings the throughput to almost 900 MB/s, still far from 1.4 GB/s for the scalar version.

With the separate calls, 90% of the cycles are spent in _mm512_i64gather_epi64 (vpgatherqq), which is known to be slow. If it took 0 cycles, the throughput would increase to up to 3.5 GB/s. However, I am not aware of any alternative approach to replace individual characters with their Morse encoding. More recent processors might have a lower penalty. However, it remains costly, according to https://uops.info/table.html.

let a = _mm512_maskz_expandloadu_epi8(0x0101010101010101, chunk.as_ptr().offset(i as isize) as *const i8);
let a = _mm512_i64gather_epi64(a, ASCII_TO_QWORD2.as_ptr() as *const u8, 8);
let zero = _mm512_setzero_si512();
let non_zero = _mm512_cmpneq_epi8_mask(a, zero);
// _mm512_mask_compressstoreu_epi8 is very, very, slow
let a = _mm512_mask_compress_epi8(zero, non_zero, a);
_mm512_storeu_epi8(buf.as_mut_ptr().offset(cur as isize) as *mut i8, a);
//_mm512_mask_compressstoreu_epi8(buf.as_mut_ptr(), non_zero, a);
let len = non_zero.count_ones();
cur += len as usize;

qsantos / ripmors

Use SIMD for encoding #9