qsantos / ripmors

Blazingly fast Morse code encoding/decoding
Apache License 2.0
13 stars 0 forks source link

Use SIMD for encoding #9

Open qsantos opened 3 months ago

qsantos commented 3 months ago

The code below handles the simplified case of ASCII words separated by exactly one space. The version that uses _mm512_mask_compressstoreu_epi8 runs at barely 190 MB/s. Using separate calls to _mm512_mask_compress_epi8 and _mm512_storeu_epi8 instead brings the throughput to almost 900 MB/s, still far from 1.4 GB/s for the scalar version.

With the separate calls, 90% of the cycles are spent in _mm512_i64gather_epi64 (vpgatherqq), which is known to be slow. If it took 0 cycles, the throughput would increase to up to 3.5 GB/s. However, I am not aware of any alternative approach to replace individual characters with their Morse encoding. More recent processors might have a lower penalty. However, it remains costly, according to https://uops.info/table.html.

let a = _mm512_maskz_expandloadu_epi8(0x0101010101010101, chunk.as_ptr().offset(i as isize) as *const i8);
let a = _mm512_i64gather_epi64(a, ASCII_TO_QWORD2.as_ptr() as *const u8, 8);
let zero = _mm512_setzero_si512();
let non_zero = _mm512_cmpneq_epi8_mask(a, zero);
// _mm512_mask_compressstoreu_epi8 is very, very, slow
let a = _mm512_mask_compress_epi8(zero, non_zero, a);
_mm512_storeu_epi8(buf.as_mut_ptr().offset(cur as isize) as *mut i8, a);
//_mm512_mask_compressstoreu_epi8(buf.as_mut_ptr(), non_zero, a);
let len = non_zero.count_ones();
cur += len as usize;
qsantos commented 3 months ago

The gather operation is necessary to load the 8 bytes from ASCII_TO_QWORD2 that represent up to 8 Morse symbols for the character in ASCII encoding.

However, using the binary representation used in decoding (0 for dit, 1 for dash), the Morse encoding of each character could be packed in a single byte. With 128 ASCII characters, this totals 1024 bits. This could fit in two ZMM registers. Look-up can then be achieved using _mm512_permutex2var_epi8.