Open qsantos opened 3 months ago
The gather operation is necessary to load the 8 bytes from ASCII_TO_QWORD2
that represent up to 8 Morse symbols for the character in ASCII encoding.
However, using the binary representation used in decoding (0 for dit, 1 for dash), the Morse encoding of each character could be packed in a single byte. With 128 ASCII characters, this totals 1024 bits. This could fit in two ZMM registers. Look-up can then be achieved using _mm512_permutex2var_epi8
.
The code below handles the simplified case of ASCII words separated by exactly one space. The version that uses
_mm512_mask_compressstoreu_epi8
runs at barely 190 MB/s. Using separate calls to_mm512_mask_compress_epi8
and_mm512_storeu_epi8
instead brings the throughput to almost 900 MB/s, still far from 1.4 GB/s for the scalar version.With the separate calls, 90% of the cycles are spent in
_mm512_i64gather_epi64
(vpgatherqq
), which is known to be slow. If it took 0 cycles, the throughput would increase to up to 3.5 GB/s. However, I am not aware of any alternative approach to replace individual characters with their Morse encoding. More recent processors might have a lower penalty. However, it remains costly, according to https://uops.info/table.html.