```c
void poly_tobytes(uint8_t r[KYBER_POLYBYTES], const poly *a)
{
    unsigned int i;
    uint16_t t0, t1;
    for (i = 0; i < KYBER_N / 2; i++)
    {
        // map to positive standard representatives
        // REF-CHANGE: Hoist signed-to-unsigned conversion into separate function
        t0 = scalar_signed_to_unsigned_q_16(a->coeffs[2 * i]);
        t1 = scalar_signed_to_unsigned_q_16(a->coeffs[2 * i + 1]);
        // pack two 12-bit coefficients into three bytes
        r[3 * i + 0] = (t0 >> 0);
        r[3 * i + 1] = (t0 >> 8) | (t1 << 4);
        r[3 * i + 2] = (t1 >> 4);
    }
}
```
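For context, here is a minimal sketch of what the hoisted helper and the underlying conditional move might look like. `cmov_int16()` is named in the discussion below, but the signatures and bodies here are assumptions for illustration, not the actual code:

```c
#include <stdint.h>

#define KYBER_Q 3329

// Assumed signature: constant-time conditional move, setting *r = v iff
// b == 1. The mask is all-ones for b == 1 and all-zeros for b == 0, so no
// secret-dependent branch is needed.
static void cmov_int16(int16_t *r, int16_t v, uint16_t b)
{
    int16_t mask = -(int16_t)(b & 1);
    *r ^= mask & (*r ^ v);
}

// Sketch of the hoisted conversion: map a signed coefficient in
// (-KYBER_Q, KYBER_Q) to its standard representative in [0, KYBER_Q).
static uint16_t scalar_signed_to_unsigned_q_16(int16_t c)
{
    // Conditionally add KYBER_Q when c is negative, in constant time.
    cmov_int16(&c, (int16_t)(c + KYBER_Q), (uint16_t)(c < 0));
    return (uint16_t)c;
}
```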
Our adaptation of `poly_tobytes()` is very slow due to the calls to `scalar_signed_to_unsigned_q_16()` and, within there, `cmov_int16()`. This accounts for a ~20% performance loss on MLKEM-512 key generation on an Apple M1. When using the reference implementation instead, the Apple compiler does an amazing job vectorizing the code.
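For comparison, the reference implementation performs the conversion inline as a branchless shift-and-mask, along these lines (sketched from the Kyber reference code):

```c
void poly_tobytes(uint8_t r[KYBER_POLYBYTES], const poly *a)
{
    unsigned int i;
    uint16_t t0, t1;
    for (i = 0; i < KYBER_N / 2; i++)
    {
        // map to positive standard representatives: when the sign bit of t
        // is set, (int16_t)t >> 15 is all-ones and the mask selects KYBER_Q
        t0 = a->coeffs[2 * i];
        t0 += ((int16_t)t0 >> 15) & KYBER_Q;
        t1 = a->coeffs[2 * i + 1];
        t1 += ((int16_t)t1 >> 15) & KYBER_Q;
        r[3 * i + 0] = (t0 >> 0);
        r[3 * i + 1] = (t0 >> 8) | (t1 << 4);
        r[3 * i + 2] = (t1 >> 4);
    }
}
```

The loop body is branch-free and purely data-parallel, which is presumably why the auto-vectorizer handles it so well.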
We don't want the compiler to introduce a branch for `((int16_t)t1 >> 15) & KYBER_Q`, but in practice this does not seem to happen.
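As a sanity check, here is a small standalone test (illustrative, not from the codebase) that the shift-and-mask expression computes the canonical representative for every in-range input. It assumes arithmetic right shift of negative values, which is implementation-defined in C but universal on the targets of interest:

```c
#include <assert.h>
#include <stdint.h>

#define KYBER_Q 3329

int main(void)
{
    // Exhaustively check all signed representatives in (-KYBER_Q, KYBER_Q).
    for (int16_t t = -(KYBER_Q - 1); t < KYBER_Q; t++)
    {
        uint16_t u = (uint16_t)t;
        // Branchless conversion: adds KYBER_Q exactly when t is negative.
        u += ((int16_t)u >> 15) & KYBER_Q;
        assert(u < KYBER_Q);
        assert(u == (uint16_t)((t + KYBER_Q) % KYBER_Q));
    }
    return 0;
}
```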