pq-code-package / mlkem-c-aarch64

ML-KEM implementation optimized for aarch64
https://pq-code-package.github.io/mlkem-c-aarch64/dev/bench
Apache License 2.0

`poly_tobytes()` is too slow #152

Closed hanno-becker closed 1 week ago

hanno-becker commented 1 week ago

Our adaptation

```c
void poly_tobytes(uint8_t r[KYBER_POLYBYTES], const poly *a)
{
    unsigned int i;
    uint16_t t0, t1;

    for (i = 0; i < KYBER_N / 2; i++)
    {
        // map to positive standard representatives
        // REF-CHANGE: Hoist signed-to-unsigned conversion into separate function
        t0 = scalar_signed_to_unsigned_q_16(a->coeffs[2 * i]);
        t1 = scalar_signed_to_unsigned_q_16(a->coeffs[2 * i + 1]);
        r[3 * i + 0] = (t0 >> 0);
        r[3 * i + 1] = (t0 >> 8) | (t1 << 4);
        r[3 * i + 2] = (t1 >> 4);
    }
}
```

of `poly_tobytes()` is very slow due to the calls to `scalar_signed_to_unsigned_q_16()` and, within there, `cmov_int16`. This accounts for a ~20% performance loss in ML-KEM-512 key generation on Apple M1.

When using the reference implementation

```c
void poly_tobytes(uint8_t r[KYBER_POLYBYTES], const poly *ap)
{
  unsigned int i;
  uint16_t t0, t1;
  const int16_t *a = (const int16_t*) ap;

  for(i=0;i<KYBER_N/2;i++) {
    // map to positive standard representatives
    t0  = a[2*i];
    t0 += ((int16_t)t0 >> 15) & KYBER_Q;
    t1 = a[2*i+1];
    t1 += ((int16_t)t1 >> 15) & KYBER_Q;
    r[3*i+0] = (t0 >> 0);
    r[3*i+1] = (t0 >> 8) | (t1 << 4);
    r[3*i+2] = (t1 >> 4);
  }
}
```

by contrast, the Apple compiler does an excellent job of vectorizing the code.

We don't want the compiler to introduce a branch for `((int16_t)t1 >> 15) & KYBER_Q`, but in practice this does not seem to happen.

hanno-becker commented 1 week ago

Resolved by #146