When running the test suite, modexp (CLNW) appears to be faster than multi_modexp (k-ary), at least in the 128- and 256-byte range. This is unexpected: CLNW branches on the bit pattern of the exponent, whereas k-ary does not.
Work out what's going on. Replace k-ary if necessary.
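
For reference while investigating, here is a minimal host-side sketch of the two strategies on 64-bit integers. It is not the cuda-fixnum implementation: the names `kary_modexp`, `clnw_modexp`, `binary_modexp`, `Result` and the multiplication counter are made up for illustration, and `unsigned __int128` assumes GCC or Clang. It only shows the structural difference: fixed-window k-ary does the same sequence of operations for every exponent, while CLNW skips runs of zero bits and so its work depends on the exponent.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Modular multiply via a 128-bit intermediate (GCC/Clang extension).
static uint64_t mulmod(uint64_t a, uint64_t b, uint64_t m) {
    return (uint64_t)((unsigned __int128)a * b % m);
}

struct Result { uint64_t value; int mults; };  // mults counts mulmod calls

// Plain square-and-multiply, used only as a correctness reference.
uint64_t binary_modexp(uint64_t x, uint64_t e, uint64_t m) {
    uint64_t r = 1 % m;
    for (int i = 63; i >= 0; --i) {
        r = mulmod(r, r, m);
        if ((e >> i) & 1) r = mulmod(r, x, m);
    }
    return r;
}

// Fixed-window k-ary: k squarings and one table multiply per window,
// performed unconditionally (even when the digit is 0).
Result kary_modexp(uint64_t x, uint64_t e, uint64_t m, int k) {
    assert(k >= 1 && k <= 16);
    int mults = 0;
    std::vector<uint64_t> table((size_t)1 << k);
    table[0] = 1 % m;
    for (size_t i = 1; i < table.size(); ++i) {
        table[i] = mulmod(table[i - 1], x, m); ++mults;
    }
    uint64_t r = 1 % m;
    int nwin = (64 + k - 1) / k;
    for (int w = nwin - 1; w >= 0; --w) {
        for (int s = 0; s < k; ++s) { r = mulmod(r, r, m); ++mults; }
        uint64_t digit = (e >> (w * k)) & ((1ull << k) - 1);
        r = mulmod(r, table[digit], m); ++mults;
    }
    return {r, mults};
}

// CLNW: partition the exponent from the least significant bit into
// zero windows (single 0 bits) and nonzero windows of length k that
// start at a 1 bit, so every nonzero window value is odd.
Result clnw_modexp(uint64_t x, uint64_t e, uint64_t m, int k) {
    assert(k >= 1 && k <= 16);
    int mults = 0;
    // Table of odd powers only: odd[j] = x^(2j+1) mod m.
    std::vector<uint64_t> odd((size_t)1 << (k - 1));
    odd[0] = x % m;
    uint64_t x2 = mulmod(x, x, m); ++mults;
    for (size_t j = 1; j < odd.size(); ++j) {
        odd[j] = mulmod(odd[j - 1], x2, m); ++mults;
    }
    std::vector<std::pair<uint64_t, int>> windows;  // (value, length)
    for (int i = 0; i < 64;) {
        if (((e >> i) & 1) == 0) { windows.push_back({0, 1}); i += 1; continue; }
        int len = (64 - i < k) ? 64 - i : k;
        windows.push_back({(e >> i) & ((1ull << k) - 1), len});
        i += len;
    }
    uint64_t r = 1 % m;
    for (size_t w = windows.size(); w-- > 0;) {
        for (int s = 0; s < windows[w].second; ++s) { r = mulmod(r, r, m); ++mults; }
        if (windows[w].first != 0) {  // data-dependent branch
            r = mulmod(r, odd[windows[w].first >> 1], m); ++mults;
        }
    }
    return {r, mults};
}
```

Comparing the `mults` field for the same inputs gives a rough feel for how much arithmetic each method performs. Whether a lower multiplication count outweighs the cost of the data-dependent branch on the GPU is exactly what still needs to be measured.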
From https://github.com/data61/cuda-fixnum/issues/43: