There is an implementation of this in the attic that definitely performs better than the full version. It would be nicer still to have a single version that works well in both cases; this might be achieved by calculating max(bitlen(exp[i])) over all the exponents and starting the main loop from that bit position.
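A rough sketch of that idea, assuming the routine in question is a left-to-right square-and-multiply exponentiation over a batch of exponents (the issue itself does not say so). This is plain host C++ on 64-bit words, not the cuda-fixnum API; `bitlen`, `batch_modexp` and the restriction to moduli below 2^32 are all hypothetical choices made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Number of significant bits in x (0 when x == 0).
static int bitlen(std::uint64_t x) {
    int n = 0;
    while (x != 0) {
        ++n;
        x >>= 1;
    }
    return n;
}

// Hypothetical batch helper: returns out[i] = base[i]^exp[i] mod m.
// Assumes 0 < m < 2^32 so that every product fits in 64 bits.
static std::vector<std::uint64_t> batch_modexp(
        const std::vector<std::uint64_t>& base,
        const std::vector<std::uint64_t>& exp,
        std::uint64_t m) {
    assert(base.size() == exp.size());
    assert(m != 0 && m < (std::uint64_t{1} << 32));

    // max(bitlen(exp[i])) over all the exponents: the shared starting point.
    int start = 0;
    for (std::uint64_t e : exp) {
        start = std::max(start, bitlen(e));
    }

    std::vector<std::uint64_t> out(base.size(), 1 % m);

    // Main loop starts at the highest bit set in any exponent, not at bit 63.
    for (int bit = start - 1; bit >= 0; --bit) {
        for (std::size_t i = 0; i < base.size(); ++i) {
            out[i] = (out[i] * out[i]) % m;             // square
            if ((exp[i] >> bit) & 1) {
                out[i] = (out[i] * (base[i] % m)) % m;  // multiply
            }
        }
    }
    return out;
}
```

With this shape every element runs the same number of iterations, so short exponents no longer pay for the full width, while elements whose exponent is shorter than the batch maximum only spend a few extra squarings of 1. Whether that matches the attic version's performance would need measuring.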
From https://github.com/data61/cuda-fixnum/issues/25: