At the moment the exponent window array is allocated with malloc once per slot (see modexp<...>::modexp(...)), even though it doesn't make much sense to use the function unless all the exponents in the warp (or even the thread block) are the same, in which case a single copy of the array could be shared.
Allocating all that data with device-side malloc is also computationally expensive.
It might also exceed the 8 MB default heap size, in which case the heap size would have to be managed manually from outside the modexp function call, which would be a pain in the neck.
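To get a rough feel for when the heap limit becomes a problem, here is a minimal host-side C++ sketch that estimates the heap demand of the per-slot allocations. Everything in it is an assumption made for illustration, not taken from the library: that each exponent window is stored as one 32-bit word, the 5-bit window width, the 2048-bit exponent, and the function names. Allocator overhead and alignment are ignored, so the numbers are lower bounds. If the estimate exceeds the default, the limit can be raised from the host with the CUDA runtime call cudaDeviceSetLimit(cudaLimitMallocHeapSize, bytes), which has to happen before any kernel that uses device-side malloc is launched.

```cpp
#include <cstddef>
#include <cstdint>

// Default size of the CUDA device malloc heap (8 MB).
constexpr std::size_t kDefaultHeapBytes = 8u * 1024u * 1024u;

// Number of windows needed to cover an exponent of exp_bits bits
// (hypothetical model: fixed-width windows, rounded up).
constexpr std::size_t window_count(std::size_t exp_bits, std::size_t window_bits) {
    return (exp_bits + window_bits - 1) / window_bits;
}

// Bytes in one window array, ASSUMING one 32-bit word per window.
constexpr std::size_t window_array_bytes(std::size_t exp_bits, std::size_t window_bits) {
    return window_count(exp_bits, window_bits) * sizeof(std::uint32_t);
}

// Total heap demand when every group of slots_per_array slots shares one array.
// slots_per_array == 1 models the current one-allocation-per-slot behaviour.
constexpr std::size_t heap_demand_bytes(std::size_t n_slots, std::size_t slots_per_array,
                                        std::size_t exp_bits, std::size_t window_bits) {
    return ((n_slots + slots_per_array - 1) / slots_per_array)
           * window_array_bytes(exp_bits, window_bits);
}

// 8192 slots, 2048-bit exponents, 5-bit windows (all illustrative values):
// one array per slot does not fit in the default heap ...
static_assert(heap_demand_bytes(8192, 1, 2048, 5) > kDefaultHeapBytes,
              "per-slot allocation exceeds the default heap");
// ... whereas one array shared by every 8 slots fits comfortably.
static_assert(heap_demand_bytes(8192, 8, 2048, 5) < kDefaultHeapBytes,
              "shared allocation fits in the default heap");
```

Under these assumptions, sharing one array per group of slots cuts the heap demand by the group size, which is the main argument for allocating per warp or per thread block rather than per slot.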
From https://github.com/data61/cuda-fixnum/issues/41: