[cuda] Assume Maxwell and Pascal are dual issue

The CUDA C Programming Guide says that recent (5.x and later) architectures can only issue a single instruction per cycle, but other NVIDIA documentation[1] says that dual issue is still possible. The latter is consistent with the generated SASS assembly, as well as with practical measurements: the bounds computed by the performance model are far too high for some kernels when single issue is assumed, and assuming dual issue drops them to reasonable levels.

As noted in the NVIDIA blog post[1], dual issue on those architectures is apparently subject to more limitations, such as only being able to issue one load and one arithmetic operation at the same time. SASS examination suggests there might be additional restrictions; for example, it looks like 128-bit loads can't be dual-issued in some cases. This is not an issue, since the performance model is optimistic anyway: ignoring these restrictions only makes the bounds more optimistic.

Hence, this patch changes the performance model to ignore the Programming Guide and assume dual-issue for those architectures.
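
To illustrate why the issue width matters for the bounds, here is a minimal sketch of an issue-throughput bound. It is not the actual telamon code: `IssueModel`, its fields, and `issue_bound_cycles` are hypothetical names used only for this example, and the scheduler count is an illustrative parameter.

```rust
/// Hypothetical description of how many instructions a multiprocessor can issue.
/// These names are assumptions for illustration, not telamon's real types.
#[derive(Clone, Copy)]
struct IssueModel {
    /// Number of warp schedulers per multiprocessor (illustrative parameter).
    schedulers: u64,
    /// Instructions each scheduler may issue per cycle:
    /// 1 for single issue, 2 for dual issue.
    issue_width: u64,
}

/// Optimistic lower bound on the number of cycles needed to issue
/// `warp_instructions`, ignoring every dual-issue pairing restriction.
fn issue_bound_cycles(model: IssueModel, warp_instructions: u64) -> u64 {
    let per_cycle = model.schedulers * model.issue_width;
    // Ceiling division: a partially filled cycle still counts as a cycle.
    (warp_instructions + per_cycle - 1) / per_cycle
}
```

With four schedulers and 1,000,000 warp instructions, this sketch gives 250,000 cycles under single issue and 125,000 under dual issue. Because the pairing restrictions are ignored, the dual-issue value can only underestimate the real issue time, which is acceptable for an optimistic bound.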

1: https://devblogs.nvidia.com/5-things-you-should-know-about-new-maxwell-gpu-architecture/

ulysseB / telamon

[cuda] Assume Maxwell and Pascal are dual issue #283