The CUDA C Programming Guide says that recent (5.x and later)
architectures can issue a single instruction per cycle, but other NVIDIA
documentation[1] says that dual issue is still possible. This is
consistent with the generated SASS assembly, as well as with practical
measurements: the bounds from the performance model are far too high for
some kernels, and assuming dual issue drops them to reasonable levels.
As noted in the NVIDIA blog post[1], dual issue is apparently more
limited on those architectures, such as only being able to issue one
load and one arithmetic operation at the same time (and SASS examination
suggests there might be additional restrictions, e.g. it looks like in
some cases 128-bit loads can't be dual-issued). This is not a problem,
since the performance model is optimistic anyway.
Hence, this patch changes the performance model to ignore the
Programming Guide and assume dual issue for those architectures.
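To illustrate why the issue width matters for the bound, here is a
minimal sketch, not the model's actual code. The function name, its
parameters, and the default of 4 schedulers per multiprocessor are
assumptions made for this example only.

```python
import math


def issue_bound_cycles(instructions_per_warp, warps,
                       schedulers=4, issue_width=1):
    """Optimistic lower bound on the cycles needed to issue a kernel's
    instructions on one multiprocessor.

    Hypothetical helper: `schedulers` and `issue_width` are illustrative
    parameters, not names taken from the performance model.
    """
    total_instructions = instructions_per_warp * warps
    # Each scheduler issues up to `issue_width` instructions per cycle.
    return math.ceil(total_instructions / (schedulers * issue_width))


# Single issue, as the Programming Guide describes 5.x and later.
single = issue_bound_cycles(1000, 8, issue_width=1)
# Dual issue, as assumed by this patch.
dual = issue_bound_cycles(1000, 8, issue_width=2)
print(single, dual)
```

The dual-issue bound is half the single-issue one. It ignores the
pairing restrictions mentioned above, which is acceptable only because
the bound is meant to be optimistic.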
1: https://devblogs.nvidia.com/5-things-you-should-know-about-new-maxwell-gpu-architecture/