The level of register tiling is controlled by the CMake parameter QUDA_MAX_MULTI_RHS_TILE, with the default left at 1 for now.
This feature will be further developed in subsequent PRs.
(Although not included in this PR, it's straightforward to extend this support to other stencils.)
Fixes performance regressions of the MMA dslash when the memory pool is switched off
FieldTmp now supports creating temporaries from parameters, as opposed to from another field instance
We use this to create the temporary used for the reordered quark fields
Adds a possible workaround (WAR) for performance regressions with ROCm
Various fixes for nvc++ compilation
Adds an alternative sentinel for heterogeneous reductions for the case where the compiler optimizes away non-finite math (enabled with QUDA_HETEROGENEOUS_ATOMIC_INF_INIT=OFF). This is not a problem by default, but it is with the latest clang when compiling with -Ofast.
Fix various compiler warnings with more recent compilers, e.g., gcc-15
This PR is a bit of a catch-all.
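
To illustrate what register tiling over multiple right-hand sides (RHS) means, here is a minimal host-side C++ sketch. It is not QUDA's kernel code: the 1-d periodic stencil, the `Field` alias, and the `stencil_tiled` function are hypothetical, and the `Tile` template parameter only plays the role that QUDA_MAX_MULTI_RHS_TILE is described as controlling. The idea shown is that the per-site coefficients are loaded once and reused for every RHS in the tile.

```cpp
#include <array>
#include <cstddef>
#include <vector>

using Field = std::vector<double>; // hypothetical stand-in for a lattice field

// Apply out[r][x] = fwd[x] * in[r][x+1] + bwd[x] * in[r][x-1] (periodic)
// to all right-hand sides r, processing them in tiles of up to Tile at a time.
// Tile is an illustrative analogue of QUDA_MAX_MULTI_RHS_TILE.
template <std::size_t Tile>
void stencil_tiled(std::vector<Field> &out, const std::vector<Field> &in,
                   const Field &fwd, const Field &bwd)
{
  const std::size_t n_rhs = in.size();
  const std::size_t n = fwd.size();
  for (std::size_t r0 = 0; r0 < n_rhs; r0 += Tile) {
    // the last tile may be partial if n_rhs is not a multiple of Tile
    const std::size_t tile = (n_rhs - r0 < Tile) ? n_rhs - r0 : Tile;
    for (std::size_t x = 0; x < n; x++) {
      // coefficients are loaded once per site and reused across the tile
      const double f = fwd[x], b = bwd[x];
      const std::size_t xp = (x + 1) % n, xm = (x + n - 1) % n;
      std::array<double, Tile> acc{}; // small fixed-size accumulator ("registers")
      for (std::size_t t = 0; t < tile; t++)
        acc[t] = f * in[r0 + t][xp] + b * in[r0 + t][xm];
      for (std::size_t t = 0; t < tile; t++) out[r0 + t][x] = acc[t];
    }
  }
}
```

With `Tile = 1` this reduces to the untiled loop, which mirrors the default described above; larger tiles trade register pressure for fewer coefficient loads.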
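
The sentinel issue can be sketched as follows. This is an assumption-laden illustration, not QUDA's implementation: the names `sentinel` and `wait_for_result` are hypothetical, and the choice of `std::numeric_limits<T>::lowest()` as the finite sentinel is an assumption made only for this example. The host polls a result slot until it no longer holds the sentinel; if the sentinel is an infinity and the compiler assumes finite math (as `-Ofast` permits), the comparison against it may be folded away, so a finite sentinel is used instead.

```cpp
#include <atomic>
#include <limits>

// Sentinel written to the result slot before the reduction is launched.
// use_inf loosely mirrors QUDA_HETEROGENEOUS_ATOMIC_INF_INIT (ON -> true).
// The finite choice below is an assumption made for this sketch.
template <typename T, bool use_inf> constexpr T sentinel()
{
  if constexpr (use_inf)
    return -std::numeric_limits<T>::infinity(); // unsafe under finite-math assumptions
  else
    return std::numeric_limits<T>::lowest(); // finite, so the comparison survives
}

// Spin until the producer has overwritten the sentinel with a real result.
template <typename T, bool use_inf> T wait_for_result(const std::atomic<T> &slot)
{
  T v = slot.load(std::memory_order_acquire);
  while (v == sentinel<T, use_inf>()) v = slot.load(std::memory_order_acquire);
  return v;
}
```

The trade-off of a finite sentinel is that a genuine reduction result equal to that exact value would never be observed as complete, which is why an otherwise unreachable value is chosen here.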