Open weinbe2 opened 3 years ago
Just to note that `computeCoarseClover` already has fine-grained parallelism, and that #1050 significantly improves the performance of this kernel, although it does not yet reformulate it using the matrix-tiling abstraction.
The routine `computeCoarseClover` (https://github.com/lattice/quda/blob/develop/include/kernels/coarse_op_kernel.cuh#L1014) does not exploit a huge amount of parallelism as implemented, which makes autotuning a bit of a nightmare and could be a blocker in use cases where not coarsening the preconditioned op is desirable.