Add fine-grained parallelism + matrix tiling to computeCoarseClover

lattice / quda

QUDA is a library for performing calculations in lattice QCD on GPUs.

https://lattice.github.io/quda

Add fine-grained parallelism + matrix tiling to computeCoarseClover #1038

Open weinbe2 opened 3 years ago

weinbe2 commented 3 years ago

The routine computeCoarseClover (https://github.com/lattice/quda/blob/develop/include/kernels/coarse_op_kernel.cuh#L1014) does not exploit much parallelism as implemented. This makes autotuning a bit of a nightmare and could be a blocker in use cases where not coarsening the preconditioned op is desirable.

maddyscientist commented 3 years ago

Just to note that computeCoarseClover already has fine-grained parallelism, and that #1050 significantly improves the performance of this kernel, although it does not yet reformulate it using the matrix-tiling abstraction.