The 4-d preconditioned chiral fermion dslashes presently use multiple kernels to apply the preconditioned dslash. This issue is to explore fusing these kernels. The motivation is twofold:
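Fusing will improve single-node performance, since it reduces memory traffic: intermediate results no longer have to be written out by one kernel and read back in by the next.
If the 4-d dslash part is fused with other operations, strong scaling will improve, since there will be more computation available to overlap with communication.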
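
As a toy illustration of the memory-traffic point, here is a minimal C++ sketch. It is not QUDA code: `apply_unfused` and `apply_fused` are hypothetical names, and a simple element-wise scale-then-add stands in for the real sequence of dslash kernels. The unfused version touches about 5N elements of memory (read `x`, write `tmp`, read `tmp`, read `y`, write `out`), while the fused version touches about 3N for the same result.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for two separate kernels: pass 1 scales x,
// pass 2 adds y. The intermediate tmp makes a round trip through memory.
std::vector<double> apply_unfused(const std::vector<double>& x,
                                  const std::vector<double>& y, double a) {
    std::vector<double> tmp(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        tmp[i] = a * x[i];        // "kernel 1": read x, write tmp
    }
    std::vector<double> out(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        out[i] = tmp[i] + y[i];   // "kernel 2": read tmp and y, write out
    }
    return out;
}

// Fused version: a single pass, with the intermediate held in a local
// variable (a register on a GPU) instead of being stored and reloaded.
std::vector<double> apply_fused(const std::vector<double>& x,
                                const std::vector<double>& y, double a) {
    std::vector<double> out(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        const double t = a * x[i];  // intermediate never leaves the loop body
        out[i] = t + y[i];
    }
    return out;
}
```

In the actual dslash case the fused operations are stencil and matrix applications rather than element-wise updates, so what can be fused, and how much traffic is saved, depends on the data dependencies between the kernels.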