Closed weinbe2 closed 1 year ago
~When building, I'm getting the following warning (CUDA 12.0 / GCC 11)~
/home/kate/github/quda-develop-old/tests/host_reference/hisq_force_reference.cpp: In instantiation of ‘su3_matrix* get_su3_matrix(int, su3_matrix*, int, int) [with su3_matrix = fsu3_matrix]’:
/home/kate/github/quda-develop-old/tests/host_reference/hisq_force_reference.cpp:109:37: required from ‘void computeLinkOrderedOuterProduct(su3_vector*, su3_matrix*, size_t, int) [with su3_vector = fsu3_vector; su3_matrix = fsu3_matrix; size_t = long unsigned int]’
/home/kate/github/quda-develop-old/tests/host_reference/hisq_force_reference.cpp:118:35: required from here
/home/kate/github/quda-develop-old/tests/host_reference/hisq_force_reference.cpp:87:63: warning: unused parameter ‘gauge_order’ [-Wunused-parameter]
87 | template <typename su3_matrix> su3_matrix *get_su3_matrix(int gauge_order, su3_matrix *p, int idx, int dir)
Edit: this was caused by me using a stale local copy of the branch. The warning is not present in HEAD.
Something I just noticed from testing is that the `hisq_paths_force_test` test will wrongly complain of failing if `--verify false` is passed. The correctness check should not be applied in this case.
https://github.com/lattice/quda/pull/1367/commits/9e9845e5827a35ac61349c89a718f0ad192014e7
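For readers following along, the behaviour requested above can be sketched as follows. This is a minimal illustration, not the code in the linked commit: `TestOptions`, `verify_results`, and `test_passes` are hypothetical names, assuming the test keeps a boolean that `--verify false` clears.

```cpp
#include <cmath>

// Hypothetical stand-in for the test's command-line state; not QUDA's actual API.
struct TestOptions {
  bool verify_results = true; // assumed to be cleared by a flag such as --verify false
};

// Returns true when the test should be reported as passing.
// With verification disabled, the reference comparison is skipped entirely
// instead of being evaluated against a reference that was never computed.
bool test_passes(const TestOptions &opts, double deviation, double tolerance)
{
  if (!opts.verify_results) return true; // nothing was checked, so nothing can fail
  return std::fabs(deviation) <= tolerance;
}
```

With `verify_results` set to false the function reports a pass regardless of `deviation`; with it set to true the usual tolerance comparison applies.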
@mathiaswagner are you wanting to review this before we merge?
I guess this has seen enough testing and I am not sure I'll have cycles this week so feel free to go ahead with the merge.
This PR constitutes a large refactor and optimization of the implementation of the fat/long force computation in QUDA. It introduces multiple different groups of optimizations:
Five- and seven-link terms
Three-link and Lepage terms
Ancillary detriments and benefits from fusion
`Qmu` field, which was ultimately a shifted version of the original `U` gauge link field, so the memory bloat from this PR is relatively minimal.
Gauge field compression
Testing and timing
General clean-up of FLOPS/bytes counts
There is still some remaining clean-up to be done in this PR, none of which block opening this PR sooner as opposed to later:
clang-format
With regard to gauge compression, there is still outstanding work to enable recon-12 for the second step of the HISQ force chain rule because the `U` field is an SU(3) field (as opposed to the first step featuring the U(3) `W` field). Since this corresponds to a marginal gain in memory savings (relative to enabling recon-13) at the expense of a decent amount of coding headaches, we're going to punt this to a subsequent PR.

Depending on the problem size and the geometry of the decomposition, this can lead to a ~30+% performance boost in the computation of the HISQ force. This is due to the large amount of kernel fusion (and the corresponding cache reuse from it) as well as the introduction of gauge reconstruction for the `U`/`W` field and a reduction of the depth of the halo that needs to be explicitly looped over as part of the computation.
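As background on the gauge reconstruction mentioned above: a 12-real-number reconstruction of an SU(3) link stores only the first two rows and rebuilds the third as the complex conjugate of their cross product. The following is a minimal host-side sketch of that identity, not QUDA's kernel code; `Row` and `reconstruct_third_row` are hypothetical names introduced for illustration.

```cpp
#include <array>
#include <complex>

using Complex = std::complex<double>;
using Row = std::array<Complex, 3>;

// Rebuild the third row of an SU(3) matrix from its first two rows a and b:
// c = conj(a x b). This holds only for SU(3) (unitary with unit determinant),
// so a U(3) field would need extra stored information beyond the two rows.
Row reconstruct_third_row(const Row &a, const Row &b)
{
  return {std::conj(a[1] * b[2] - a[2] * b[1]),
          std::conj(a[2] * b[0] - a[0] * b[2]),
          std::conj(a[0] * b[1] - a[1] * b[0])};
}
```

For example, for the diagonal SU(3) matrix diag(i, -i, 1), passing the first two rows returns (0, 0, 1) as the third row.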