kostrzewa closed this issue 2 years ago.
Just to make sure I repeated with 605e92c83a32b68cd8f889b69aee8c66172b5010 and the issue still persists.
I am unable to reproduce this on my dev machine with dual GTX-1060 (and hence without inter-node comms) on a 16c32 lattice. There is some difference (0.42 seconds vs. 0.52 seconds) but nowhere near the factor of 5 observed on the 64c128 lattice on 4 Juwels Booster nodes.
Sorry to hear of the issue. Can you also include the profile.tsv files? (perhaps you already have, I’m just trying to get my ducks in a row for when I’m at my computer 🙃)
I had a look this evening. Nothing sticks out looking at the tune caches. Hopefully the profile.tsv files will be more revealing in this regard.
I had not let the jobs run all the way through, so I didn't actually have the profile files. I attach them below together with the corresponding tunecaches.
bad: tsv_b50f16a55.zip
good: tsv_be1766a51.zip
Thanks for the profiles.
There doesn't seem to be any performance regression; however, I can see from the profiles that significantly more iterations are being spent on the intermediate grid. E.g., from the first line of the profiles, the number of coarse dslash applications on the intermediate grid increases from ~37k to ~41k.
One unexpected thing I see from looking at the profiles is that even-odd preconditioning doesn't seem to be applied at all on the coarsest grids. Is that intentional for some reason, or just an oversight?
I wonder if I can reproduce this issue using the multigrid_evolve_test internal to QUDA....
Can you give me instructions for how to reproduce this issue using tmLQCD as the driver? That might be the best way forward here.
Thanks for the profiles.
Thanks for taking a look!
One unexpected thing I see from looking at the profiles is that even-odd preconditioning doesn't seem to be applied at all on the coarsest grids. Is that intentional for some reason, or just an oversight?
Darn, that's an artefact from a test setup that I had in the input file. As a result, most of the MG inversions come from a correlator measurement at the end of the trajectory rather than from the HMC. For (unrelated) testing reasons, I had EO-precon turned off in the measurement... Rerunning now with the measurements turned off, sorry for wasting your time like this.
I wonder if I can reproduce this issue using the multigrid_evolve_test internal to QUDA....
It's on my TODO list to see what happens there. I don't see a reason why it should not be possible, since at least with tmLQCD as a driver the only thing that seems to be affected is updateMultigridQuda. I just don't have these versions of QUDA compiled with tests enabled, and it always takes me a while to translate between the tmLQCD input file and the QUDA command line arguments.
Can you give me instructions for how to reproduce this issue using tmLQCD as the driver? That might be the best way forward here.
Absolutely. Since I had to do this recently for an (unrelated, I think) problem for JSC support (our HMC apparently triggers a CPU bug which takes down nodes...) I can prepare a pre-packaged QUDA+tmLQCD bundle complete with a prepared bootstrapping script for compilation, jobscripts and a thermalised starting configuration.
One unexpected thing I see from looking at the profiles is that even-odd preconditioning doesn't seem to be applied at all on the coarsest grids. Is that intentional for some reason, or just an oversight?
Darn, that's an artefact from a test setup that I had in the input file. As a result, most of the MG inversions come from a correlator measurement at the end of the trajectory rather than from the HMC. For (unrelated) testing reasons, I had EO-precon turned off in the measurement... Rerunning now with the measurements turned off, sorry for wasting your time like this.
Actually, this might be a genuine surprise after all, since the profile from the run without the correlator measurement at the end looks very similar to the one with the measurement. Can I ask how you see this from the profile?
The reason that I say that it's surprising is that in the HMC,
1) our outer solve type is inv_param.solve_type = QUDA_DIRECT_PC_SOLVE,
2) in the instance of QudaInvertParam for the MG metadata we set
mg_inv_param->solve_type = QUDA_DIRECT_SOLVE;
mg_inv_param->solution_type = QUDA_MAT_SOLUTION;
which should be correct according to https://github.com/lattice/quda/wiki/Multigrid-Solver,
and then 3) the coarse solution type is
mg_param->coarse_grid_solution_type[level] = (level == 0 && inv_param.solve_type == QUDA_DIRECT_SOLVE) ?
QUDA_MAT_SOLUTION :
QUDA_MATPC_SOLUTION;
such that with the present setup, coarse_grid_solution_type should be QUDA_MATPC_SOLUTION on all levels. For the smoother we set
mg_param->smoother_solve_type[level] = QUDA_DIRECT_PC_SOLVE;
and we also set
mg_inv_param->matpc_type = QUDA_MATPC_ODD_ODD;
(These settings are collected in the sketch below.)
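A hedged sketch of the above against the standard QUDA C interface (not a verbatim excerpt from tmLQCD; the function name and argument passing are made up for illustration):

```c
#include <quda.h>

/* Sketch of the solve/solution-type combination described above
 * (for illustration only, not copied from tmLQCD). */
void sketch_mg_param_setup(QudaInvertParam *inv_param, QudaInvertParam *mg_inv_param,
                           QudaMultigridParam *mg_param)
{
  /* 1) outer solve type */
  inv_param->solve_type = QUDA_DIRECT_PC_SOLVE;

  /* 2) MG metadata */
  mg_inv_param->solve_type = QUDA_DIRECT_SOLVE;
  mg_inv_param->solution_type = QUDA_MAT_SOLUTION;
  mg_inv_param->matpc_type = QUDA_MATPC_ODD_ODD;
  mg_param->invert_param = mg_inv_param;

  /* 3) coarse-grid solution type and smoother solve type per level */
  for (int level = 0; level < mg_param->n_level; level++) {
    /* with inv_param->solve_type == QUDA_DIRECT_PC_SOLVE this evaluates
     * to QUDA_MATPC_SOLUTION on every level */
    mg_param->coarse_grid_solution_type[level] =
      (level == 0 && inv_param->solve_type == QUDA_DIRECT_SOLVE) ? QUDA_MAT_SOLUTION : QUDA_MATPC_SOLUTION;
    mg_param->smoother_solve_type[level] = QUDA_DIRECT_PC_SOLVE;
  }
}
```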
The way we set up and call the solver is a bit all over the place for historical reasons. In the HMC, this is the entry point in the calculation of the fermionic force for the degenerate doublet: https://github.com/etmc/tmLQCD/blob/1274c22074db2f7c476401de6f3b0dc6294e3dce/quda_interface.c#L2075 and then this super-convoluted construct https://github.com/etmc/tmLQCD/blob/1274c22074db2f7c476401de6f3b0dc6294e3dce/quda_interface.c#L1327 sets the parameters and generates the MG setup if required. As far as I can tell, what I said above in https://github.com/lattice/quda/issues/1287#issuecomment-1148546141 should be correct.
Updated profiles and tunecaches. I've included the log files (not that they will necessarily be immediately useful to you at this stage).
Good: tsv_out_be1766a51.zip Bad: tsv_out_b50f16a55.zip
Can you give me instructions for how to reproduce this issue using tmLQCD as the driver? That might be the best way forward here.
Absolutely. Since I had to do this recently for an (unrelated, I think) problem for JSC support (our HMC apparently triggers a CPU bug which takes down nodes...) I can prepare a pre-packaged QUDA+tmLQCD bundle complete with a prepared bootstrapping script for compilation, jobscripts and a thermalised starting configuration.
Should I deposit this on nvsocal2 when I've put it together?
@kostrzewa depositing on nvsocal2 is perfect, thanks. Today is a bit of a disaster so I probably won’t get to looking at the above tunecache outputs until tomorrow.
I may be seeing this on another system.... I will go back to the 'before' version and recheck.
So a small update here: I've been chasing some issues in the computeVUV kernel (see #1291 and #1292). In the meantime, @kostrzewa, if you can get your tmLQCD reproducer set up for me that would be great. Thanks.
Thanks for the update @maddyscientist. Collecting the reproducer right now, the week was a bit full and I hadn't found the time yet, sorry about that.
No worries @kostrzewa, we’re all busy people.
I did wonder if you might be able to narrow down the bad commit range a bit? While I was hunting for the issue in #1291, I did find that most of the commits in the two-step tuning PR development (#1260) were broken for multi-GPU, so if you avoided those with git bisect skip we should get the range down significantly. (In retrospect we should have squashed that PR into a single commit.)
@maddyscientist reproducer is landing on nvsocal2 -> /scratch/tmLQCD_updateMultigridQuda_regression.tar
Let me know if things are unclear in the README.md
I did wonder if you might be able to narrow down the bad commit range a bit?
I'm afraid not. I did two git bisect exercises last weekend with some 6-8 steps each and had to skip quite a few commits to find the "last good" and "earliest bad" commits reported above. My first git bisect was all the way back to 227ff8c8 ;)
@maddyscientist When you're testing: don't be surprised about the number of MG iterations coming from the heatbath of the cloverdetratio2light monomial. This is a problem that we still need to figure out and has to do with how to properly re-use the MG setup for systems with large (tm_rho) shifts. Since it has a small impact on overall performance, it's still at the back of my TODO list but will definitely be tackled.
Thanks for the comprehensive instructions @kostrzewa. I see the lattice size here is 64^3x128: does the issue occur on more manageable problem sizes, e.g., one that I can run on my 3x GV100 workstation? A 32^3x64 test for example?
Thanks for the comprehensive instructions @kostrzewa. I see the lattice size here is 64^3x128: does the issue occur on more manageable problem sizes, e.g., one that I can run on my 3x GV100 workstation? A 32^3x64 test for example?
Testing with a 32c64 now and will upload another .tar to nvsocal2 once the test is through.
I'm afraid I can't reproduce it without inter-node comms but I'll upload it anyway, maybe if a particular comms policy is chosen, it will trigger the issue even on a single node.
-> updateMultigridQuda_regression_Jun_2022_cA211_30_32.tar
That’s good data in itself that it’s internode comms triggered.
@mathiaswagner is there anything in your recent nvshmem PR that could lead to overhead somehow for non-NVSHMEM?
@kostrzewa on the smaller lattice if you force it to be inter node does it trigger?
@kostrzewa on the smaller lattice if you force it to be inter node does it trigger?
Indeed it appears so -> nvsocal2:/scratch/juwboost_2_nodes_1gpu_per_node_updateMultigridQuda_regression_Jun_2022_cA211_30_32.tar
left: b50f16a55 (bad) right: be1766a51 (good)
On the left, the solver itself appears to be about 10% slower, but this might be a fluctuation, since these were not the same nodes.
The call to updateMultigridQuda
is slower by more than a factor of 10, however.
[...]
# TM_QUDA: Updating MG Preconditioner Setup for gauge_id: 0.041667 # TM_QUDA: Updating MG Preconditioner Setup for gauge_id: 0.041667
# TM_QUDA: Time for MG_Preconditioner_Setup_Update 6.865442e+00 s level: 4 proc_id: 0 /HMC/clove | # TM_QUDA: Time for MG_Preconditioner_Setup_Update 4.841810e-01 s level: 4 proc_id: 0 /HMC/clove
# TM_QUDA: Time for reorder_spinor_eo_toQuda 2.635404e-02 s level: 4 proc_id: 0 /HMC/cloverdetra | # TM_QUDA: Time for reorder_spinor_eo_toQuda 1.695434e-02 s level: 4 proc_id: 0 /HMC/cloverdetra
GCR: Convergence at 29 iterations, L2 relative residual: iterated = 2.212307e-10, true = 2.21230 | GCR: Convergence at 29 iterations, L2 relative residual: iterated = 2.065740e-10, true = 2.06574
# TM_QUDA: Time for invertQuda 9.237861e-01 s level: 4 proc_id: 0 /HMC/cloverdetratio3light:clov | # TM_QUDA: Time for invertQuda 8.580463e-01 s level: 4 proc_id: 0 /HMC/cloverdetratio3light:clov
# TM_QUDA: Qp solve done: 29 iter / 0.843115 secs = 2956.07 Gflops | # TM_QUDA: Qp solve done: 29 iter / 0.770503 secs = 3279.17 Gflops
# TM_QUDA: Time for reorder_spinor_eo_fromQuda 1.633094e-02 s level: 4 proc_id: 0 /HMC/cloverdet | # TM_QUDA: Time for reorder_spinor_eo_fromQuda 1.623341e-02 s level: 4 proc_id: 0 /HMC/cloverdet
# TM_QUDA: Time for mul_gamma5 1.714020e-04 s level: 4 proc_id: 0 /HMC/cloverdetratio3light:clov | # TM_QUDA: Time for mul_gamma5 2.321620e-04 s level: 4 proc_id: 0 /HMC/cloverdetratio3light:clov
# TM_QUDA: Time for reorder_spinor_eo_toQuda 1.702367e-02 s level: 4 proc_id: 0 /HMC/cloverdetra | # TM_QUDA: Time for reorder_spinor_eo_toQuda 1.781010e-02 s level: 4 proc_id: 0 /HMC/cloverdetra
# TM_QUDA: Time for updateMultigridQuda_sign_flip 6.808646e+00 s level: 4 proc_id: 0 /HMC/clover | # TM_QUDA: Time for updateMultigridQuda_sign_flip 5.052035e-01 s level: 4 proc_id: 0 /HMC/clover
GCR: Convergence at 31 iterations, L2 relative residual: iterated = 2.405399e-10, true = 2.40539 | GCR: Convergence at 31 iterations, L2 relative residual: iterated = 2.408186e-10, true = 2.40818
# TM_QUDA: Time for invertQuda 1.058032e+00 s level: 4 proc_id: 0 /HMC/cloverdetratio3light:clov | # TM_QUDA: Time for invertQuda 1.031578e+00 s level: 4 proc_id: 0 /HMC/cloverdetratio3light:clov
# TM_QUDA: Time for reorder_spinor_eo_fromQuda 2.990364e-02 s level: 4 proc_id: 0 /HMC/cloverdet | # TM_QUDA: Time for reorder_spinor_eo_fromQuda 1.599799e-02 s level: 4 proc_id: 0 /HMC/cloverdet
# TM_QUDA: QpQm solve done: 31 iter / 1.0057 secs = 3046.22 Gflops | # TM_QUDA: QpQm solve done: 31 iter / 0.942064 secs = 3251.93 Gflops
[...]
Some more detective work: the regression was introduced by one of these pull requests:
@kostrzewa I know this is a bit of a pain, but could you possibly test the merge points associated with these merges for the first four PRs above? These should all be stable. This will help immensely with narrowing this down.
Regarding running the tmLQCD reproducer. I ran into problems at the linking stage with tmLQCD:
/usr/bin/ld: ./lib/libhmc.a(update_momenta_fg.o): in function `update_momenta_fg':
update_momenta_fg.c:(.text+0x188): undefined reference to `calculate_fg'
collect2: error: ld returned 1 exit status
/install_dir/libmake: *** Waiting for unfinished jobs....
/usr/bin/ld: ./lib/libhmc.a(update_momenta_fg.o): in function `update_momenta_fg':
update_momenta_fg.c:(.text+0x188): undefined reference to `calculate_fg'
collect2: error: ld returned 1 exit status
make: *** [Makefile:136: offline_measurement] Error 1
/usr/bin/ld: ./lib/libhmc.a(update_momenta_fg.o): in function `update_momenta_fg':
update_momenta_fg.c:(.text+0x188): undefined reference to `calculate_fg'
collect2: error: ld returned 1 exit status
make: *** [Makefile:136: hmc_tm] Error 1
/usr/bin/ld: ./lib/libhmc.a(update_momenta_fg.o): in function `update_momenta_fg':
update_momenta_fg.c:(.text+0x188): undefined reference to `calculate_fg'
collect2: error: ld returned 1 exit status
make: *** [Makefile:136: invert] Error 1
@kostrzewa I know this is a bit of a pain, but could you possibly test the merge points associated with these merges for the first four PRs above? These should all be stable. This will help immensely with narrowing this down.
Sure, this won't be much of a problem. With the 32c64 test these can be run on the development partition so they go through quickly. I might get around to it this weekend.
Regarding running the tmLQCD reproducer. I ran into problems at the linking stage with tmLQCD:
Ugh, our "build system" is really the gift that keeps on giving. There are link order problems basically on every other machine / compiler version that we try it on but I simply haven't found the time to switch us over to CMake or a real autotools build.
In this case I'm a bit surprised, however, since the calculate_fg
function is inline in the same translation unit as the function which cannot find it (update_momenta_fg.c
).
In this case I'm a bit surprised, however, since the calculate_fg function is inline in the same translation unit as the function which cannot find it (update_momenta_fg.c).
I literally hit this issue yesterday; it's related to this: https://stackoverflow.com/questions/19068705/undefined-reference-when-calling-inline-function ; https://stackoverflow.com/questions/16245521/c99-inline-function-in-c-file/16245669#16245669 . tl;dr: "proper" C99 requires that an inline function still has external exposure (which does manifest itself as an issue even when it's literally defined in the same translation unit); static inline gets around this, in very rough terms. GCC is the only compiler that does this "right", and icc and clang apparently don't really care (insofar as they do what the C++ standard does, which is what we're generally used to).
The easiest way I found to "deal" with it, at least to unblock myself, is to delete the inline until it matters.
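For reference, a minimal standalone illustration of the C99 behaviour described above, with made-up file and function names (not from tmLQCD):

```c
/* inline_demo.c -- with plain `inline` and e.g. `gcc -std=c99 -O0 inline_demo.c`,
 * GCC may emit a real call to `twice` while providing no external definition,
 * and linking fails with "undefined reference to `twice'" even though the
 * definition sits in the same translation unit as the caller.  Making it
 * `static inline` (or dropping `inline`) gives the function internal linkage
 * and the program links. */
#include <stdio.h>

static inline int twice(int x) { return 2 * x; }  /* plain `inline` here can fail to link */

int main(void)
{
  printf("%d\n", twice(21));
  return 0;
}
```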
@kostrzewa I know this is a bit of a pain, but could you possibly test the merge points associated with these merges for the first four PRs above? These should all be stable. This will help immensely with narrowing this down.
Just to make sure I've picked the right ones, I'm just now compiling commits:
The easiest way I found to "deal" with it, at least to unblock myself, is to delete the inline until it matters.
@weinbe2 I think that's the way to go, yes. In the case at hand it was a bit of a premature optimisation anyway, I think. The funny thing is that I've encountered these elsewhere in the codebase previously when switching GCC versions, but the current stack has been compiled as is with GCC 8.x, 9.x, 10.x and 11.x and it never complained :)
@maddyscientist I've removed this instance of inline in 8715f0a203b19f2c59af170717eb38aac6bbac61 of https://github.com/etmc/tmlqcd/tree/quda_work_add_actions. You should be able to just replace the tmLQCD source in the reproducer.
Those are the correct commits, thanks @kostrzewa.
One thing @mathiaswagner noticed about the job scripts for tmLQCD is that QUDA’s memory pools are being disabled. What’s the motivation for doing that? Especially disabling the pinned pool allocator could have a significant detrimental effect on performance.
One thing @mathiaswagner noticed about the job scripts for tmLQCD is that QUDA’s memory pools are being disabled. What’s the motivation for doing that? Especially disabling the pinned pool allocator could have a significant detrimental effect on performance.
We had (severe) issues with device memory usage when the device pool allocator was enabled:
https://github.com/etmc/tmLQCD/issues/507
This forced us to run on more GPUs than one would naively expect. At first we believed we were seeing some kind of memory leak, but then it turned out that this was the behaviour of the device pool memory allocator. I then added a mechanism to tmLQCD which disables it by default (by setting the environment variable), with the option of enabling it through the input file.
We never saw any performance degradation, but perhaps today is the day?
In particular, the behaviour described in https://github.com/etmc/tmLQCD/issues/507#issuecomment-1017675188 (QUDA_ENABLE_DEVICE_MEMORY_POOL=0) and in https://github.com/etmc/tmLQCD/issues/507#issuecomment-1017950206 (QUDA_ENABLE_DEVICE_MEMORY_POOL=1) suggests that after a while, memory usage with the device pool allocator settles down, but memory usage is a bit too high for our purposes.
Of course, we could easily switch to QUDA_ENABLE_PINNED_MEMORY_POOL=1 as a default without hitting these issues with device memory usage, if I understand correctly.
It’s probably fine to run without the device allocator. There is some performance effect from this but most allocations in the mg solver are persistent so it should be minimized. The general practice I follow is to enable it unless pushed for memory and only disable it if needed.
The pinned memory pool should never be disabled though unless one is debugging for some reason. On some systems allocating pinned host memory can be expensive, though perhaps it’s not a problem on Booster.
The pinned memory pool should never be disabled though unless one is debugging for some reason. On some systems allocating pinned host memory can be expensive, though perhaps it’s not a problem on Booster.
This is good to know, I'll change the default in tmLQCD. Thanks!
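For what it's worth, the change of default amounts to setting these environment variables before QUDA is initialised; a rough sketch of how that could look on the tmLQCD side (the helper and its flag are hypothetical, not the actual tmLQCD code):

```c
#include <stdlib.h>

/* Hypothetical helper: keep the pinned host memory pool enabled, and only
 * enable the device memory pool when the input file asks for it (see
 * tmLQCD issue #507 for the memory-usage problems that motivated disabling
 * the device pool by default).  Must run before initQuda() reads the
 * environment. */
static void set_quda_pool_defaults(int enable_device_pool)
{
  setenv("QUDA_ENABLE_PINNED_MEMORY_POOL", "1", 0); /* don't override an explicit user setting */
  setenv("QUDA_ENABLE_DEVICE_MEMORY_POOL", enable_device_pool ? "1" : "0", 0);
}
```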
I've added jobs to today's set of tests with either or both pools enabled and don't see a significant effect on performance (in particular for the test with the slow updateMultigridQuda below). Here are the results:
| | https://github.com/lattice/quda/commit/00656fb63cf33e4b2fa559ec238d1cdf6ba8de5c | https://github.com/lattice/quda/commit/0123a9e1d9422ee8b6b3b1a48307b881a6cac53b | https://github.com/lattice/quda/commit/8e245159ebba8febfb8e493d1e3a2e98cfd7fbbb | https://github.com/lattice/quda/commit/2f9d800e47ff80a1d5986fcf20767c561928d4b9 | https://github.com/lattice/quda/commit/345b3335c7b3c8d61812c608730bc2696679ef04 |
|---|---|---|---|---|---|
| updateMultigridQuda time | 6.8 s | 0.48 s | 0.48 s | 0.49 s | 0.48 s |
| inversion time | 0.8 s | 0.78 s | 0.8 s | 0.8 s | 0.75 s |
It seems that the culprit might be some special interaction of https://github.com/lattice/quda/commit/00656fb63cf33e4b2fa559ec238d1cdf6ba8de5c with the particular configuration of UCX / OpenMPI on the Booster perhaps?
Yes, it seems to be caused by #1228. I don’t think it’s an OpenMPI induced issue since @bjoo saw a similar issue on Perlmutter which I assume is using Cray MPI(?)
@mathiaswagner since this seems to be associated with your PR can you take the lead on understanding this regression?
And thanks @kostrzewa for running these tests. Very helpful for narrowing this down.
Thanks for pinning this down. As this thread has become quite long and I joined late, a few questions:
There are no correctness issues, "just" a performance degradation? Is the regression only in updateMultigridQuda or does it show up anywhere else?
No correctness issues, and there are no apparent regressions in the tunecaches or profiles.
The tsv files are included above https://github.com/lattice/quda/issues/1287#issuecomment-1148567556
@mathiaswagner @maddyscientist Thanks for looking into it! I will check with my colleagues to see if we can find a moment to test the behaviour also on Marconi 100, just to make sure that it's not a PEBKAC issue on my end (although if Balint also sees this, I guess we'll likely find the same). I don't see any indication that the interface to updateMultigridQuda changed with the nvshmem merge (looking at the definitions of QudaMultigridParam and updateMultigridQuda itself).
There are no correctness issues, "just" a performance degradation? Is the regression only in updateMultigridQuda or does it show up anywhere else?
As far as I can tell, only in updateMultigridQuda. I haven't seen correctness problems.
I've uploaded the outputs of the last round of tests to nvsocal2:/scratch/juwboost_2_nodes_1gpu_per_node_updateMultigridQuda_regression_Jun_2022_pinpoint.tar.gz. These also include all tsv files of these particular tests, so you can better relate them to the table in https://github.com/lattice/quda/issues/1287#issuecomment-1159514762.
I performed the same test on Marconi 100, and I also observed that 00656fb was slower than be1766a in setting up the multigrid and in the inversion itself:
quda be1766a
# TM_QUDA: Time for MG_Preconditioner_Setup 7.883397e+01 s level: 4 proc_id: 0 /HMC/cloverdetratio2light:cloverdetratio_heatbath/solve_degenerate/invert_eo_degenerate_quda/MG_Preconditioner_Setup
# TM_QUDA: Time for invertQuda 2.569772e+01 s level: 4 proc_id: 0 /HMC/cloverdetratio2light:cloverdetratio_heatbath/solve_degenerate/invert_eo_degenerate_quda/invertQuda
# TM_QUDA: Time for invertQuda 1.499274e+00 s level: 4 proc_id: 0 /HMC/cloverdetratio3light:cloverdetratio_heatbath/solve_degenerate/invert_eo_degenerate_quda/invertQuda
# TM_QUDA: Time for MG_Preconditioner_Setup_Update 9.923218e-01 s level: 4 proc_id: 0 /HMC/cloverdetratio2light:cloverdetratio_derivative/solve_degenerate/invert_eo_degenerate_quda/MG_Preconditioner_Setup_Update
# TM_QUDA: Time for invertQuda 1.722216e+00 s level: 4 proc_id: 0 /HMC/cloverdetratio2light:cloverdetratio_derivative/solve_degenerate/invert_eo_degenerate_quda/invertQuda
quda 00656fb
# TM_QUDA: Time for MG_Preconditioner_Setup 1.255415e+02 s level: 4 proc_id: 0 /HMC/cloverdetratio2light:cloverdetratio_heatbath/solve_degenerate/invert_eo_degenerate_quda/MG_Preconditioner_Setup
# TM_QUDA: Time for invertQuda 3.789582e+01 s level: 4 proc_id: 0 /HMC/cloverdetratio2light:cloverdetratio_heatbath/solve_degenerate/invert_eo_degenerate_quda/invertQuda
# TM_QUDA: Time for invertQuda 3.040175e+00 s level: 4 proc_id: 0 /HMC/cloverdetratio3light:cloverdetratio_heatbath/solve_degenerate/invert_eo_degenerate_quda/invertQuda
# TM_QUDA: Time for MG_Preconditioner_Setup_Update 3.156509e+00 s level: 4 proc_id: 0 /HMC/cloverdetratio2light:cloverdetratio_derivative/solve_degenerate/invert_eo_degenerate_quda/MG_Preconditioner_Setup_Update
# TM_QUDA: Time for invertQuda 2.764697e+00 s level: 4 proc_id: 0 /HMC/cloverdetratio2light:cloverdetratio_derivative/solve_degenerate/invert_eo_degenerate_quda/invertQuda
The test was done in
/m100_work/INF22_lqcd123_0/mgarofal/runs/m100_2_nodes_1gpu_per_node_updateMultigridQuda_regression_Jun_2022_cA211_30_32
in the two subdirectories
tmLQCD_164c415dd68faa22d34feb0ea514220413b9c448_quda_00656fb63cf33e4b2fa559ec238d1cdf6ba8de5c
tmLQCD_164c415dd68faa22d34feb0ea514220413b9c448_quda_be1766a519995f8c16b9541c300f883efabc3df8
good: be1766a_profile_0.zip bad: 00656fb_profile_0.zip
Thanks @Marcogarofalo. So on Volta + Power9 we appear to have performance regressions not only in updateMultigridQuda, but also in the MG solver itself.
Could you please check if the CG for cloverdetratio1light was also slower?
Finally, just to make sure: there was no tuning here, correct? These were the second runs (or second trajectories), where everything had already been tuned, right?
Hi, the CG used in cloverdetratio1light has the same speed:
quda be1766a
# TM_QUDA: Time for invertQuda 5.287022e-01 s level: 4 proc_id: 0 /HMC/cloverdetratio1light:cloverdetratio_heatbath/solve_degenerate/invert_eo_degenerate_quda/invertQuda
# TM_QUDA: Time for invertQuda 2.109095e+00 s level: 4 proc_id: 0 /HMC/cloverdetratio1light:cloverdetratio_derivative/solve_degenerate/invert_eo_degenerate_quda/invertQuda
# TM_QUDA: Time for invertQuda 1.758500e+00 s level: 4 proc_id: 0 /HMC/cloverdetratio1light:cloverdetratio_derivative/solve_degenerate/invert_eo_degenerate_quda/invertQuda
# TM_QUDA: Time for invertQuda 2.057996e+00 s level: 4 proc_id: 0 /HMC/cloverdetratio1light:cloverdetratio_derivative/solve_degenerate/invert_eo_degenerate_quda/invertQuda
# TM_QUDA: Time for invertQuda 2.142247e+00 s level: 4 proc_id: 0 /HMC/cloverdetratio1light:cloverdetratio_derivative/solve_degenerate/invert_eo_degenerate_quda/invertQuda
quda 00656fb
# TM_QUDA: Time for invertQuda 5.472805e-01 s level: 4 proc_id: 0 /HMC/cloverdetratio1light:cloverdetratio_heatbath/solve_degenerate/invert_eo_degenerate_quda/invertQuda
# TM_QUDA: Time for invertQuda 2.132907e+00 s level: 4 proc_id: 0 /HMC/cloverdetratio1light:cloverdetratio_derivative/solve_degenerate/invert_eo_degenerate_quda/invertQuda
# TM_QUDA: Time for invertQuda 2.027382e+00 s level: 4 proc_id: 0 /HMC/cloverdetratio1light:cloverdetratio_derivative/solve_degenerate/invert_eo_degenerate_quda/invertQuda
# TM_QUDA: Time for invertQuda 2.224253e+00 s level: 4 proc_id: 0 /HMC/cloverdetratio1light:cloverdetratio_derivative/solve_degenerate/invert_eo_degenerate_quda/invertQuda
# TM_QUDA: Time for invertQuda 1.733319e+00 s level: 4 proc_id: 0 /HMC/cloverdetratio1light:cloverdetratio_derivative/solve_degenerate/invert_eo_degenerate_quda/invertQuda
Finally, just to make sure: there was no tuning here, correct? These were the second runs (or second trajectories), where everything had already been tuned, right?
I think the tuning was already done; I checked again with a second run and observed similar performance.
Hi, the CG used in cloverdetratio1light has the same speed
Thanks! I was asking because of the ~ factor of 2 in the MG setup.
I think the tuning was already done; I checked again with a second run and observed similar performance.
Excellent. In that case we can be pretty sure that we are indeed observing a regression also in the MG solver (at least on Volta + Power9).
@kostrzewa / @Marcogarofalo I found an issue introduced with #1228 and have a fix in the hotfix/shmem_exchangeghost_regression branch. Do you mind giving that a spin?
I don't have a good reproducer for your issue yet, but I noticed this while re-reviewing the code changes.
Do you mind giving that a spin?
Compiling now on Juwels Booster. I should be able to get back to you in a few hours.
hotfix/shmem_exchangeghost_regression does the trick on Juwels Booster. In fact, updateMultigridQuda is about 20% faster now for the 32c64 test case on two GPUs. Thanks!
Excellent news 😄
Excellent!!! I shall pull it 😊
Best B
I've observed a (rather substantial) performance regression in updateMultigridQuda sometime after commit be1766a519995f8c16b9541c300f883efabc3df8. The earliest 'bad' commit (which I could compile and which ran without erroring out) that I could identify is b50f16a55d2eaae816c00812b3996138c5c4dfa9. Running on four nodes of Juwels Booster on a 64c128 lattice at the physical point, calling updateMultigridQuda (without refresh iterations):
good: tunecache_be1766a51.tsv.zip
When it performs like this, the update scales reasonably:
bad: tunecache_b50f16a55.tsv.zip
This poorly performing behaviour doesn't scale. With commit 9bae409b742765563b28939e93fa5c43b03da20b (which happened to be the develop head commit when I started observing this), I have:
This suggests perhaps a constant factor.
Testing reproduction on a different machine now.
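For completeness, the call pattern under discussion looks roughly like this (a sketch assuming the standard QUDA multigrid interface; the driver-side names are made up and this is not the actual tmLQCD code):

```c
#include <quda.h>

/* Sketch: the MG setup is built once and then only *updated* after each
 * gauge-field change during the trajectory; the regression reported here
 * is in the updateMultigridQuda() calls, not in the surrounding solves. */
void hmc_trajectory_sketch(void *gauge, QudaGaugeParam *gauge_param,
                           QudaMultigridParam *mg_param, QudaInvertParam *inv_param,
                           void *x, void *b, int n_md_steps)
{
  loadGaugeQuda(gauge, gauge_param);
  void *mg_preconditioner = newMultigridQuda(mg_param); /* full setup, done once */
  inv_param->preconditioner = mg_preconditioner;

  for (int step = 0; step < n_md_steps; step++) {
    /* ... gauge field evolves on the host side ... */
    loadGaugeQuda(gauge, gauge_param);
    updateMultigridQuda(mg_preconditioner, mg_param); /* <-- the slow call */
    invertQuda(x, b, inv_param);                      /* MG-preconditioned solve */
  }

  destroyMultigridQuda(mg_preconditioner);
}
```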