kostrzewa closed this issue 2 years ago.
Just to make sure I repeated with 605e92c83a32b68cd8f889b69aee8c66172b5010 and the issue still persists.
I am unable to reproduce this on my dev machine with dual GTX-1060 (and hence without inter-node comms) on a 16c32 lattice. There is some difference (0.42 seconds vs. 0.52 seconds) but nowhere near the factor of 5 observed on the 64c128 lattice on 4 Juwels Booster nodes.
Sorry to hear of the issue. Can you also include the profile.tsv files? (perhaps you already have, I’m just trying to get my ducks in a row for when I’m at my computer 🙃)
I had a look this evening. Nothing sticks out looking at the tune caches. Hopefully the profile.tsv files will be more revealing in this regard.
I had not let the jobs run all the way through, so I didn't actually have the profile files. I attach them below together with the corresponding tunecaches.
bad: tsv_b50f16a55.zip
good: tsv_be1766a51.zip
Thanks for the profiles.
There doesn't seem to be any performance regression; however, I can see from the profiles that significantly more iterations are being spent on the intermediate grid. E.g., from the first line of the profiles, the number of coarse dslash applications on the intermediate grid increases from ~37k to ~41k.
One unexpected thing I see from looking at the profiles is that even-odd preconditioning doesn't seem to be applied at all on the coarsest grids. Is that intentional for some reason, or just an oversight?
I wonder if I can reproduce this issue using the multigrid_evolve_test internal to QUDA....
Can you give me instructions for how to reproduce this issue using tmLQCD as the driver? That might be the best way forward here.
Thanks for the profiles.
Thanks for taking a look!
One unexpected thing I see from looking at the profiles is that even-odd preconditioning doesn't seem to be applied at all on the coarsest grids. Is that intentional for some reason, or just an oversight?
Darn, that's an artefact from a test setup that I had in the input file. As a result, most of the MG inversions come from a correlator measurement at the end of the trajectory rather than from the HMC. For (unrelated) testing reasons, I had EO-precon turned off in the measurement... Rerunning now with the measurements turned off, sorry for wasting your time like this.
I wonder if I can reproduce this issue using the multigrid_evolve_test internal to QUDA....
It's on my TODO list to see what happens there. I don't see a reason why it should not be possible, since at least with tmLQCD as a driver the only thing that seems to be affected is updateMultigridQuda. I just don't have these versions of QUDA compiled with tests enabled, and it always takes me a while to translate between the tmLQCD input file and the QUDA command line arguments.
Can you give me instructions for how to reproduce this issue using tmLQCD as the driver? That might be the best way forward here.
Absolutely. Since I had to do this recently for an (unrelated, I think) problem for JSC support (our HMC apparently triggers a CPU bug which takes down nodes...) I can prepare a pre-packaged QUDA+tmLQCD bundle complete with a prepared bootstrapping script for compilation, jobscripts and a thermalised starting configuration.
One unexpected thing I see from looking at the profiles is that even-odd preconditioning doesn't seem to be applied at all on the coarsest grids. Is that intentional for some reason, or just an oversight?
Darn, that's an artefact from a test setup that I had in the input file. As a result, most of the MG inversions come from a correlator measurement at the end of the trajectory rather than from the HMC. For (unrelated) testing reasons, I had EO-precon turned off in the measurement... Rerunning now with the measurements turned off, sorry for wasting your time like this.
Actually, this might be a genuine surprise after all, since the profile from the run without the correlator measurement at the end looks very similar to the one with the measurement. Can I ask how you see this from the profile?
The reason that I say that it's surprising is that in the HMC,
1) our outer solve type is inv_param.solve_type = QUDA_DIRECT_PC_SOLVE,
2) in the instance of QudaInvertParam for the MG metadata we set
mg_inv_param->solve_type = QUDA_DIRECT_SOLVE;
mg_inv_param->solution_type = QUDA_MAT_SOLUTION;
which should be correct according to https://github.com/lattice/quda/wiki/Multigrid-Solver,
and then 3) the coarse solution type is
mg_param->coarse_grid_solution_type[level] = (level == 0 && inv_param.solve_type == QUDA_DIRECT_SOLVE) ?
QUDA_MAT_SOLUTION :
QUDA_MATPC_SOLUTION;
such that with the present setup, coarse_grid_solution_type should be QUDA_MATPC_SOLUTION on all levels. For the smoother we set
mg_param->smoother_solve_type[level] = QUDA_DIRECT_PC_SOLVE;
and we also set
mg_inv_param->matpc_type = QUDA_MATPC_ODD_ODD;
(These settings are collected in the sketch below.)
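A hedged sketch of the above against the standard QUDA C interface (not a verbatim excerpt from tmLQCD; the function name and argument passing are made up for illustration):

```c
#include <quda.h>

/* Sketch of the solve/solution-type combination described above
 * (for illustration only, not copied from tmLQCD). */
void sketch_mg_param_setup(QudaInvertParam *inv_param, QudaInvertParam *mg_inv_param,
                           QudaMultigridParam *mg_param)
{
  /* 1) outer solve type */
  inv_param->solve_type = QUDA_DIRECT_PC_SOLVE;

  /* 2) MG metadata */
  mg_inv_param->solve_type = QUDA_DIRECT_SOLVE;
  mg_inv_param->solution_type = QUDA_MAT_SOLUTION;
  mg_inv_param->matpc_type = QUDA_MATPC_ODD_ODD;
  mg_param->invert_param = mg_inv_param;

  /* 3) coarse-grid solution type and smoother solve type per level */
  for (int level = 0; level < mg_param->n_level; level++) {
    /* with inv_param->solve_type == QUDA_DIRECT_PC_SOLVE this evaluates
     * to QUDA_MATPC_SOLUTION on every level */
    mg_param->coarse_grid_solution_type[level] =
      (level == 0 && inv_param->solve_type == QUDA_DIRECT_SOLVE) ? QUDA_MAT_SOLUTION : QUDA_MATPC_SOLUTION;
    mg_param->smoother_solve_type[level] = QUDA_DIRECT_PC_SOLVE;
  }
}
```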
The way we set up and call the solver is a bit all over the place for historical reasons. In the HMC, this is the entry point in the calculation of the fermionic force for the degenerate doublet: https://github.com/etmc/tmLQCD/blob/1274c22074db2f7c476401de6f3b0dc6294e3dce/quda_interface.c#L2075 and then this super-convoluted construct https://github.com/etmc/tmLQCD/blob/1274c22074db2f7c476401de6f3b0dc6294e3dce/quda_interface.c#L1327 sets the parameters and generates the MG setup if required. As far as I can tell, what I said above in https://github.com/lattice/quda/issues/1287#issuecomment-1148546141 should be correct.
Updated profiles and tunecaches. I've included the log files (not that they will necessarily be immediately useful to you at this stage).
Good: tsv_out_be1766a51.zip Bad: tsv_out_b50f16a55.zip
Can you give me instructions for how to reproduce this issue using tmLQCD as the driver? That might be the best way forward here.
Absolutely. Since I had to do this recently for an (unrelated, I think) problem for JSC support (our HMC apparently triggers a CPU bug which takes down nodes...) I can prepare a pre-packaged QUDA+tmLQCD bundle complete with a prepared bootstrapping script for compilation, jobscripts and a thermalised starting configuration.
Should I deposit this on nvsocal2 when I've put it together?
@kostrzewa depositing on nvsocal2 is perfect, thanks. Today is a bit of a disaster so I probably won’t get to looking at the above tunecache outputs until tomorrow.
I may be seeing this on another system.... I will go back to the 'before' version and recheck.
So a small update here: I've been chasing some issues in the computeVUV kernel (see #1291 and #1292). In the meantime, @kostrzewa, if you can get your tmLQCD reproducer set up for me that would be great. Thanks.
Thanks for the update @maddyscientist. Collecting the reproducer right now, the week was a bit full and I hadn't found the time yet, sorry about that.
No worries @kostrzewa, we’re all busy people.
I did wonder if you might be able to narrow down the bad commit range a bit? While I was hunting for the issue in #1291, I did find that most of the commits in the two-step tuning PR development (#1260) were broken for multi-GPU, so if you avoided those with git bisect skip we should get the range down significantly. (In retrospect we should have squashed that PR into a single commit.)
@maddyscientist reproducer is landing on nvsocal2 -> /scratch/tmLQCD_updateMultigridQuda_regression.tar
Let me know if things are unclear in the README.md
I did wonder if you might be able to narrow down the bad commit range a bit?
I'm afraid not. I did two git bisect exercises last weekend with some 6-8 steps each and had to skip quite a few commits to find the "last good" and "earliest bad" commits reported above. My first git bisect was all the way back to 227ff8c8 ;)
@maddyscientist When you're testing: don't be surprised about the number of MG iterations coming from the heatbath of the cloverdetratio2light monomial. This is a problem that we still need to figure out and has to do with how to properly re-use the MG setup for systems with large (tm_rho) shifts. Since it has a small impact on overall performance, it's still at the back of my TODO list but will definitely be tackled.
Thanks for the comprehensive instructions @kostrzewa. I see the lattice size here is 64^3x128: does the issue occur on more manageable problem sizes, e.g., one that I can run on my 3x GV100 workstation? A 32^3x64 test for example?
Thanks for the comprehensive instructions @kostrzewa. I see the lattice size here is 64^3x128: does the issue occur on more manageable problem sizes, e.g., one that I can run on my 3x GV100 workstation? A 32^3x64 test for example?
Testing with a 32c64 now and will upload another .tar to nvsocal2 once the test is through.
I'm afraid I can't reproduce it without inter-node comms but I'll upload it anyway, maybe if a particular comms policy is chosen, it will trigger the issue even on a single node.
-> updateMultigridQuda_regression_Jun_2022_cA211_30_32.tar
That’s good data in itself that it’s internode comms triggered.
@mathiaswagner is there anything in your recent nvshmem PR that could lead to overhead somehow for non-NVSHMEM?
@kostrzewa on the smaller lattice if you force it to be inter node does it trigger?
@kostrzewa on the smaller lattice if you force it to be inter node does it trigger?
Indeed it appears so -> nvsocal2:/scratch/juwboost_2_nodes_1gpu_per_node_updateMultigridQuda_regression_Jun_2022_cA211_30_32.tar
left: b50f16a55 (bad) right: be1766a51 (good)
On the left, the solver itself appears to be about 10% slower, but this might be a fluctuation, since these were not the same nodes.
The call to updateMultigridQuda
is slower by more than a factor of 10, however.
[...]
# TM_QUDA: Updating MG Preconditioner Setup for gauge_id: 0.041667 # TM_QUDA: Updating MG Preconditioner Setup for gauge_id: 0.041667
# TM_QUDA: Time for MG_Preconditioner_Setup_Update 6.865442e+00 s level: 4 proc_id: 0 /HMC/clove | # TM_QUDA: Time for MG_Preconditioner_Setup_Update 4.841810e-01 s level: 4 proc_id: 0 /HMC/clove
# TM_QUDA: Time for reorder_spinor_eo_toQuda 2.635404e-02 s level: 4 proc_id: 0 /HMC/cloverdetra | # TM_QUDA: Time for reorder_spinor_eo_toQuda 1.695434e-02 s level: 4 proc_id: 0 /HMC/cloverdetra
GCR: Convergence at 29 iterations, L2 relative residual: iterated = 2.212307e-10, true = 2.21230 | GCR: Convergence at 29 iterations, L2 relative residual: iterated = 2.065740e-10, true = 2.06574
# TM_QUDA: Time for invertQuda 9.237861e-01 s level: 4 proc_id: 0 /HMC/cloverdetratio3light:clov | # TM_QUDA: Time for invertQuda 8.580463e-01 s level: 4 proc_id: 0 /HMC/cloverdetratio3light:clov
# TM_QUDA: Qp solve done: 29 iter / 0.843115 secs = 2956.07 Gflops | # TM_QUDA: Qp solve done: 29 iter / 0.770503 secs = 3279.17 Gflops
# TM_QUDA: Time for reorder_spinor_eo_fromQuda 1.633094e-02 s level: 4 proc_id: 0 /HMC/cloverdet | # TM_QUDA: Time for reorder_spinor_eo_fromQuda 1.623341e-02 s level: 4 proc_id: 0 /HMC/cloverdet
# TM_QUDA: Time for mul_gamma5 1.714020e-04 s level: 4 proc_id: 0 /HMC/cloverdetratio3light:clov | # TM_QUDA: Time for mul_gamma5 2.321620e-04 s level: 4 proc_id: 0 /HMC/cloverdetratio3light:clov
# TM_QUDA: Time for reorder_spinor_eo_toQuda 1.702367e-02 s level: 4 proc_id: 0 /HMC/cloverdetra | # TM_QUDA: Time for reorder_spinor_eo_toQuda 1.781010e-02 s level: 4 proc_id: 0 /HMC/cloverdetra
# TM_QUDA: Time for updateMultigridQuda_sign_flip 6.808646e+00 s level: 4 proc_id: 0 /HMC/clover | # TM_QUDA: Time for updateMultigridQuda_sign_flip 5.052035e-01 s level: 4 proc_id: 0 /HMC/clover
GCR: Convergence at 31 iterations, L2 relative residual: iterated = 2.405399e-10, true = 2.40539 | GCR: Convergence at 31 iterations, L2 relative residual: iterated = 2.408186e-10, true = 2.40818
# TM_QUDA: Time for invertQuda 1.058032e+00 s level: 4 proc_id: 0 /HMC/cloverdetratio3light:clov | # TM_QUDA: Time for invertQuda 1.031578e+00 s level: 4 proc_id: 0 /HMC/cloverdetratio3light:clov
# TM_QUDA: Time for reorder_spinor_eo_fromQuda 2.990364e-02 s level: 4 proc_id: 0 /HMC/cloverdet | # TM_QUDA: Time for reorder_spinor_eo_fromQuda 1.599799e-02 s level: 4 proc_id: 0 /HMC/cloverdet
# TM_QUDA: QpQm solve done: 31 iter / 1.0057 secs = 3046.22 Gflops | # TM_QUDA: QpQm solve done: 31 iter / 0.942064 secs = 3251.93 Gflops
[...]
Some more detective work: the regression was introduced by one of these pull requests:
@kostrzewa I know this is a bit of a pain, but could you possibly test the merge points associated with these merges for the first four PRs above? These should all be stable. This will help immensely with narrowing this down.
Regarding running the tmLQCD reproducer. I ran into problems at the linking stage with tmLQCD:
/usr/bin/ld: ./lib/libhmc.a(update_momenta_fg.o): in function `update_momenta_fg':
update_momenta_fg.c:(.text+0x188): undefined reference to `calculate_fg'
collect2: error: ld returned 1 exit status
/install_dir/libmake: *** Waiting for unfinished jobs....
/usr/bin/ld: ./lib/libhmc.a(update_momenta_fg.o): in function `update_momenta_fg':
update_momenta_fg.c:(.text+0x188): undefined reference to `calculate_fg'
collect2: error: ld returned 1 exit status
make: *** [Makefile:136: offline_measurement] Error 1
/usr/bin/ld: ./lib/libhmc.a(update_momenta_fg.o): in function `update_momenta_fg':
update_momenta_fg.c:(.text+0x188): undefined reference to `calculate_fg'
collect2: error: ld returned 1 exit status
make: *** [Makefile:136: hmc_tm] Error 1
/usr/bin/ld: ./lib/libhmc.a(update_momenta_fg.o): in function `update_momenta_fg':
update_momenta_fg.c:(.text+0x188): undefined reference to `calculate_fg'
collect2: error: ld returned 1 exit status
make: *** [Makefile:136: invert] Error 1
@kostrzewa I know this is a bit of a pain, but could you possibly test the merge points associated with these merges for the first four PRs above? These should all be stable. This will help immensely with narrowing this down.
Sure, this won't be much of a problem. With the 32c64 test these can be run on the development partition so they go through quickly. I might get around to it this weekend.
Regarding running the tmLQCD reproducer. I ran into problems at the linking stage with tmLQCD:
Ugh, our "build system" is really the gift that keeps on giving. There are link order problems basically on every other machine / compiler version that we try it on but I simply haven't found the time to switch us over to CMake or a real autotools build.
In this case I'm a bit surprised, however, since the calculate_fg
function is inline in the same translation unit as the function which cannot find it (update_momenta_fg.c
).
In this case I'm a bit surprised, however, since the calculate_fg function is inline in the same translation unit as the function which cannot find it (update_momenta_fg.c).
I literally hit this issue yesterday; it's related to this: https://stackoverflow.com/questions/19068705/undefined-reference-when-calling-inline-function ; https://stackoverflow.com/questions/16245521/c99-inline-function-in-c-file/16245669#16245669 . tl;dr: "proper" C99 requires that an inline function still has external exposure (which does manifest itself as an issue even when it's literally defined in the same translation unit); static inline gets around this, in very rough terms. GCC is the only compiler that does this "right", and icc and clang apparently don't really care (insofar as they do what the C++ standard does, which is what we're generally used to).
The easiest way I found to "deal" with it, at least to unblock myself, is to delete the inline until it matters.
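For reference, a minimal standalone illustration of the C99 behaviour described above, with made-up file and function names (not from tmLQCD):

```c
/* inline_demo.c -- with plain `inline` and e.g. `gcc -std=c99 -O0 inline_demo.c`,
 * GCC may emit a real call to `twice` while providing no external definition,
 * and linking fails with "undefined reference to `twice'" even though the
 * definition sits in the same translation unit as the caller.  Making it
 * `static inline` (or dropping `inline`) gives the function internal linkage
 * and the program links. */
#include <stdio.h>

static inline int twice(int x) { return 2 * x; }  /* plain `inline` here can fail to link */

int main(void)
{
  printf("%d\n", twice(21));
  return 0;
}
```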
@kostrzewa I know this is a bit of a pain, but could you possibly test the merge points associated with these merges for the first four PRs above? These should all be stable. This will help immensely with narrowing this down.
Just to make sure I've picked the right ones, I'm just now compiling commits:
The easiest way I found to "deal" with it, at least to unblock myself, is to delete the inline until it matters.
@weinbe2 I think that's the way to go, yes. In the case at hand it was a bit of a premature optimisation anyway, I think. The funny thing is that I've encountered these elsewhere in the codebase previously when switching GCC versions, but the current stack has been compiled as is with GCC 8.x, 9.x, 10.x and 11.x and it never complained :)
@maddyscientist I've removed this instance of inline in 8715f0a203b19f2c59af170717eb38aac6bbac61 of https://github.com/etmc/tmlqcd/tree/quda_work_add_actions. You should be able to just replace the tmLQCD source in the reproducer.
Those are the correct commits, thanks @kostrzewa.
One thing @mathiaswagner noticed about the job scripts for tmLQCD is that QUDA’s memory pools are being disabled. What’s the motivation for doing that? Especially disabling the pinned pool allocator could have a significant detrimental effect on performance.
One thing @mathiaswagner noticed about the job scripts for tmLQCD is that QUDA’s memory pools are being disabled. What’s the motivation for doing that? Especially disabling the pinned pool allocator could have a significant detrimental effect on performance.
We had (severe) issues with device memory usage when the device pool allocator was enabled:
https://github.com/etmc/tmLQCD/issues/507
This forced us to run on more GPUs than one would naively expect. At first we believed we were seeing some kind of memory leak, but then it turned out that this was the behaviour of the device pool memory allocator. I then added a mechanism to tmLQCD which disables it by default (by setting the environment variable), with the option of enabling it through the input file.
We never saw any performance degradation, but perhaps today is the day?
In particular, the behaviour described in https://github.com/etmc/tmLQCD/issues/507#issuecomment-1017675188 (QUDA_ENABLE_DEVICE_MEMORY_POOL=0) and in https://github.com/etmc/tmLQCD/issues/507#issuecomment-1017950206 (QUDA_ENABLE_DEVICE_MEMORY_POOL=1) suggests that after a while, memory usage with the device pool allocator settles down, but memory usage is a bit too high for our purposes.
Of course, we could easily switch to QUDA_ENABLE_PINNED_MEMORY_POOL=1 as a default without hitting these issues with device memory usage, if I understand correctly.
It’s probably fine to run without the device allocator. There is some performance effect from this but most allocations in the mg solver are persistent so it should be minimized. The general practice I follow is to enable it unless pushed for memory and only disable it if needed.
The pinned memory pool should never be disabled though unless one is debugging for some reason. On some systems allocating pinned host memory can be expensive, though perhaps it’s not a problem on Booster.
The pinned memory pool should never be disabled though unless one is debugging for some reason. On some systems allocating pinned host memory can be expensive, though perhaps it’s not a problem on Booster.
This is good to know, I'll change the default in tmLQCD. Thanks!
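For what it's worth, the change of default amounts to setting these environment variables before QUDA is initialised; a rough sketch of how that could look on the tmLQCD side (the helper and its flag are hypothetical, not the actual tmLQCD code):

```c
#include <stdlib.h>

/* Hypothetical helper: keep the pinned host memory pool enabled, and only
 * enable the device memory pool when the input file asks for it (see
 * tmLQCD issue #507 for the memory-usage problems that motivated disabling
 * the device pool by default).  Must run before initQuda() reads the
 * environment. */
static void set_quda_pool_defaults(int enable_device_pool)
{
  setenv("QUDA_ENABLE_PINNED_MEMORY_POOL", "1", 0); /* don't override an explicit user setting */
  setenv("QUDA_ENABLE_DEVICE_MEMORY_POOL", enable_device_pool ? "1" : "0", 0);
}
```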
I've added jobs to today's set of tests with either or both pools enabled and don't see a significant effect on performance (in particular for the test with the slow updateMultigridQuda below). Here are the results:
| | https://github.com/lattice/quda/commit/00656fb63cf33e4b2fa559ec238d1cdf6ba8de5c | https://github.com/lattice/quda/commit/0123a9e1d9422ee8b6b3b1a48307b881a6cac53b | https://github.com/lattice/quda/commit/8e245159ebba8febfb8e493d1e3a2e98cfd7fbbb | https://github.com/lattice/quda/commit/2f9d800e47ff80a1d5986fcf20767c561928d4b9 | https://github.com/lattice/quda/commit/345b3335c7b3c8d61812c608730bc2696679ef04 |
|---|---|---|---|---|---|
| updateMultigridQuda time | 6.8 s | 0.48 s | 0.48 s | 0.49 s | 0.48 s |
| inversion time | 0.8 s | 0.78 s | 0.8 s | 0.8 s | 0.75 s |
It seems that the culprit might be some special interaction of https://github.com/lattice/quda/commit/00656fb63cf33e4b2fa559ec238d1cdf6ba8de5c with the particular configuration of UCX / OpenMPI on the Booster perhaps?
Yes, it seems to be caused by #1228. I don’t think it’s an OpenMPI induced issue since @bjoo saw a similar issue on Perlmutter which I assume is using Cray MPI(?)
@mathiaswagner since this seems to be associated with your PR can you take the lead on understanding this regression?
And thanks @kostrzewa for running these tests. Very helpful for narrowing this down.
Thanks for pinning this down. As this thread has become quite long and I joined late, a few questions:
There are no correctness issues, "just" a performance degradation? Is the regression only in updateMultigridQuda or does it show up anywhere else?
No correctness issues, and there are no apparent regressions in the tunecaches or profiles.
The tsv files are included above https://github.com/lattice/quda/issues/1287#issuecomment-1148567556
@mathiaswagner @maddyscientist Thanks for looking into it! I will check with my colleagues to see if we can find a moment to test the behaviour also on Marconi 100, just to make sure that it's not a PEBKAC issue on my end (although if Balint also sees this, I guess we'll likely find the same). I don't see any indication that the interface to updateMultigridQuda changed with the nvshmem merge (looking at the definitions of QudaMultigridParam and updateMultigridQuda itself).
There are no correctness issues, "just" a performance degradation? Is the regression only in updateMultigridQuda or does it show up anywhere else?
As far as I can tell, only in updateMultigridQuda. I haven't seen correctness problems.
I've uploaded the outputs of the last round of tests to nvsocal2:/scratch/juwboost_2_nodes_1gpu_per_node_updateMultigridQuda_regression_Jun_2022_pinpoint.tar.gz. These also include all tsv files of these particular tests, so you can better relate them to the table in https://github.com/lattice/quda/issues/1287#issuecomment-1159514762.
I performed the same test on Marconi 100, and I also observed that 00656fb was slower than be1766a in setting up the multigrid and in the inversion itself:
quda be1766a
# TM_QUDA: Time for MG_Preconditioner_Setup 7.883397e+01 s level: 4 proc_id: 0 /HMC/cloverdetratio2light:cloverdetratio_heatbath/solve_degenerate/invert_eo_degenerate_quda/MG_Preconditioner_Setup
# TM_QUDA: Time for invertQuda 2.569772e+01 s level: 4 proc_id: 0 /HMC/cloverdetratio2light:cloverdetratio_heatbath/solve_degenerate/invert_eo_degenerate_quda/invertQuda
# TM_QUDA: Time for invertQuda 1.499274e+00 s level: 4 proc_id: 0 /HMC/cloverdetratio3light:cloverdetratio_heatbath/solve_degenerate/invert_eo_degenerate_quda/invertQuda
# TM_QUDA: Time for MG_Preconditioner_Setup_Update 9.923218e-01 s level: 4 proc_id: 0 /HMC/cloverdetratio2light:cloverdetratio_derivative/solve_degenerate/invert_eo_degenerate_quda/MG_Preconditioner_Setup_Update
# TM_QUDA: Time for invertQuda 1.722216e+00 s level: 4 proc_id: 0 /HMC/cloverdetratio2light:cloverdetratio_derivative/solve_degenerate/invert_eo_degenerate_quda/invertQuda
quda 00656fb
# TM_QUDA: Time for MG_Preconditioner_Setup 1.255415e+02 s level: 4 proc_id: 0 /HMC/cloverdetratio2light:cloverdetratio_heatbath/solve_degenerate/invert_eo_degenerate_quda/MG_Preconditioner_Setup
# TM_QUDA: Time for invertQuda 3.789582e+01 s level: 4 proc_id: 0 /HMC/cloverdetratio2light:cloverdetratio_heatbath/solve_degenerate/invert_eo_degenerate_quda/invertQuda
# TM_QUDA: Time for invertQuda 3.040175e+00 s level: 4 proc_id: 0 /HMC/cloverdetratio3light:cloverdetratio_heatbath/solve_degenerate/invert_eo_degenerate_quda/invertQuda
# TM_QUDA: Time for MG_Preconditioner_Setup_Update 3.156509e+00 s level: 4 proc_id: 0 /HMC/cloverdetratio2light:cloverdetratio_derivative/solve_degenerate/invert_eo_degenerate_quda/MG_Preconditioner_Setup_Update
# TM_QUDA: Time for invertQuda 2.764697e+00 s level: 4 proc_id: 0 /HMC/cloverdetratio2light:cloverdetratio_derivative/solve_degenerate/invert_eo_degenerate_quda/invertQuda
The test was done in
/m100_work/INF22_lqcd123_0/mgarofal/runs/m100_2_nodes_1gpu_per_node_updateMultigridQuda_regression_Jun_2022_cA211_30_32
in the two subdirectories
tmLQCD_164c415dd68faa22d34feb0ea514220413b9c448_quda_00656fb63cf33e4b2fa559ec238d1cdf6ba8de5c
tmLQCD_164c415dd68faa22d34feb0ea514220413b9c448_quda_be1766a519995f8c16b9541c300f883efabc3df8
good: be1766a_profile_0.zip bad: 00656fb_profile_0.zip
Thanks @Marcogarofalo. So on Volta + Power9 we appear to have performance regressions not only in updateMultigridQuda, but also in the MG solver itself.
Could you please check if the CG for cloverdetratio1light was also slower?
Finally, just to make sure: there was no tuning here, correct? These were the second runs (or second trajectories), where everything had already been tuned, right?
Hi, the CG used in cloverdetratio1light has the same speed:
quda be1766a
# TM_QUDA: Time for invertQuda 5.287022e-01 s level: 4 proc_id: 0 /HMC/cloverdetratio1light:cloverdetratio_heatbath/solve_degenerate/invert_eo_degenerate_quda/invertQuda
# TM_QUDA: Time for invertQuda 2.109095e+00 s level: 4 proc_id: 0 /HMC/cloverdetratio1light:cloverdetratio_derivative/solve_degenerate/invert_eo_degenerate_quda/invertQuda
# TM_QUDA: Time for invertQuda 1.758500e+00 s level: 4 proc_id: 0 /HMC/cloverdetratio1light:cloverdetratio_derivative/solve_degenerate/invert_eo_degenerate_quda/invertQuda
# TM_QUDA: Time for invertQuda 2.057996e+00 s level: 4 proc_id: 0 /HMC/cloverdetratio1light:cloverdetratio_derivative/solve_degenerate/invert_eo_degenerate_quda/invertQuda
# TM_QUDA: Time for invertQuda 2.142247e+00 s level: 4 proc_id: 0 /HMC/cloverdetratio1light:cloverdetratio_derivative/solve_degenerate/invert_eo_degenerate_quda/invertQuda
quda 00656fb
# TM_QUDA: Time for invertQuda 5.472805e-01 s level: 4 proc_id: 0 /HMC/cloverdetratio1light:cloverdetratio_heatbath/solve_degenerate/invert_eo_degenerate_quda/invertQuda
# TM_QUDA: Time for invertQuda 2.132907e+00 s level: 4 proc_id: 0 /HMC/cloverdetratio1light:cloverdetratio_derivative/solve_degenerate/invert_eo_degenerate_quda/invertQuda
# TM_QUDA: Time for invertQuda 2.027382e+00 s level: 4 proc_id: 0 /HMC/cloverdetratio1light:cloverdetratio_derivative/solve_degenerate/invert_eo_degenerate_quda/invertQuda
# TM_QUDA: Time for invertQuda 2.224253e+00 s level: 4 proc_id: 0 /HMC/cloverdetratio1light:cloverdetratio_derivative/solve_degenerate/invert_eo_degenerate_quda/invertQuda
# TM_QUDA: Time for invertQuda 1.733319e+00 s level: 4 proc_id: 0 /HMC/cloverdetratio1light:cloverdetratio_derivative/solve_degenerate/invert_eo_degenerate_quda/invertQuda
Finally, just to make sure: there was no tuning here, correct? These were the second runs (or second trajectories), where everything had already been tuned, right?
I think the tuning was already done; I checked again with a second run and observed similar performance.
Hi, the CG used in cloverdetratio1light has the same speed
Thanks! I was asking because of the ~ factor of 2 in the MG setup.
I think the tuning was already done; I checked again with a second run and observed similar performance.
Excellent. In that case we can be pretty sure that we are indeed observing a regression also in the MG solver (at least on Volta + Power9).
@kostrzewa / @Marcogarofalo I found an issue introduced with #1228 and have a fix in the hotfix/shmem_exchangeghost_regression branch. Do you mind giving that a spin?
I don't have a good reproducer for your issue yet, but I noticed this while re-reviewing the code changes.
Do you mind giving that a spin?
Compiling now on Juwels Booster. I should be able to get back to you in a few hours.
hotfix/shmem_exchangeghost_regression does the trick on Juwels Booster. In fact, updateMultigridQuda is about 20% faster now for the 32c64 test case on two GPUs. Thanks!
Excellent news 😄
Excellent!!! I shall pull it 😊
Best B
I've observed a (rather substantial) performance regression in updateMultigridQuda sometime after commit be1766a519995f8c16b9541c300f883efabc3df8. The earliest 'bad' commit (which I could compile and which ran without erroring out) that I could identify is b50f16a55d2eaae816c00812b3996138c5c4dfa9. Running on four nodes of Juwels Booster on a 64c128 lattice at the physical point, calling updateMultigridQuda (without refresh iterations):
good: tunecache_be1766a51.tsv.zip
When it performs like this, the update scales reasonably:
bad: tunecache_b50f16a55.tsv.zip
This poorly performing behaviour doesn't scale. With commit 9bae409b742765563b28939e93fa5c43b03da20b (which happened to be the develop head commit when I started observing this), I have:
This suggests perhaps a constant factor.
Testing reproduction on a different machine now.
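For completeness, the call pattern under discussion looks roughly like this (a sketch assuming the standard QUDA multigrid interface; the driver-side names are made up and this is not the actual tmLQCD code):

```c
#include <quda.h>

/* Sketch: the MG setup is built once and then only *updated* after each
 * gauge-field change during the trajectory; the regression reported here
 * is in the updateMultigridQuda() calls, not in the surrounding solves. */
void hmc_trajectory_sketch(void *gauge, QudaGaugeParam *gauge_param,
                           QudaMultigridParam *mg_param, QudaInvertParam *inv_param,
                           void *x, void *b, int n_md_steps)
{
  loadGaugeQuda(gauge, gauge_param);
  void *mg_preconditioner = newMultigridQuda(mg_param); /* full setup, done once */
  inv_param->preconditioner = mg_preconditioner;

  for (int step = 0; step < n_md_steps; step++) {
    /* ... gauge field evolves on the host side ... */
    loadGaugeQuda(gauge, gauge_param);
    updateMultigridQuda(mg_preconditioner, mg_param); /* <-- the slow call */
    invertQuda(x, b, inv_param);                      /* MG-preconditioned solve */
  }

  destroyMultigridQuda(mg_preconditioner);
}
```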