Closed jcosborn closed 8 months ago
@jcosborn it doesn't appear that the ThreadLocalCache is decomposing the underlying type to ensure memory coalescing when accessing the shared memory on the CUDA target. This will result in shared memory bank conflicts and likely significant drops in performance. Am I reading this correctly?
Yes, it doesn't do coalescing now. I had thought you said it wasn't necessary here, but maybe I misunderstood. I can easily add that back in.
I likely misunderstood, IIRC we discussed that while I was in the Houston convention center with a lot of background noise 😄
If you can add it back, that would be great, thanks.
Coalescing is back in now. The interface had to change a bit since subscripting can't return a reference anymore.
This is somewhat redundant with SharedMemoryCache now. One option is to factor out a common shared memory base that they and thread array could all inherit from. If that seems desirable, I can work on that before copying over to HIP.
Hi @jcosborn. I think your suggestion of factoring out the common shared memory base is a good one. It would also be good to add a generic version of the ThreadLocalCache that doesn't use shared memory and just uses the stack, and verify that it functionally works.
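That stack-only variant can be sketched in a few lines. This is a minimal illustration with invented names, not QUDA's actual ThreadLocalCache interface:

```cpp
#include <cassert>

// Sketch of the stack-backed generic version suggested above: a
// per-thread cache with a save/load surface but no shared memory,
// and hence no need for any sync between threads. Names here are
// illustrative only, not QUDA's real ThreadLocalCache.
template <typename T> struct ThreadLocalCacheStack {
  T data{};  // lives on the thread's stack; nothing is shared
  void save(const T &a) { data = a; }
  T load() const { return data; }
  operator T() const { return load(); }  // convenience conversion
};
```

Since nothing is shared, a target that finds shared memory disadvantageous could swap in this backend without changing call sites.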
@jcosborn out of curiosity, have you thought about/tested what happens if you have both a ThreadLocalCache and a SharedMemoryCache in the same kernel? I don't think there are any kernels that use both, nor can I think of a case for using such functionality on the horizon, but it'd be good to understand in advance at least.
To be clear---I'm not going to block this PR on the possibility that they can't be simultaneously used in the same kernel, but it'd be good to understand in advance (and file an issue on it so we have it on record).
@weinbe2 It should work just fine to use both in the same kernel. By default they would share the same region of shared memory. If they need to use separate memory, then one would need to specify the other as the offset type, similar to what is done here: https://github.com/lattice/quda/pull/1380/files#diff-ca4a459b7ab46afbc002b7039de573e2868bd29ef1459c6ca28fdf91fb27757aR139
The gauge-force test is failing on my Volta:
11: link_precision link_reconstruct space_dim(x/y/z) T_dimension Gauge_order niter
11: single 18 2/4/6 8 qdp 100
11: ERROR: qudaLaunchKernel returned an illegal memory access was encountered
11: (/home/kate/github/quda-mg/lib/targets/cuda/quda_api.cpp:152 in qudaLaunchKernel())
11: (rank 0, host nvsocal2, quda_api.cpp:58 in void quda::target::cuda::set_runtime_error(cudaError_t, const char *, const char *, const char *, const char *, bool)())
11: last kernel called was (name=N4quda10ForceGaugeIfLi3EL21QudaReconstructType_s18ELb1EEE,volume=2x4x6x8,aux=GPU-offline,vol=384stride=192precision=4geometry=4Nc=3r=0000,num_paths=48,comm=0000)
1/1 Test #11: gauge_path_single ................***Failed 2.86 sec
All tests working for you @jcosborn ?
It works on my desktop with GeForce. How is your test different from the ones in CI?
I think I've just discovered the issue: the prior implementation of thread_array used static shared memory, and not dynamic. So kernels are being launched without having set the required shared memory.
Adding unsigned int sharedBytesPerThread() const { return 4 * sizeof(int); } to gauge_force.cu fixes the issue.
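For context, the shape of that fix can be sketched in isolation. TunableSketch and sharedBytesFor below are invented names that approximate, but do not reproduce, QUDA's Tunable interface:

```cpp
#include <cassert>

// Illustrative sketch: the launcher sizes its dynamic shared memory
// request from a per-thread byte count. A kernel whose thread_array
// moved from static to dynamic shared memory must report those bytes,
// or it will index past a zero-sized allocation (the illegal memory
// access seen in the gauge-force test).
struct TunableSketch {
  // default: no dynamic shared memory requested
  virtual unsigned sharedBytesPerThread() const { return 0; }
  unsigned sharedBytesFor(unsigned block_size) const {
    return sharedBytesPerThread() * block_size;
  }
  virtual ~TunableSketch() = default;
};

// the fix applied to gauge_force.cu: account for thread_array<int, 4>
struct GaugeForceSketch : TunableSketch {
  unsigned sharedBytesPerThread() const override { return 4 * sizeof(int); }
};
```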
Getting a subsequent error that I'm now tracking down....
I'm just running ctest ....
And I've discovered a problem kernel: the improved stout kernel uses both thread_array and ThreadLocalCache, with overlap between these allocations, since the thread_array is now placed in dynamic shared memory but is unaware of any offsets.
And I've confirmed that all issues go away if I make thread_array a "dumb" thread array (i.e., one that does not use shared memory), and ctest passes without issue.
Ok, glad that you found this. I'll fix it in the next update. With kernel tagging, errors like this would be harder to miss. How did you manage to find this when all the CI tests passed?
Simple: they didn’t all pass 😄
In case it wasn't clear, I'm referring to the run-time tests, not the build-time tests. E.g., with a simple staggered build:
QUDA_RESOURCE_PATH=. ctest
Test project /scratch2/kate/quda-mg
Start 1: blas_test_parity_staggered
1/15 Test #1: blas_test_parity_staggered .............. Passed 4.25 sec
Start 2: blas_test_full_staggered
2/15 Test #2: blas_test_full_staggered ................ Passed 4.63 sec
Start 3: blas_interface_test
3/15 Test #3: blas_interface_test ..................... Passed 3.14 sec
Start 4: dslash_staggered_matpc_policytune
4/15 Test #4: dslash_staggered_matpc_policytune ....... Passed 3.00 sec
Start 5: dslash_staggered_mat_policytune
5/15 Test #5: dslash_staggered_mat_policytune ......... Passed 2.78 sec
Start 6: benchmark_dslash_staggered_policytune
6/15 Test #6: benchmark_dslash_staggered_policytune ...***Not Run (Disabled) 0.00 sec
Start 7: dslash_asqtad_matpc_policytune
7/15 Test #7: dslash_asqtad_matpc_policytune .......... Passed 3.03 sec
Start 8: dslash_asqtad_mat_policytune
8/15 Test #8: dslash_asqtad_mat_policytune ............ Passed 2.94 sec
Start 9: benchmark_dslash_asqtad_policytune
9/15 Test #9: benchmark_dslash_asqtad_policytune ......***Not Run (Disabled) 0.00 sec
Start 10: dslash_asqtad_build_policytune
10/15 Test #10: dslash_asqtad_build_policytune .......... Passed 5.33 sec
Start 11: gauge_path_single
11/15 Test #11: gauge_path_single .......................***Failed 2.90 sec
Start 12: unitarize_link_single
12/15 Test #12: unitarize_link_single ................... Passed 3.77 sec
Start 13: gauge_alg_single
13/15 Test #13: gauge_alg_single ........................ Passed 3.18 sec
Start 14: dilution_test_single
14/15 Test #14: dilution_test_single .................... Passed 3.39 sec
Start 15: tune_test
15/15 Test #15: tune_test ............................... Passed 3.13 sec
where gauge_path_test is the offender.
@jcosborn I have pushed some fixes:
- Some legacy issues with the autotuning for some kernel types when using dynamic shared memory
- Correct sharedBytesPerThread() for the kernels that use thread_array
The main issue that is left is with improved Stout and Symanzik improved Wilson Flow smearing. These kernels use both ThreadLocalCache and thread_array (in gauge_utils.cuh), which are unaware of each other at present, and so end up with overlapping shared memory allocations. Perhaps you can think about how to resolve this. To test these kernels use:
tests/su3_test --dim 16 16 16 16 --su3-smear-type ovrimp-stout
and
tests/su3_test --dim 16 16 16 16 --su3-smear-type symanzik
@maddyscientist I think the issue is that the overimproved routines create ThreadLocalCaches, for example:
ThreadLocalCache<Link> Stap;
ThreadLocalCache<Link,0,decltype(Stap)> Rect; // offset by Stap type to ensure non-overlapping allocations
But then the calls to computeStaple or computeStapleRectangle, defined in include/kernels/gauge_utils.cuh, create a thread_array<int, 4> (see https://github.com/lattice/quda/blob/develop/include/kernels/gauge_utils.cuh#L28 ). In the past that was static shared memory, so there was no issue, but now that it's backed by dynamic shared memory it needs to know the offsets, otherwise it'll trample the shared memory also used by Stap and Rect (noted above).
Not sure your point @weinbe2, that’s exactly what I was saying above. Unless I’m missing something
Oh that's what happens when I only skim your post above and don't actually read the details. Don't mind me.
@jcosborn maybe I'm missing an implementation detail, but how do you envision handling a case where you need a shared memory cache for three different objects? I.e., for conversation's sake, one for a Link, one for a double, and one for an int?
I see you already have ways to handle this in the case where you have multiple Links (in the HISQ force, for example), but I'm not so sure what you'd do in a mixed case --- maybe something like tuple<Link, double> as the offset for the int, assuming the tuple will give an appropriate size (I forget).
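For what it's worth, the size assumption in that guess can be checked standalone: sizeof a std::tuple is at least the sum of its members' sizes (padding may add more), so a tuple offset type would reserve enough room for both prior caches. Link below is a made-up stand-in; none of this is the QUDA implementation:

```cpp
#include <cassert>
#include <cstddef>
#include <tuple>

// Hypothetical mixed case: caches for a Link and a double already
// occupy the start of the region, and an int cache uses
// tuple<Link, double> as its offset type. Does sizeof(tuple) cover
// both? Link is a stand-in for a 3x3 complex matrix of double.
struct Link { double v[18]; };  // 144 bytes

// the candidate byte offset for the int cache
constexpr std::size_t offset_for_int = sizeof(std::tuple<Link, double>);
```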
I think this fixes the overlapping shared mem issue, though I didn't see it fail before on my GeForce, or on a V100, and I'm not sure why.
@weinbe2 If you look at the fix https://github.com/lattice/quda/pull/1380/files#diff-ca4a459b7ab46afbc002b7039de573e2868bd29ef1459c6ca28fdf91fb27757aR139 : Stap uses computeStapleRectangleOps (which is array<int,4>) as an offset, and Rect uses the Stap type as an offset (which will include its offset as part of the type), so Rect will be offset by the size of both types.
What tests were you running, James, when you didn't see failures? Anyway, I've confirmed this fixes the issue at my end.
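The recursive offset computation described here can be mimicked in a few lines of standalone C++. CacheSketch and its members are invented for illustration and count bytes per thread only; they are not QUDA's ThreadLocalCache:

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <type_traits>

struct Link { double v[18]; };  // stand-in payload

// Sketch: each cache's starting byte within the shared memory region
// is its offset type's start plus that type's own storage, so chaining
// Ops -> Stap -> Rect yields non-overlapping regions automatically.
template <typename T, typename Offset = void> struct CacheSketch {
  using value_type = T;
  static constexpr std::size_t start() {
    if constexpr (std::is_void_v<Offset>)
      return 0;  // no offset type: start at byte 0
    else
      return Offset::start() + sizeof(typename Offset::value_type);
  }
};

// the thread_array<int, 4> used inside computeStapleRectangle goes first
using Ops  = CacheSketch<std::array<int, 4>>;
// Stap is offset past the thread_array's storage
using Stap = CacheSketch<Link, Ops>;
// Rect's offset type is Stap, whose start() already includes Ops
using Rect = CacheSketch<Link, Stap>;
```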
Understood, thank you, it wasn't immediately clear to me from the implementation that it would "understand" recursive offsets (but I should have looked harder, too).
@maddyscientist Ah, I was just running the test suite, but I see the tests you mention aren't in it. Should they be added?
I think all the comments and issues we discussed are addressed now, and it passes my tests (except for the DW issues I've reported in #1410).
This is ready for review now.
This looks good to me. @weinbe2 can you sign off from your end?
Looks good! I left a few little comments but it should all be straightforward. I'm not sure if this PR still needs a clang-format.
I'm doing a quick ctest on my end to cover what the cscs run didn't get through, plus multigrid. Assuming there are no surprises (I'm not particularly worried, but famous last words) this will be good to go.
Did the Ls hotfix get merged in? I'm seeing some ctest failures for DWF, all of which look familiar. If it wasn't merged in yet, please merge develop back in and otherwise disregard what I have below.
If it has been merged in, though, it looks like we may have a fresh issue:
23 - invert_test_mobius_eofa_sym_single (Failed)
---
[ FAILED ] HeavyQuarkEvenOdd/InvertTest.verify/cg_mat_pc_normop_pc_single_l2_heavy_quark, where GetParam() = (0, 2, 3, 4, 1, 1, (-2147483648, -2147483648, -2147483648), 5)
[ FAILED ] HeavyQuarkEvenOdd/InvertTest.verify/cg_mat_pc_normop_pc_single_solution_accumulator_pipeline8_l2_heavy_quark, where GetParam() = (0, 2, 3, 4, 1, 8, (-2147483648, -2147483648, -2147483648), 5)
[ FAILED ] HeavyQuarkEvenOdd/InvertTest.verify/cg_mat_pc_normop_pc_half_l2_heavy_quark, where GetParam() = (0, 2, 3, 2, 1, 1, (-2147483648, -2147483648, -2147483648), 5)
[ FAILED ] HeavyQuarkEvenOdd/InvertTest.verify/cg_mat_pc_normop_pc_half_solution_accumulator_pipeline8_l2_heavy_quark, where GetParam() = (0, 2, 3, 2, 1, 8, (-2147483648, -2147483648, -2147483648), 5)
26 - invert_test_domain_wall_double (Failed)
---
[ FAILED ] 6 tests, listed below:
[ FAILED ] HeavyQuarkEvenOdd/InvertTest.verify/cg_mat_pc_normop_pc_double_l2_heavy_quark, where GetParam() = (0, 2, 3, 8, 1, 1, (-2147483648, -2147483648, -2147483648), 5)
[ FAILED ] HeavyQuarkEvenOdd/InvertTest.verify/cg_mat_pc_normop_pc_double_solution_accumulator_pipeline8_l2_heavy_quark, where GetParam() = (0, 2, 3, 8, 1, 8, (-2147483648, -2147483648, -2147483648), 5)
[ FAILED ] HeavyQuarkEvenOdd/InvertTest.verify/cg_mat_pc_normop_pc_single_l2_heavy_quark, where GetParam() = (0, 2, 3, 4, 1, 1, (-2147483648, -2147483648, -2147483648), 5)
[ FAILED ] HeavyQuarkEvenOdd/InvertTest.verify/cg_mat_pc_normop_pc_single_solution_accumulator_pipeline8_l2_heavy_quark, where GetParam() = (0, 2, 3, 4, 1, 8, (-2147483648, -2147483648, -2147483648), 5)
[ FAILED ] HeavyQuarkEvenOdd/InvertTest.verify/cg_mat_pc_normop_pc_half_l2_heavy_quark, where GetParam() = (0, 2, 3, 2, 1, 1, (-2147483648, -2147483648, -2147483648), 5)
[ FAILED ] HeavyQuarkEvenOdd/InvertTest.verify/cg_mat_pc_normop_pc_half_solution_accumulator_pipeline8_l2_heavy_quark, where GetParam() = (0, 2, 3, 2, 1, 8, (-2147483648, -2147483648, -2147483648), 5)
In all cases, it looks like it's very close to converging, representative:
Solution = mat_pc, Solve = normop_pc, Solver = cg, Sloppy precision = double
WARNING: CG: Restarting without reliable updates for heavy-quark residual (total #inc 1)
CG: Convergence at 71 iterations, L2 relative residual: iterated = 6.754295e-13, true = 6.754295e-13 (requested = 1.000000e-12), heavy-quark residual = 7.197024e-13 (requested = 1.000000e-12)
Done: 71 iter / 0.010148 secs = 38.7384 Gflops
Residuals: (L2 relative) tol 1.000000e-12, QUDA = 6.754295e-13, host = 1.091589e-12; (heavy-quark) tol 1.000000e-12, QUDA = 7.197024e-13
/scratch/local/quda/tests/invert_test_gtest.hpp:146: Failure
Expected: (rsd[0]) <= (tol), actual: 1.09159e-12 vs 1e-12
MG is good!
The Ls hotfix wasn't in yet. I just merged develop, so that should be there now.
This adds a dedicated thread-local cache object that can use shared memory for storage. It is distinct from SharedMemoryCache in that there is no sharing among threads, which simplifies the interface and means no sync operation is needed. Since sharing isn't needed, targets can choose not to use shared memory if that is advantageous. Note that thread_array could be replaced by this in the future, but that is not being done here.
TODO: add HIP version. I'll add HIP once the CUDA version is settled.