Closed jcosborn closed 8 months ago
@jcosborn it doesn't appear that the ThreadLocalCache is decomposing the underlying type to ensure memory coalescing when accessing the shared memory on the CUDA target. This will result in shared memory bank conflicts and likely significant drops in performance. Am I reading this correctly?
Yes, it doesn't do coalescing now. I had thought you said it wasn't necessary here, but maybe I misunderstood. I can easily add that back in.
I likely misunderstood, IIRC we discussed that while I was in the Houston convention center with a lot of background noise 😄
If you can add it back, that would be great, thanks.
Coalescing is back in now. The interface had to change a bit since subscripting can't return a reference anymore.
This is somewhat redundant with SharedMemoryCache now. One option is to factor out a common shared memory base that they and thread array could all inherit from. If that seems desirable, I can work on that before copying over to HIP.
Hi @jcosborn. I think your suggestion of factoring out the common shared memory base is a good one. It would also be good to add a generic version of the ThreadLocalCache that doesn't use shared memory and just uses the stack, and verify that it functionally works.
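That stack-only variant can be sketched in a few lines. This is a minimal illustration with invented names, not QUDA's actual ThreadLocalCache interface:

```cpp
#include <cassert>

// Sketch of the stack-backed generic version suggested above: a
// per-thread cache with a save/load surface but no shared memory,
// and hence no need for any sync between threads. Names here are
// illustrative only, not QUDA's real ThreadLocalCache.
template <typename T> struct ThreadLocalCacheStack {
  T data{};  // lives on the thread's stack; nothing is shared
  void save(const T &a) { data = a; }
  T load() const { return data; }
  operator T() const { return load(); }  // convenience conversion
};
```

Since nothing is shared, a target that finds shared memory disadvantageous could swap in this backend without changing call sites.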
@jcosborn out of curiosity, have you thought about/tested what happens if you have both a ThreadLocalCache and a SharedMemoryCache in the same kernel? I don't think there are any kernels that use both, nor can I think of a case for using such functionality on the horizon, but it'd be good to understand in advance at least.
To be clear---I'm not going to block this PR on the possibility that they can't be simultaneously used in the same kernel, but it'd be good to understand in advance (and file an issue on it so we have it on record).
@weinbe2 It should work just fine to use both in the same kernel. By default they would share the same region of shared memory. If they need to use separate memory, then one would need to specify the other as the offset type, similar to what is done here: https://github.com/lattice/quda/pull/1380/files#diff-ca4a459b7ab46afbc002b7039de573e2868bd29ef1459c6ca28fdf91fb27757aR139
The gauge-force test is failing on my Volta:
11: link_precision link_reconstruct space_dim(x/y/z) T_dimension Gauge_order niter
11: single 18 2/4/6 8 qdp 100
11: ERROR: qudaLaunchKernel returned an illegal memory access was encountered
11: (/home/kate/github/quda-mg/lib/targets/cuda/quda_api.cpp:152 in qudaLaunchKernel())
11: (rank 0, host nvsocal2, quda_api.cpp:58 in void quda::target::cuda::set_runtime_error(cudaError_t, const char *, const char *, const char *, const char *, bool)())
11: last kernel called was (name=N4quda10ForceGaugeIfLi3EL21QudaReconstructType_s18ELb1EEE,volume=2x4x6x8,aux=GPU-offline,vol=384stride=192precision=4geometry=4Nc=3r=0000,num_paths=48,comm=0000)
1/1 Test #11: gauge_path_single ................***Failed 2.86 sec
All tests working for you @jcosborn ?
It works on my desktop with GeForce. How is your test different from the ones in CI?
I think I've just discovered the issue: the prior implementation of thread_array used static shared memory, and not dynamic. So kernels are being launched without having set the required shared memory.
Adding unsigned int sharedBytesPerThread() const { return 4 * sizeof(int); } to gauge_force.cu fixes the issue.
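For context, the shape of that fix can be sketched in isolation. TunableSketch and sharedBytesFor below are invented names that approximate, but do not reproduce, QUDA's Tunable interface:

```cpp
#include <cassert>

// Illustrative sketch: the launcher sizes its dynamic shared memory
// request from a per-thread byte count. A kernel whose thread_array
// moved from static to dynamic shared memory must report those bytes,
// or it will index past a zero-sized allocation (the illegal memory
// access seen in the gauge-force test).
struct TunableSketch {
  // default: no dynamic shared memory requested
  virtual unsigned sharedBytesPerThread() const { return 0; }
  unsigned sharedBytesFor(unsigned block_size) const {
    return sharedBytesPerThread() * block_size;
  }
  virtual ~TunableSketch() = default;
};

// the fix applied to gauge_force.cu: account for thread_array<int, 4>
struct GaugeForceSketch : TunableSketch {
  unsigned sharedBytesPerThread() const override { return 4 * sizeof(int); }
};
```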
Getting a subsequent error that I'm now tracking down....
I'm just running ctest ....
And I've discovered a problem kernel: the improved stout kernel uses both thread_array and ThreadLocalCache, with overlap between these allocations, since the thread_array is now placed in dynamic shared memory but is unaware of any offsets.
And I've confirmed that all issues go away if I make thread_array a "dumb" thread array (i.e., one that does not use shared memory), and ctest passes without issue.
Ok, glad that you found this. I'll fix it in the next update. With kernel tagging, errors like this would be harder to miss. How did you manage to find this when all the CI tests passed?
Simple: they didn’t all pass 😄
In case it wasn't clear, I'm referring to the run-time tests, not the build-time tests. E.g., with a simple staggered build:
QUDA_RESOURCE_PATH=. ctest
Test project /scratch2/kate/quda-mg
Start 1: blas_test_parity_staggered
1/15 Test #1: blas_test_parity_staggered .............. Passed 4.25 sec
Start 2: blas_test_full_staggered
2/15 Test #2: blas_test_full_staggered ................ Passed 4.63 sec
Start 3: blas_interface_test
3/15 Test #3: blas_interface_test ..................... Passed 3.14 sec
Start 4: dslash_staggered_matpc_policytune
4/15 Test #4: dslash_staggered_matpc_policytune ....... Passed 3.00 sec
Start 5: dslash_staggered_mat_policytune
5/15 Test #5: dslash_staggered_mat_policytune ......... Passed 2.78 sec
Start 6: benchmark_dslash_staggered_policytune
6/15 Test #6: benchmark_dslash_staggered_policytune ...***Not Run (Disabled) 0.00 sec
Start 7: dslash_asqtad_matpc_policytune
7/15 Test #7: dslash_asqtad_matpc_policytune .......... Passed 3.03 sec
Start 8: dslash_asqtad_mat_policytune
8/15 Test #8: dslash_asqtad_mat_policytune ............ Passed 2.94 sec
Start 9: benchmark_dslash_asqtad_policytune
9/15 Test #9: benchmark_dslash_asqtad_policytune ......***Not Run (Disabled) 0.00 sec
Start 10: dslash_asqtad_build_policytune
10/15 Test #10: dslash_asqtad_build_policytune .......... Passed 5.33 sec
Start 11: gauge_path_single
11/15 Test #11: gauge_path_single .......................***Failed 2.90 sec
Start 12: unitarize_link_single
12/15 Test #12: unitarize_link_single ................... Passed 3.77 sec
Start 13: gauge_alg_single
13/15 Test #13: gauge_alg_single ........................ Passed 3.18 sec
Start 14: dilution_test_single
14/15 Test #14: dilution_test_single .................... Passed 3.39 sec
Start 15: tune_test
15/15 Test #15: tune_test ............................... Passed 3.13 sec
where gauge_path_test is the offender.
@jcosborn I have pushed some fixes:
- Some legacy issues with the autotuning for some kernel types when using dynamic shared memory
- Correct sharedBytesPerThread() for the kernels that use thread_array
The main issue that is left is with improved Stout and Symanzik improved Wilson Flow smearing. These kernels use both ThreadLocalCache and thread_array (in gauge_utils.cuh), which are unaware of each other at present, and so end up with overlapping shared memory allocations. Perhaps you can think about how to resolve this. To test these kernels use:
tests/su3_test --dim 16 16 16 16 --su3-smear-type ovrimp-stout
and
tests/su3_test --dim 16 16 16 16 --su3-smear-type symanzik
@maddyscientist I think the issue is that the overimproved routines create ThreadLocalCaches, for example:
ThreadLocalCache<Link> Stap;
ThreadLocalCache<Link,0,decltype(Stap)> Rect; // offset by Stap type to ensure non-overlapping allocations
But then the calls to computeStaple or computeStapleRectangle, defined in include/kernels/gauge_utils.cuh, create a thread_array<int, 4> (see https://github.com/lattice/quda/blob/develop/include/kernels/gauge_utils.cuh#L28 ). In the past that was static shared memory, so there was no issue, but now that it's backed by dynamic shared memory it needs to know the offsets, otherwise it'll trample the shared memory also used by Stap and Rect (noted above).
Not sure your point @weinbe2, that’s exactly what I was saying above. Unless I’m missing something
Oh that's what happens when I only skim your post above and don't actually read the details. Don't mind me.
@jcosborn maybe I'm missing an implementation detail, but how do you envision handling a case where you need a shared memory cache for three different objects? I.e., for conversation's sake, one for a Link, one for a double, and one for an int?
I see you already have ways to handle this in the case where you have multiple Links (in the HISQ force, for example), but I'm not so sure what you'd do in a mixed case --- maybe something like tuple<Link, double> as the offset for the int, assuming the tuple will give an appropriate size (I forget).
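For what it's worth, the size assumption in that guess can be checked standalone: sizeof a std::tuple is at least the sum of its members' sizes (padding may add more), so a tuple offset type would reserve enough room for both prior caches. Link below is a made-up stand-in; none of this is the QUDA implementation:

```cpp
#include <cassert>
#include <cstddef>
#include <tuple>

// Hypothetical mixed case: caches for a Link and a double already
// occupy the start of the region, and an int cache uses
// tuple<Link, double> as its offset type. Does sizeof(tuple) cover
// both? Link is a stand-in for a 3x3 complex matrix of double.
struct Link { double v[18]; };  // 144 bytes

// the candidate byte offset for the int cache
constexpr std::size_t offset_for_int = sizeof(std::tuple<Link, double>);
```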
I think this fixes the overlapping shared mem issue, though I didn't see it fail before on my GeForce, or on a V100, and I'm not sure why.
@weinbe2 If you look at the fix https://github.com/lattice/quda/pull/1380/files#diff-ca4a459b7ab46afbc002b7039de573e2868bd29ef1459c6ca28fdf91fb27757aR139 : Stap uses computeStapleRectangleOps (which is array<int,4>) as an offset, and Rect uses the Stap type as an offset (which will include its offset as part of the type), so Rect will be offset by the size of both types.
What tests were you running, James, when you didn't see failures? Anyway, I've confirmed this fixes the issue at my end.
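The recursive offset computation described here can be mimicked in a few lines of standalone C++. CacheSketch and its members are invented for illustration and count bytes per thread only; they are not QUDA's ThreadLocalCache:

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <type_traits>

struct Link { double v[18]; };  // stand-in payload

// Sketch: each cache's starting byte within the shared memory region
// is its offset type's start plus that type's own storage, so chaining
// Ops -> Stap -> Rect yields non-overlapping regions automatically.
template <typename T, typename Offset = void> struct CacheSketch {
  using value_type = T;
  static constexpr std::size_t start() {
    if constexpr (std::is_void_v<Offset>)
      return 0;  // no offset type: start at byte 0
    else
      return Offset::start() + sizeof(typename Offset::value_type);
  }
};

// the thread_array<int, 4> used inside computeStapleRectangle goes first
using Ops  = CacheSketch<std::array<int, 4>>;
// Stap is offset past the thread_array's storage
using Stap = CacheSketch<Link, Ops>;
// Rect's offset type is Stap, whose start() already includes Ops
using Rect = CacheSketch<Link, Stap>;
```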
Understood, thank you, it wasn't immediately clear to me from the implementation that it would "understand" recursive offsets (but I should have looked harder, too).
@maddyscientist Ah, I was just running the test suite, but I see the tests you mention aren't in it. Should they be added?
I think all the comments and issues we discussed are addressed now, and it passes my tests (except for the DW issues I've reported in #1410).
This is ready for review now.
This looks good to me. @weinbe2 can you sign off from your end?
Looks good! I left a few little comments but it should all be straightforward. I'm not sure if this PR still needs a clang-format.
I'm doing a quick ctest on my end to cover what the cscs run didn't get through, plus multigrid. Assuming there are no surprises (I'm not particularly worried, but famous last words) this will be good to go.
Did the Ls hotfix get merged in? I'm seeing some ctest failures for DWF, all of which look familiar. If it wasn't merged in yet, please merge develop back in and otherwise disregard what I have below.
If it has been merged in, though, it looks like we may have a fresh issue:
23 - invert_test_mobius_eofa_sym_single (Failed)
---
[ FAILED ] HeavyQuarkEvenOdd/InvertTest.verify/cg_mat_pc_normop_pc_single_l2_heavy_quark, where GetParam() = (0, 2, 3, 4, 1, 1, (-2147483648, -2147483648, -2147483648), 5)
[ FAILED ] HeavyQuarkEvenOdd/InvertTest.verify/cg_mat_pc_normop_pc_single_solution_accumulator_pipeline8_l2_heavy_quark, where GetParam() = (0, 2, 3, 4, 1, 8, (-2147483648, -2147483648, -2147483648), 5)
[ FAILED ] HeavyQuarkEvenOdd/InvertTest.verify/cg_mat_pc_normop_pc_half_l2_heavy_quark, where GetParam() = (0, 2, 3, 2, 1, 1, (-2147483648, -2147483648, -2147483648), 5)
[ FAILED ] HeavyQuarkEvenOdd/InvertTest.verify/cg_mat_pc_normop_pc_half_solution_accumulator_pipeline8_l2_heavy_quark, where GetParam() = (0, 2, 3, 2, 1, 8, (-2147483648, -2147483648, -2147483648), 5)
26 - invert_test_domain_wall_double (Failed)
---
[ FAILED ] 6 tests, listed below:
[ FAILED ] HeavyQuarkEvenOdd/InvertTest.verify/cg_mat_pc_normop_pc_double_l2_heavy_quark, where GetParam() = (0, 2, 3, 8, 1, 1, (-2147483648, -2147483648, -2147483648), 5)
[ FAILED ] HeavyQuarkEvenOdd/InvertTest.verify/cg_mat_pc_normop_pc_double_solution_accumulator_pipeline8_l2_heavy_quark, where GetParam() = (0, 2, 3, 8, 1, 8, (-2147483648, -2147483648, -2147483648), 5)
[ FAILED ] HeavyQuarkEvenOdd/InvertTest.verify/cg_mat_pc_normop_pc_single_l2_heavy_quark, where GetParam() = (0, 2, 3, 4, 1, 1, (-2147483648, -2147483648, -2147483648), 5)
[ FAILED ] HeavyQuarkEvenOdd/InvertTest.verify/cg_mat_pc_normop_pc_single_solution_accumulator_pipeline8_l2_heavy_quark, where GetParam() = (0, 2, 3, 4, 1, 8, (-2147483648, -2147483648, -2147483648), 5)
[ FAILED ] HeavyQuarkEvenOdd/InvertTest.verify/cg_mat_pc_normop_pc_half_l2_heavy_quark, where GetParam() = (0, 2, 3, 2, 1, 1, (-2147483648, -2147483648, -2147483648), 5)
[ FAILED ] HeavyQuarkEvenOdd/InvertTest.verify/cg_mat_pc_normop_pc_half_solution_accumulator_pipeline8_l2_heavy_quark, where GetParam() = (0, 2, 3, 2, 1, 8, (-2147483648, -2147483648, -2147483648), 5)
In all cases, it looks like it's very close to converging, representative:
Solution = mat_pc, Solve = normop_pc, Solver = cg, Sloppy precision = double
WARNING: CG: Restarting without reliable updates for heavy-quark residual (total #inc 1)
CG: Convergence at 71 iterations, L2 relative residual: iterated = 6.754295e-13, true = 6.754295e-13 (requested = 1.000000e-12), heavy-quark residual = 7.197024e-13 (requested = 1.000000e-12)
Done: 71 iter / 0.010148 secs = 38.7384 Gflops
Residuals: (L2 relative) tol 1.000000e-12, QUDA = 6.754295e-13, host = 1.091589e-12; (heavy-quark) tol 1.000000e-12, QUDA = 7.197024e-13
/scratch/local/quda/tests/invert_test_gtest.hpp:146: Failure
Expected: (rsd[0]) <= (tol), actual: 1.09159e-12 vs 1e-12
MG is good!
The Ls hotfix wasn't in yet. I just merged develop, so that should be there now.
This adds a dedicated thread-local cache object that can use shared memory for storage. It is distinct from SharedMemoryCache in that there is no sharing among threads, which simplifies the interface and means no sync operation is needed. Since sharing isn't needed, targets can choose not to use shared memory if that is advantageous. Note that thread_array could be replaced by this in the future, but that is not being done here.
TODO: add HIP version. I'll add HIP once the CUDA version is settled.