lattice / quda

QUDA is a library for performing calculations in lattice QCD on GPUs.
https://lattice.github.io/quda

"Attempting to bind too large a texture" #692

Closed · kostrzewa closed this 5 years ago

kostrzewa commented 6 years ago

It seems that I'm having some issues running on Piz Daint on a few nodes when device memory utilisation is high and local lattice edge lengths are large, due to errors of the following kind:

# QUDA: ERROR: Attempting to bind too large a texture 143327232 > 134217728 (rank 0, host nid06365, /users/bartek/code/quda_feature_multigrid/lib/cuda_gauge_field.cu:145 in createTexObject())
# QUDA:        last kernel called was (name=N4quda14ExtractGhostExIdLi18ELi4ELi3ENS_5gauge11FloatNOrderIdLi18ELi2ELi18EL20QudaStaggeredPhase_s0ELb1EEEEE,volume=48x48x48x26,aux=prec=8,stride=1437696,extract=0,dimension=3,geometry=4)

This is a simple double-half mixed CG run on a 48c96 lattice (twisted clover with dynamic clover inversion).

The corresponding run on 9 nodes reports device memory usage below 5 GB, which would mean that the problem should just about fit on 3 P100s, yet it only seems to fit on 6 nodes and above.

# QUDA: CG: Convergence at 29733 iterations, L2 relative residual: iterated = 2.733731e-10, true = 2.733731e-10
# QUDA: Done: 29733 iter / 187.209 secs = 8298.42 Gflops
# QUDA: time spent in reorder_spinor_fromQuda: 0.040400 secs
# Inversion done in 29733 iterations, squared residue = 1.162918e-11!
# Inversion done in 1.91e+02 sec. 
# QUDA: WARNING: Environment variable QUDA_PROFILE_OUTPUT_BASE is not set; writing to profile.tsv and profile_async.tsv
# QUDA: Saving 64 sets of cached parameters to /users/bartek/local/quda_resources/PizDaint-dynamic_clover-b87de9877010807be613550aac919bf7419d2b24_gdr1/profile_0.tsv
# QUDA: Saving 9 sets of cached profiles to /users/bartek/local/quda_resources/PizDaint-dynamic_clover-b87de9877010807be613550aac919bf7419d2b24_gdr1/profile_async_0.tsv
# QUDA: 
               initQuda Total time = 1.48045 secs
# QUDA:                   init     = 1.480448 secs (   100%), with        2 calls at 7.402240e+05 us per call
# QUDA:      total accounted       = 1.480448 secs (   100%)
# QUDA:      total missing         = 0.000000 secs (     0%)
# QUDA: 
          loadGaugeQuda Total time = 0.165181 secs
# QUDA:               download     = 0.158009 secs (  95.7%), with        1 calls at 1.580090e+05 us per call
# QUDA:                   init     = 0.002309 secs (   1.4%), with        1 calls at 2.309000e+03 us per call
# QUDA:                compute     = 0.004194 secs (  2.54%), with        1 calls at 4.194000e+03 us per call
# QUDA:                   free     = 0.000343 secs ( 0.208%), with        1 calls at 3.430000e+02 us per call
# QUDA:      total accounted       = 0.164855 secs (  99.8%)
# QUDA:      total missing         = 0.000326 secs ( 0.197%)
# QUDA: 
         loadCloverQuda Total time = 0.186086 secs
# QUDA:                   init     = 0.003703 secs (  1.99%), with        4 calls at 9.257500e+02 us per call
# QUDA:                compute     = 0.028532 secs (  15.3%), with        2 calls at 1.426600e+04 us per call
# QUDA:                  comms     = 0.151584 secs (  81.5%), with        1 calls at 1.515840e+05 us per call
# QUDA:                   free     = 0.000000 secs (     0%), with        1 calls at 0.000000e+00 us per call
# QUDA:      total accounted       = 0.183819 secs (  98.8%)
# QUDA:      total missing         = 0.002267 secs (  1.22%)
# QUDA: 
             invertQuda Total time = 187.398 secs
# QUDA:               download     = 0.032032 secs (0.0171%), with        1 calls at 3.203200e+04 us per call
# QUDA:                 upload     = 0.089983 secs ( 0.048%), with        1 calls at 8.998300e+04 us per call
# QUDA:                   init     = 0.005363 secs (0.00286%), with        1 calls at 5.363000e+03 us per call
# QUDA:               preamble     = 0.000001 secs (5.34e-07%), with        1 calls at 1.000000e+00 us per call
# QUDA:                compute     = 187.208740 secs (  99.9%), with        1 calls at 1.872087e+08 us per call
# QUDA:               epilogue     = 0.023570 secs (0.0126%), with        3 calls at 7.856667e+03 us per call
# QUDA:                   free     = 0.002248 secs (0.0012%), with        2 calls at 1.124000e+03 us per call
# QUDA:      total accounted       = 187.361937 secs (   100%)
# QUDA:      total missing         = 0.035611 secs ( 0.019%)
# QUDA: 
                endQuda Total time = 0.072951 secs
# QUDA: 
       initQuda-endQuda Total time = 218.594 secs
# QUDA: 
                   QUDA Total time = 189.302 secs
# QUDA:               download     = 0.190042 secs (   0.1%), with        2 calls at 9.502100e+04 us per call
# QUDA:                 upload     = 0.089983 secs (0.0475%), with        1 calls at 8.998300e+04 us per call
# QUDA:                   init     = 1.491471 secs ( 0.788%), with        8 calls at 1.864339e+05 us per call
# QUDA:               preamble     = 0.000000 secs (     0%), with        1 calls at 0.000000e+00 us per call
# QUDA:                compute     = 187.241468 secs (  98.9%), with        4 calls at 4.681037e+07 us per call
# QUDA:                  comms     = 0.151585 secs (0.0801%), with        1 calls at 1.515850e+05 us per call
# QUDA:               epilogue     = 0.023571 secs (0.0125%), with        3 calls at 7.857000e+03 us per call
# QUDA:                   free     = 0.002591 secs (0.00137%), with        4 calls at 6.477500e+02 us per call
# QUDA:      total accounted       = 189.190711 secs (  99.9%)
# QUDA:      total missing         = 0.111504 secs (0.0589%)
# QUDA: 
# QUDA: Device memory used = 4144.1 MB
# QUDA: Pinned device memory used = 0.0 MB
# QUDA: Page-locked host memory used = 542.5 MB
# QUDA: Total host memory used >= 542.6 MB

Is this a known problem? Am I forgetting about some fields? I've noticed (through use of nvidia-smi on our local cluster) that the device memory usage output is not always perfectly reliable, so maybe the problem does indeed require 6 cards for all fields and halos.

maddyscientist commented 6 years ago

The issue here is that the maximum texture size is 2^27 texels (where one texel is at most 128 bits). QUDA binds all arrays to textures, and these are used for accesses in the dslash kernels rather than loading directly from the arrays. This was never really a problem prior to 16 GiB cards, since the 2^27 texel limit was never exceeded for a real calculation, but it can be an issue with 16 GiB cards, and absolutely is a limiter on 32 GiB cards.

(Textures are specialized read-only caches used in real graphics applications, but they aren't really necessary anymore, especially with recent improvements in caching, e.g., Volta's first-class L1 cache.)
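
For illustration, here is a minimal sketch (not QUDA's actual createTexObject()) of binding a linear device buffer to a 1D texture object through the CUDA runtime API, with the 2^27-texel guard that the error message above reflects:

```cpp
#include <cstdio>
#include <cstring>
#include <cuda_runtime.h>

// Minimal sketch: bind a linear device buffer to a 1D texture object using
// 128-bit (int4) texels, guarding against the 2^27-texel hardware limit that
// produces the "Attempting to bind too large a texture" error above.
cudaTextureObject_t bind_linear_texture(void *device_ptr, size_t bytes)
{
  const size_t texels = bytes / sizeof(int4);   // one texel = 128 bits
  const size_t max_texels = size_t(1) << 27;    // = 134217728
  if (texels > max_texels) {
    fprintf(stderr, "texture too large: %zu > %zu texels\n", texels, max_texels);
    return 0;
  }

  cudaResourceDesc res_desc;
  memset(&res_desc, 0, sizeof(res_desc));
  res_desc.resType = cudaResourceTypeLinear;
  res_desc.res.linear.devPtr = device_ptr;
  res_desc.res.linear.desc = cudaCreateChannelDesc<int4>();
  res_desc.res.linear.sizeInBytes = bytes;

  cudaTextureDesc tex_desc;
  memset(&tex_desc, 0, sizeof(tex_desc));
  tex_desc.readMode = cudaReadModeElementType;

  cudaTextureObject_t tex = 0;
  cudaCreateTextureObject(&tex, &res_desc, &tex_desc, nullptr);
  return tex;
}
```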

The solution here is to disable the use of textures in QUDA. This is something that can almost be done at the moment, but I need to do a little more plumbing for this to work throughout the library. What I will likely do is expose the use of textures as a cmake option, and allow the user to opt out of textures.
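
As a rough sketch of what such an opt-out could look like (the macro and function names here are hypothetical, not QUDA's actual accessor code), a kernel load could be switched between a texture fetch and a direct global load at compile time:

```cpp
#include <cuda_runtime.h>

// Hypothetical compile-time switch of the kind a cmake option could control;
// illustrative only, not QUDA's actual accessor code.
#ifdef USE_TEXTURE_LOADS
__device__ inline double2 load_site(cudaTextureObject_t tex, const double2 *, int i)
{
  // doubles cannot be fetched from textures directly, so read a 128-bit int4
  // texel and reassemble the two doubles (a standard CUDA idiom)
  int4 v = tex1Dfetch<int4>(tex, i);
  return make_double2(__hiloint2double(v.y, v.x), __hiloint2double(v.w, v.z));
}
#else
__device__ inline double2 load_site(cudaTextureObject_t, const double2 *field, int i)
{
  return __ldg(&field[i]);  // direct read-only global load (sm_35 and later)
}
#endif
```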

So I won't get to this immediately, but I'll try to get this done within a few weeks....

kostrzewa commented 6 years ago

Thanks for the explanation Kate! With an eye towards backwards-compatibility and performance on older hardware (pre-Volta), would it not perhaps be possible to use multiple textures as cache for a single field?

maddyscientist commented 6 years ago

Multiple textures on a single field probably can't be done cleanly (not in any way I can think of, anyway). In theory, barring the for-free conversion from short to float that textures bring when using half precision, there should be very little reason to prefer textures over direct loads on any GPU after big Kepler (sm_35). So I don't think there's anything to worry about from having a no-texture option: sure, there may be corner cases where a kernel performs a little bit slower, but compared to the other overheads when strong scaling it's probably nothing to worry about (certainly nothing to worry about without hard data anyway).
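
The for-free conversion referred to here is the hardware's normalized-float read mode; a sketch (not QUDA code) of the two load paths for a 16-bit fixed-point (half precision) field:

```cpp
#include <cuda_runtime.h>

// Sketch of the "for-free" short -> float conversion textures provide for
// 16-bit fixed-point (half precision) fields; not QUDA's actual code.
__device__ inline float load_half_via_texture(cudaTextureObject_t tex, int i)
{
  // if the texture object was created with cudaReadModeNormalizedFloat over a
  // signed 16-bit channel, the hardware returns the value already scaled to [-1, 1]
  return tex1Dfetch<float>(tex, i);
}

__device__ inline float load_half_direct(const short *field, int i)
{
  // the direct-load path has to do the normalization explicitly
  return static_cast<float>(field[i]) * (1.0f / 32767.0f);
}
```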

kostrzewa commented 5 years ago

Just to keep up to date, is there any news on this issue?

maddyscientist commented 5 years ago

@kostrzewa this is being worked on as part of the ground-up dslash rewrite, which is happening in the feature/dwf-rewrite branch. Hopefully it will land in less than a month from now, depending on how much time I have to work on this.

kostrzewa commented 5 years ago

That's good to hear! It's currently somewhat of a limiting factor given the poor scaling that one observes on many machines.

kostrzewa commented 5 years ago

@maddyscientist I was wondering if there had been any progress just to keep up to date.

maddyscientist commented 5 years ago

@kostrzewa Sorry for the slow response. I was waiting until I had something solid to give you....

I'm pleased to report that support for large fields is now implemented for Wilson, Wilson-clover, twisted-mass (both singlet and doublet) and twisted-clover with my dslash rewrite. The other operators will come in the next few weeks.

All this work is still in the feature/dwf-rewrite branch (which is now extremely poorly named, since all the dslash operators are being rewritten in this branch and dwf hasn't been done yet). Anyway, if you want to test this, you just need to set the CMake advanced options QUDA_TEX=OFF and QUDA_LEGACY_DSLASH=OFF (the latter is already OFF by default). This will enable the use of direct loads without using textures at all. Initial testing of this seems to work, but I only just implemented it. Feel free to test and give feedback.

I'd estimate that we're looking at 1 month until this branch is merged into develop. Most of the work has been done, but we need to finish up the remaining operators and test this thoroughly, since this branch represents a significant change to QUDA. This branch will see QUDA approximately halve in size.

One interesting thing I've noticed with the twisted-clover rewrite, which I cannot explain yet: as we've discussed in the past, half precision twisted-clover had issues if dynamic clover wasn't used, and convergence would often break down. With my rewrite this isn't occurring, and convergence seems much more stable (@AlexVaq as an FYI).

kostrzewa commented 5 years ago

@maddyscientist That's excellent news. Will try it right away!

kostrzewa commented 5 years ago

Thanks a lot for this, you've effectively doubled our computing time budget! :)

I can also confirm that dynamic clover is no longer necessary for double-half twisted clover. While this doesn't quite offset the O(10-15%) slowdown that I see with the non-texture kernels in double-half, I would guess that it offsets some of it. In fact, in the non-texture code I now observed convergence issues with dynamic clover enabled in one case, which worked fine on some numbers of cards but not on others. Without dynamic clover, I don't seem to have any convergence issues anymore.

Also in double-single as well as in the MG (despite the half precision null vectors), I see no more than a 10% slowdown on P100.

CG

2 P100, double-single CG, textures ON, dynamic clover ON

# QUDA: CG: 8000 iterations, <r,r> = 5.424126e-12, |r|/|b| = 3.690134e-10
# QUDA: WARNING: Exceeded maximum iterations 8000
# QUDA: CG: Reliable updates = 3
# QUDA: CG: Convergence at 8000 iterations, L2 relative residual: iterated = 3.690134e-10, true = 3.690123e-10
# QUDA: Solution = 1.44016e+08
# QUDA: Reconstructed: CUDA solution = 2.08396e+08, CPU copy = 2.08396e+08
# QUDA: Done: 8000 iter / 54.724 secs = 1509 Gflops

2 P100, double-single CG, textures OFF, dynamic clover OFF

# QUDA: CG: 8000 iterations, <r,r> = 1.404947e-12, |r|/|b| = 1.878094e-10
# QUDA: WARNING: Exceeded maximum iterations 8000
# QUDA: CG: Reliable updates = 3
# QUDA: CG: Convergence at 8000 iterations, L2 relative residual: iterated = 1.878094e-10, true = 1.878096e-10 (requested = 3.162278e-11)
# QUDA: Solution = 1.41752e+08
# QUDA: Reconstructed: CUDA solution = 2.03877e+08, CPU copy = 2.03877e+08
# QUDA: Done: 8000 iter / 60.6986 secs = 1151.44 Gflops

MG

4 P100, double-single 3 level MG, textures ON, dynamic clover ON

GCR: 42 iterations, <r,r> = 3.302192e-14, |r|/|b| = 2.878669e-11
GCR: number of restarts = 2
GCR: Convergence at 42 iterations, L2 relative residual: iterated = 2.878676e-11, true = 2.878676e-11
Solution = 1.55175e+08
Reconstructed: CUDA solution = 2.30622e+08, CPU copy = 2.30622e+08
# QUDA: Done: 42 iter / 1.53866 secs = 1711.3 Gflops

4 P100, double-single 3 level MG, textures OFF, dynamic clover OFF

GCR: 41 iterations, <r,r> = 3.650660e-14, |r|/|b| = 3.026778e-11
GCR: number of restarts = 2
GCR: Convergence at 41 iterations, L2 relative residual: iterated = 3.026778e-11, true = 3.026778e-11 (requested = 3.162278e-11)
Solution = 1.54236e+08
Reconstructed: CUDA solution = 2.28804e+08, CPU copy = 2.28804e+08
# QUDA: Done: 41 iter / 1.61817 secs = 1433.26 Gflops

As a result, being able to run on fewer cards essentially doubles what one can achieve with a given budget. On 32 GB cards, I guess that the effect will be almost a factor of four, unless the different balance between CPU and GPU resources severely affects overall application efficiency.

kostrzewa commented 5 years ago

While the results on our cluster are nice, I'm facing divergence issues on Piz Daint even in plain double precision CG, with or without dynamic clover and with and without GDR. This is for a 64c128 lattice for which we employ 4D comms in a 2-2-2-8 (xyzt) topology, in contrast to the tests on our cluster, which refer to a 32c64 lattice with 1D comms and 1-1-1-2 or 1-1-1-4 topologies, so maybe some residual issues with the halos are the culprit. If I find some time, I will try to run a test on Piz Daint with a smaller lattice and the 1D topology as well.

maddyscientist commented 5 years ago

Hi @kostrzewa. That's great to hear that this is proving useful already. Thanks for the report on how this is doing.

The lack of convergence on Piz Daint, and also the occasional failure of dynamic clover on your home cluster, isn't good news though. Since I have an account on Piz Daint, if you can give me a simple reproducer for the issue, I could look at this directly and get it fixed. Certainly, I haven't tested this new branch at scale yet, only on my workstation, so it's definitely possible there are things to fix.

On the difference in performance with and without textures: in the comparisons I've done so far, and with recent optimization, I am not seeing any performance regressions for texture vs non-texture in the dslash kernels themselves, though I have not yet looked at the blas kernels nor the multigrid components. Generally I've found that on Pascal the performance is about tied, and on Volta there seems to be a slight performance advantage to not using textures. Can you send me the generated profile files from the non-texture and texture runs so I can see where any performance regressions are coming from? One thing I do have to fix is that at present the same tunecache entries will be used regardless of whether QUDA_TEX is enabled or not, so you need to make sure that you use a separate tunecache for these two compilation trajectories.

Moreover, for twisted-clover, using dynamic clover seems to be mostly a win in terms of performance versus non-dynamic (since the new dynamic clover kernels use direct forward/backward substitution as opposed to explicit inverse construction and multiplication). In other words, it might be worth trying non-textures with dynamic clover.

On the multigrid results, this is looking like a good solid speedup versus CG. Can you send me the parameters you are using here? There have been a variety of improvements in multigrid since we last spoke, so it would be good to get you using the latest and greatest smoothers, etc. E.g., the optimal smoother is now CA-GCR, which if configured correctly only involves a single reduction for the entire application.

maddyscientist commented 5 years ago

@kostrzewa I think the bad convergence issues should all have been fixed now (pushed to the feature/dwf-rewrite branch). There was indeed an issue with the non-texture builds, where the MPI communicators were erroneously being freed and reallocated while a dslash was communicating. This bug would have affected all non-texture builds, regardless of machine type (e.g., number of GPUs per node, or partitioning strategy), and the fact that it worked on small dense-node systems was not a statement of correctness, rather dumb luck.

Anyway, in fixing this bug I have extended the regression tests to run at arbitrary node counts and partitioning strategies, and this should help isolate any machine-specific issues going forward.

maddyscientist commented 5 years ago

One more thing, this bug would not just have been a correctness bug, but also a performance bug, since the communicators were being continuously reallocated. Hence any performance regressions you may have seen in comparing tex and non-tex builds may have simply been this.

I would further assert that optimal twisted-clover performance will be had when enabling dynamic-clover inversion, and any badness you saw with that was just the correctness bug I just fixed.

kostrzewa commented 5 years ago

@maddyscientist Thanks a lot for the updates! I will have some time on the 12th of February to take another look at whether the fixes have resolved the issues that I've seen. As for the test case, this was an inversion on the 64c128 ensemble at the physical point that I shared with you around the time of the hackathon, running on 64 nodes and, hopefully now with the changes, on 32 nodes. I can provide you with more details on the exact parameters that I used when I have a moment to relaunch the job with verbose output (since this goes through the tmLQCD interface, I can't straightforwardly relate it to the command line flags for invert_multigrid_test).

kostrzewa commented 5 years ago

As for CA-GCR, @pittlerf and I tested this some time ago (https://github.com/etmc/tmLQCD/commit/812212c063dc000dff65b9ce2f9dbb8c866e3a93) and were not able to see any benefit compared to MR. However, this was on our local cluster and I don't remember if we also did multi-node tests (where I would expect some benefit). We will certainly test this again also on Piz Daint.

maddyscientist commented 5 years ago

Great, please let me know how it all works. Hopefully well 😄. We have been dramatically increasing the testing coverage of the new code, across a variety of different MPI libraries and clusters, and I think we're in very good shape now.

On CA-GCR, there are a few parameters that you need to set to ensure it works well. The best I've found is to use CA-GCR(8) as a post smoother and use no pre smoothing. Also, when compiling QUDA, you'll need to ensure that QUDA_MAX_MULTI_BLAS_N (advanced CMake option) is set to N+1, where N is the size of the generated Krylov space (e.g., for CA-GCR(8), N=8): doing so means that a single fused kernel will handle the entire reduction needed for each Krylov space generation. Similarly, for the coarsest grid solver you can use CA-GCR(8), which dramatically reduces the cost of the coarsest grid solve, which is completely latency bound. I've found an overall 30-40% speedup in the overall multigrid solver from using this.

The main downside of increasing QUDA_MAX_MULTI_BLAS_N is the compilation time of multi_reduce_quda.cu and multi_blas_quda.cu. 😞
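
For reference, a sketch of how these suggestions might be expressed through QUDA's multigrid C interface (field names as in QudaMultigridParam in quda.h; verify them against your QUDA version):

```cpp
#include <quda.h>

// Sketch of the suggested settings via QUDA's multigrid C interface. Field
// names follow QudaMultigridParam in quda.h; verify against your QUDA version.
// Here nu_post is assumed to set the CA-GCR basis size, and the library should
// be built with -DQUDA_MAX_MULTI_BLAS_N=9 so that the N=8 Krylov-space
// reduction fuses into a single kernel.
void configure_ca_gcr_smoother(QudaMultigridParam &mg_param)
{
  for (int l = 0; l < mg_param.n_level; l++) {
    mg_param.smoother[l] = QUDA_CA_GCR_INVERTER;  // communication-avoiding GCR
    mg_param.nu_pre[l]   = 0;                     // no pre-smoothing
    mg_param.nu_post[l]  = 8;                     // CA-GCR(8) post-smoothing
  }
  // the coarsest-grid solve is latency bound, so CA-GCR helps the most there
  mg_param.coarse_solver[mg_param.n_level - 1] = QUDA_CA_GCR_INVERTER;
}
```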

kostrzewa commented 5 years ago

@maddyscientist Thank you, it all seems to work now! See further below for some timings from MG parameter scans. Going from 64 to 32 nodes for a 64c128 lattice at the physical point basically results in cost savings of a factor of 2! Performance on 64 nodes is comparable to the legacy Dslash, although it does seem a tiny bit slower (a few % at most).

We still have the problem that QUDA_ENABLE_GDR=1 leads to wrong residuals on Piz Daint with the MG and to a non-converging solver with CG. Since performance is quite good without GDR, we currently simply run without it. I guess this might be a Cray problem?

I also gave CA-GCR another try as a smoother. We don't use any pre-smoothing anyway as we find that it doesn't help at all for our lattices. In our case it seems to take us from 11.8 to 9.9 seconds on 32 nodes on the aforementioned lattice.

I've set up QUDA with QUDA_MAX_MULTI_BLAS_N=9 and set CA-GCR as the post-smoother, but I'm wondering if there's anything else that I need to do in order to make sure that what is used is actually CA-GCR(8) (i.e., do I need to modify inv_param.gcrNkrylov or inv_param.pipeline, or are there new parameters which control the N?).

So now to the timings from MG parameter scans to find optimum time to solution (tts).

These are inversions using clover twisted-mass at the physical point on a 64c128 lattice. No gauge compression is used, as we use theta boundary conditions in time in tmLQCD. Parallelisation is (xyzt) 1-1-4-8 on 32 nodes and 1-1-4-16 on 64 nodes. Aggregation parameters are (fine, intermediate):

X = (4,4) Y = (4,4) Z = (4,2) T = (4,2)

Most of the parameters listed below should be familiar. blockxy2 is the aggregation size on the intermediate level in the X and Y dimensions (I tested 2 and 4 and always obtain the lowest tts with the larger one). solvercagcr = 1 means that CA-GCR is used on the coarsest level. smoothercagcr = 3 means that CA-GCR is used as a smoother on all levels.
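
For concreteness, these aggregation sizes correspond to something like the following in QUDA's multigrid interface (a sketch; geo_block_size is indexed by level and dimension, names per quda.h, so verify against your version):

```cpp
#include <quda.h>

// Sketch: how the aggregation sizes quoted above map onto QudaMultigridParam.
// Level 0 -> 1 blocks are 4x4x4x4, level 1 -> 2 blocks are 4x4x2x2; blockxy2
// in the results below is the X/Y entry of the intermediate level.
void set_aggregation(QudaMultigridParam &mg_param)
{
  const int block[2][4] = { {4, 4, 4, 4},    // fine level: X, Y, Z, T
                            {4, 4, 2, 2} };  // intermediate level
  for (int l = 0; l < 2; l++)
    for (int d = 0; d < 4; d++)
      mg_param.geo_block_size[l][d] = block[l][d];
}
```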

Fastest tts without CA-GCR

- 32 nodes: n_level=3, coarsemu=50, setupiter=1000, setuptol=5e-07, nupre=0, nupost=4, smoothertol=0.2, coarsetol=0.3, coarseiter=500, omega=0.85, nvec=24, blockxy2=4, solvercagcr=0, smoothercagcr=0 → 80 iterations, tts = 11.748 s
- 64 nodes: n_level=3, coarsemu=50, setupiter=1000, setuptol=5e-07, nupre=0, nupost=4, smoothertol=0.2, coarsetol=0.3, coarseiter=500, omega=0.85, nvec=24, blockxy2=4, solvercagcr=0, smoothercagcr=0 → 78 iterations, tts = 9.65095 s

Fastest tts overall

- 32 nodes: n_level=3, coarsemu=50, setupiter=1000, setuptol=5e-07, nupre=0, nupost=4, smoothertol=0.4, coarsetol=0.3, coarseiter=500, omega=0.85, nvec=24, blockxy2=4, solvercagcr=1, smoothercagcr=3 → 75 iterations, tts = 9.93065 s
- 64 nodes: n_level=3, coarsemu=50, setupiter=1000, setuptol=5e-07, nupre=0, nupost=4, smoothertol=0.4, coarsetol=0.3, coarseiter=500, omega=0.85, nvec=24, blockxy2=4, solvercagcr=0, smoothercagcr=3 → 74 iterations, tts = 8.30854 s

I still need to test on our local cluster to see if the findings translate to a dense system.

kostrzewa commented 5 years ago

Although this issue is perhaps not the best venue for it, I have a question since we are discussing MG performance. I've found that performance depends quite dramatically on the chosen parallelisation, beyond what I would expect from packing and unpacking overheads.

As an example, when I compare the performance of the inversion above on 32 nodes with a parallelisation (xyzt) of 1-1-4-8 and of 1-1-2-16, the latter is slower by a factor of three. Now, given that this is a 64c128 lattice, 128 divided by 16 (MPI tasks in T), then by 4 (fine block size in T), then by 2 (intermediate block size in T) gives 1. Thus, there is only a single lattice site in the T dimension on the coarsest level in this case, whereas for 1-1-4-8 there are two lattice sites left on the coarsest level. Could that be the reason, or would you rather say that this is a Cray-specific issue? I haven't really explored this in depth on our dense-node cluster.

bjoo commented 5 years ago

Hi Bartosz, I am not the authoritative expert here, so what I write below may be nonsense, but I thought that the blocking needs to be so that the coarsest level is at least 2^4 and that it has to be even in every dim. So a coarsest local Lt=1 sounds suspect to me. In the output where the blockings are listed by QUDA, does it respect your blocking?

Best, B

kostrzewa commented 5 years ago

Hi Balint, I thought that as long as the innermost dimension on the coarsest lattice is even and as long as the total number of coarse lattice sites (per MPI rank) is even, one should be fine. If what you're saying is true, QUDA might indeed simply change the blocking silently (see below). The solver works fine, we check the residuals externally.

I have, however, also seen the same kind of parallelisation sensitivity in the other direction. I.e., if I parallelise 2-2-2-4 on 32 Piz Daint nodes with the same kind of blocking listed in https://github.com/lattice/quda/issues/692#issuecomment-463143472, the resulting algorithm is also about a factor of three slower than the fastest setup I could find with the 1-1-4-8 parallelisation. This difference does not seem to be affected much by the other tunable parameters (absolute performance is of course affected, but the relative difference between the two parallelisations does not seem to depend much on the other details). For the 2-2-2-4 case I explained this by the packing/unpacking overheads in X and Y, but I could not find a ready explanation for the 1-1-2-16 parallelisation.

In the output where the blockings are listed by QUDA, does it respect your blocking?

Good question. Since this is wrapped within tmLQCD, I guess I would need to adjust the verbosity levels for the MG setup to see this. We currently only set the verbosity of the outer solver and QUDA as a whole.

weinbe2 commented 5 years ago

I thought that the blocking needs to be so that the coarsest level is at least 2^4 and that it has to be even in every dim.

Aye, both parts of that are correct. QUDA doesn't support an odd value in any dimension.

If what you're saying is true, QUDA might indeed simply change the blocking silently (see below).

It does in some cases, though it'll print a warning: https://github.com/lattice/quda/blob/develop/lib/transfer.cpp

Lines 30 to 43; specifically, you can see it trying a smaller size on line 40. An example of where this will happen is if you have a 4^4 volume and then try to block by 4 in each dimension: it'll test the x dimension first, see that 4 [volume] / 4 [aggregation size] is 1, which isn't valid, so it'll reduce the aggregation size to 2, note that 4 [volume] / 2 [aggregation size] is 2, which is fine, and use that instead. Repeat this for every dimension, and QUDA ends up using an aggregation size of 2^4 even though you requested 4^4.

tl;dr: grep for WARNING :)
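
Schematically, the reduction described above amounts to something like this (a paraphrase of the behaviour, not the actual code in lib/transfer.cpp):

```cpp
// Paraphrase of the aggregation-size adjustment described above: the coarse
// local dimension must be even, so an aggregation size that would leave an odd
// number of coarse sites is reduced; if nothing fits, the dimension is simply
// not coarsened.
int valid_block_size(int local_dim, int requested)
{
  int block = requested;
  while (block > 1 && (local_dim % block != 0 || (local_dim / block) % 2 != 0))
    block /= 2;  // e.g. 4 sites / block 4 -> 1 coarse site (odd), so try 2
  return block;  // block == 1 means no coarsening in this dimension
}
```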

maddyscientist commented 5 years ago

Great to hear that things are mostly working now, and that 32 GPUs is working well for you. There are still some optimizations to go into the non-texture path for the blas routines in half precision, so some regressions when using half precision without textures are not completely unexpected. I'll look at this properly once all the dslash kernels have been rewritten in the new framework.

On GDR on Piz Daint: while our unit tests seem to be working perfectly with OpenMPI and GDR on an InfiniBand cluster, we're seeing failures on Piz Daint as well. Not solver failures; rather, after some arbitrary amount of time, MPI seems to break with an error that it cannot allocate more memory. So I think there is some bug there, whether in Cray's MPI or in QUDA I can't say at the moment, but it seems very robust on OpenMPI. We'll keep investigating at this end to determine whose fault it is. With a local volume of 32^4 per GPU, the performance should be good even without GDR.

For solver parameters, you could set the outer solver pipeline parameter to 8. This will reduce the reductions in the outer GCR, and it should help scaling as you increase Nkrylov. Depending on what value you have at the moment for the outer Krylov size, I'd suggest perhaps gcrNkrylov=30 and reliable_delta=1e-5 for the outer solver. Having a relatively large outer Krylov size may allow the convergence rate to increase, making it worthwhile.
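
In terms of the interface parameters already mentioned in this thread, that would look roughly like the following (a sketch; in kostrzewa's case these are set through the tmLQCD wrapper instead):

```cpp
#include <quda.h>

// The outer-solver settings suggested above, set directly on QudaInvertParam.
void tune_outer_gcr(QudaInvertParam &inv_param)
{
  inv_param.gcrNkrylov     = 30;    // larger outer Krylov space
  inv_param.pipeline       = 8;     // pipeline the reductions in the outer GCR
  inv_param.reliable_delta = 1e-5;  // reliable-update threshold
}
```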

On the unexpected performance for different partitionings: I think this is probably explained by QUDA changing block sizes behind the scenes. As @weinbe2 says, there will be warnings issued when it does this, and these should be printed as long as you don't have silent verbosity (which you may have for the coarse levels). The restriction is that each dimension must be even, and if a dimension can't be divided any further, then there will be no coarsening in that dimension. So 1x1x2x16 on 32 GPUs will have a local fine grid of size 64x64x32x8, a first coarse grid of size 16x16x8x2 and a second coarse grid of size 4x4x4x2. I imagine that for twisted-clover, where you have to spend a lot of time on the coarse grid compared to regular clover, this difference becomes quite significant. Moreover, there is the surface-to-volume difference on the fine grid: a 1x1x2x16 partitioning has more surface than a 1x1x4x8 partitioning ( (8 + 32) / (16 + 16), from the local t and z extents that set the communicated face areas). So I think this all makes sense.

For the 2x2x2x4 parallelisation: yes, although the communicated surface area is the same, this is expected to perform less well since it requires that we partition the X dimension, which leads to strided memory access patterns. There is also the additional latency of dealing with multiple dimensions. This is something that I have to work on to reduce the overheads, but it requires that I finish the new dslash framework first.

Ok, I think I answered all the questions, though let me know if I forgot something 😉

kostrzewa commented 5 years ago

@weinbe2 @maddyscientist Thanks a lot for all the explanations. I guess we have the warnings suppressed due to the verbosity level for the different MG levels, which we don't set (I presume it then defaults to silent). I will add some logic to our QUDA interface to make sure that a blocking which results in an odd local lattice dimension is rejected.

Having said all this, when I adjust the parallelisation on 64 nodes to 1x2x4x8, the solver on 32 nodes is still just as good, so the conclusion above still holds and we save about a factor of two using the non-texture pathway.

maddyscientist commented 5 years ago

Closed with #776