lattice / quda

QUDA is a library for performing calculations in lattice QCD on GPUs.
https://lattice.github.io/quda

No binary reproducibility with tuning turned on #182

Closed fwinter closed 9 years ago

fwinter commented 9 years ago

Qudanauts,

I am not seeing binary reproducibility when QUDA tuning is turned on. I may be wrong, but as far as I understand, one should get the same result (talking Delta H here) as long as one repeats the trajectory on the same machine partition and makes sure that every MPI rank gets the same coordinate within the machine grid across runs. (This eliminates possible differences due to non-associative floating-point addition across nodes.)

I tested my assumption and it seems to hold, but only if QUDA has finished tuning. To clarify: I run a short trajectory with only one (quite large) step and have Dslash tuning on:

1traj, 1step, QUDA (latest master, 1d31cbb7), tune on:

rm tunecache

1st run: Delta H = -0.154441864483488 After HMC trajectory call: time= 534.147396 secs

2nd run: Delta H = -0.149310405935012 After HMC trajectory call: time= 386.239995 secs

3rd run: Delta H = -0.149310405935012 After HMC trajectory call: time= 386.19768 secs

rm tunecache

4th run: Delta H = -0.153873271329758 After HMC trajectory call: time= 536.969036 secs

I am repeating the trajectory here to check Chroma + QDP-JIT/NVVM + QUDA correctness. You can see that once tuning has settled after the 1st run, Delta H stays constant from the 2nd to the 3rd run. Removing the tunecache file and thus forcing QUDA to tune again has an impact on Delta H: the Delta H value from the 4th run seems uncorrelated with all previous ones. Thus, it seems I get binary reproducibility only once QUDA no longer tunes (since all kernels are already tuned). I believe this is the only change between, e.g., the 3rd and 4th runs.

This was a 24^3x64 lattice on a 4 x K40m machine (1x1x1x4). I use the MPI rank number to determine the device number for QUDA. Since QMP also bases the calculation of the node coordinate on the rank number, these runs are completely comparable and should reproduce the same Delta H.
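
For illustration, a minimal sketch of the rank-to-device mapping described above, assuming 4 GPUs per node as in this run; the actual Chroma/QDP-JIT startup (QMP grid setup, gauge loading, trajectory) is only indicated by comments:

#include <mpi.h>
#include <quda.h>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);

  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  // One GPU per rank: derive the device id from the MPI rank so that a
  // repeated run on the same partition maps every rank to the same GPU.
  const int gpus_per_node = 4;  // assumption for this 1x1x1x4 K40m run
  initQuda(rank % gpus_per_node);

  // ... QMP grid setup, gauge/clover loading, HMC trajectory ...

  endQuda();
  MPI_Finalize();
  return 0;
}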

fwinter commented 9 years ago

Maybe I should say that calls are made to the multi-shift CG and the GCR both with the Clover operator.

mathiaswagner commented 9 years ago

Strange. Just curious what would happen for a 5th run. Could you repeat your run and maybe keep the first tunecache so we can check whether the tunecache changed? I guess that should be possible, and a change is expected.

One thing that might be the issue: when tuning, there is one last call. Is that call used for the further calculation, or is the kernel called once more with the result of the tuning? If not, it would explain the different result in the run that does the tuning.


rbabich commented 9 years ago

This is definitely a bug and probably indicates an oversight in the definition of a preTune() or postTune() somewhere.

Frank: If you have the patience (or the script-fu), deleting a single line at a time from tunecache.tsv should let you zero in on the problem function.
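
For example, a rough sketch of such a harness in C++; the run command ./run_trajectory.sh, the file names, and the three-line tunecache header are assumptions to adapt:

#include <cstddef>
#include <cstdlib>
#include <fstream>
#include <string>
#include <vector>

int main() {
  // Read the reference (fully tuned) tunecache once.
  std::ifstream in("tunecache_full.tsv");
  std::vector<std::string> lines;
  for (std::string l; std::getline(in, l); ) lines.push_back(l);

  const std::size_t header = 3;  // keep the tunecache header lines intact
  for (std::size_t skip = header; skip < lines.size(); ++skip) {
    // Write a tunecache identical to the reference except for one removed kernel line.
    std::ofstream out("tunecache.tsv");
    for (std::size_t i = 0; i < lines.size(); ++i)
      if (i != skip) out << lines[i] << '\n';
    out.close();

    // Run the same trajectory twice; if the two Delta H values in the logs
    // differ, re-tuning the removed kernel breaks reproducibility.
    std::string base = "run_" + std::to_string(skip);
    std::system(("./run_trajectory.sh > " + base + "_a.log 2>&1").c_str());
    std::system(("./run_trajectory.sh > " + base + "_b.log 2>&1").c_str());
  }
  return 0;
}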

mathiaswagner commented 9 years ago

Frank, can you also just post the tunecache file that the run created? In addition to Ron's suggestion that should make it easier to track down the issue.

bjoo commented 9 years ago

Hi All, just a comment from me. I've seen similar things on BlueWaters. I put it down to the fact that different runs occur under different conditions on the machine, so that the tuning may well come out different. This is especially important for the reductions. Over an HMC trajectory, even a short one, differences in rounding error can be blown up by the chaos in the MD. Typically I found that Delta H's obtained with different tuning files had differences on the order of delta(Delta H), in other words the reversibility violation in the MD, which sort of makes sense considering the error accumulation.

On the other hand, in a 4-GPU system with other users, perhaps the tunings between different runs should not be affected so much. Still, it may be that different runs produce different tunings depending on the system state.

Mike mentioned that perhaps doing the reductions with the special algorithm that gives them essentially exact accuracy could solve this issue? For me, I just took it as read that for binary exactness, i.e. to get the same arithmetic ordering etc., one had better use the same input config, the same input XML (including RNG seeds) and the same tuning file.

This doesn't mean there is not a bug, it's just that I have seen this elsewhere and I could kinda explain it to myself.

Best, B


fwinter commented 9 years ago

I did a 5th run (for completeness I am posting the 4th run again)

(rm tunecache)
(4th run: Delta H = -0.153873271329758 After HMC trajectory call: time= 536.969036 secs)
5th run: Delta H = -0.150873725817291 After HMC trajectory call: time= 388.751848 secs
6th run: Delta H = -0.150873725817291 After HMC trajectory call: time= 389.196652 secs

We see the same pattern as before in runs 1-3. We're not settling to the same Delta H as we settled to before, but I stress that we don't necessarily have to (this depends on how QUDA tunes local reductions, local meaning within one MPI process). I compared the tunecache files after each run. They are identical! Thus, once generated in the 4th run, the tunecache file is not altered anymore. Notice however that Delta H does change from run 4 to run 5. I have no explanation for that.

https://www.dropbox.com/s/u8tqfsipo2nbabv/tunecache_1.tsv

This seems weird to me. It looks as if, after the tunecache file was read in the 4th run, QUDA decided to re-tune a reduction kernel, while in the 5th and 6th runs (reading the same tunecache file) it decided not to do so and went with the cached values. That doesn't make sense to me.

mathiaswagner commented 9 years ago

Frank, does the (rm tunecache.tsv) line in your explanation mean that the 4th run started without an existing tunecache.tsv? The runtime seems to support that tuning was done in the 4th run. That would explain why Delta H changed again for the 5th run while runs 5 and 6 show the same Delta H.

fwinter commented 9 years ago

Mathias, the 4th run started with no tunecache file present. Given what I wrote I don't see how this can be misunderstood.

You correctly asserted that tuning was done in the 4th run and further concluded that this is the reason why Delta H changed in the 5th run. Please bear with me and share your line of argument because now it's me who doesn't follow.

fwinter commented 9 years ago

I think one has to distinguish between two types of tuning: one that affects binary reproducibility and one that doesn't. Both, of course, have an impact on performance, but the latter has an impact on performance only. An example of the latter would be tuning a saxpy or Dslash operation; an example of the former would be searching for the optimal hierarchy of recursive reductions for an operation like 'norm2'. For such operations the result is impacted by the outcome of the tuning through the non-associativity of floats, i.e. rounding errors.

If I look through the entries in the tunecache file and search for entries that look like reduction operations, I find things like:

12x24x24x16 N4quda22HeavyQuarkResidualNormI7double37double2S2_EE vol=110592,stride=110592,precision=8 96 1 1 60
12x24x24x16 N4quda30caxpbypzYmbwcDotProductUYNormYI7double37double2S2_EE vol=110592,stride=110592,precision=4,vol=110592,stride=110592,p
12x24x24x16 N4quda3DotId6float26float4EE vol=110592,stride=110592,precision=4 224 1 1 108 1 1 1792
12x24x24x16 N4quda4CdotI7double26float26float4EE vol=110592,stride=110592,precision=4 96 1 1 131 1 1
12x24x24x16 N4quda5Norm2Id7double2S1_EE vol=110592,stride=110592,precision=8 288 1 1 90 1 1 2304
12x24x24x16 N4quda8xmyNorm2Id7double2S1_EE vol=110592,stride=110592,precision=8 96 1 1 239 1 1 768
12x24x24x16 N4quda9CdotNormAI7double36float26float4EE vol=110592,stride=110592,precision=4 704 1 1 30 1

If in QUDA the tuning of reduction operations is done by searching for the highest-performance reduction scheme (testing different orders of writes to shared memory, varying the number of elements per reduction step, etc.), and if performance varies from run to run, then tuning will inevitably lead to unpredictable rounding errors in the result. These rounding errors will propagate through the MD and lead to differences in Delta H -- for sure.

On the other hand, if QUDA determines the hierarchy based on available shared memory only (keeping the order of writes to shared memory fixed), e.g. making maximal use of that memory in order to reduce the number of kernel calls, then tuning should have no impact on reproducibility. In that case I don't understand the differences in Delta H.
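
As a toy illustration of the first point, in plain host-side C++ and nothing QUDA-specific: summing the same data with different "block" groupings, which is effectively what a tuner choosing different launch geometries does for a two-level reduction, gives results that differ at the level of rounding. The block sizes below are just values of the kind that appear in the tunecache.

#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <random>
#include <vector>

// Sum `data` in chunks of `block` elements with single-precision partial sums,
// then accumulate the partials in double: a host-side stand-in for a
// two-level GPU reduction with a given block size.
double blockedSum(const std::vector<float> &data, std::size_t block) {
  double total = 0.0;
  for (std::size_t start = 0; start < data.size(); start += block) {
    float partial = 0.0f;
    for (std::size_t i = start; i < std::min(start + block, data.size()); ++i)
      partial += data[i];
    total += partial;
  }
  return total;
}

int main() {
  std::mt19937 rng(12345);
  std::uniform_real_distribution<float> dist(-1.0f, 1.0f);
  std::vector<float> data(1 << 22);  // a few million values, like a local field
  for (auto &x : data) x = dist(rng);

  // Different "tuned" block sizes give slightly different sums.
  for (std::size_t block : {96, 128, 224, 704})
    std::printf("block %4zu: sum = %.17g\n", block, blockedSum(data, block));
  return 0;
}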

mathiaswagner commented 9 years ago

Sorry, you were completely clear. I was only confused by your 'I have no explanation for that' and wanted to overcome my confusion.

Anyway, to go on: my guess is that there is an issue with calling the kernel with the right arguments after it has been tuned. This can, as Ron suggested, either be an issue with saving and restoring the arguments after tuning (in preTune and postTune) or something else. That is somewhat modified from my first guess in my reply on Saturday. As Ron suggested, removing some lines from the tunecache to force retuning of only some kernels might help to track down the kernel that causes the issue.

What might work is to sort the kernels by type (i.e. copy, blas, reduction, dslash, other) and remove them group-wise, or do a binary search, always forcing some of the kernels to be retuned (by providing a tunecache with those kernels removed). That requires some runs, but maybe with the suggested grouping we can track it down in three or four runs.

mathiaswagner commented 9 years ago

Regarding the types of kernels: I completely agree with you. But given that once the tuning is done you see reproducibility (as you see between runs 5 and 6), there is an issue when tuning is active which should not be there. We should be able to get the same result already in the tuning run (4) as for runs with an existing and complete tunecache (like runs 5 and 6).

fwinter commented 9 years ago

Mathias, precisely not! Tuning has finished after the 4th run. This can be inferred from the fact that the tunecache file does not change anymore. And even then: Delta H changes in the 5th run. This is contradictory to me.

Again: tuning was active in the 4th run. (This is obvious as no cache file was present.) One would assume that no further tuning happens in the 5th run; the fact that the cache file remains unaltered supports this. However, Delta H changed again in the 5th run! This seems to tell us that there was some tuning in the 5th run, but that this tuning result was never written out.

mathiaswagner commented 9 years ago

Just checked with MILC (HISQ), and although to a lesser extent than in Frank's example, I see similar effects over 3 runs (first run tuned):

delta S = 3.353286e-01
delta S = 3.353285e-01
delta S = 3.353285e-01

fwinter commented 9 years ago

I took the fully settled tunecache (the one after the 4th run) as a basis. (This file is available at https://www.dropbox.com/s/u8tqfsipo2nbabv/tunecache_1.tsv.) The tuning information for the individual kernels is located in this file from line 4 to line 47. Last night a script went through the file, removing one of the tuning lines at a time and running the same trajectory twice, logging the Delta H's. That is, in each run QUDA found a tunecache file identical to the original one except for one removed line. What we consider a bug here is when the Delta H from the 2nd run differs from that of the 1st run.

The first number gives the line that was removed, the 2nd and 3rd numbers the Delta H's, and the 4th number the difference.

4 -0.156215933383464 -0.156215933383464 0
5 -0.156215933383464 -0.156215933383464 0
6 -0.149102813020363 -0.148101734375814 .001001078644549
7 -0.156215933383464 -0.156215933383464 0
8 -0.156215933383464 -0.156215933383464 0
9 -0.156215933383464 -0.156215933383464 0
10 -0.156215933383464 -0.156215933383464 0
11 -0.156215933383464 -0.156215933383464 0
12 -0.156215933383464 -0.156215933383464 0
13 -0.156215933383464 -0.156215933383464 0
14 -0.156215933383464 -0.156215933383464 0
15 -0.156215933383464 -0.156215933383464 0
16 -0.156215933383464 -0.156215933383464 0
17 -0.156215933383464 -0.156215933383464 0
18 -0.156215933383464 -0.156215933383464 0
19 -0.156215933383464 -0.156215933383464 0
20 -0.156215933383464 -0.156215933383464 0
21 -0.156215933383464 -0.156215933383464 0
22 -0.156215933383464 -0.156215933383464 0
23 -0.156215933383464 -0.156215933383464 0
24 -0.153149313051927 -0.156215933383464 -.003066620331537
25 -0.146392231014943 -0.151106799716217 -.004714568701274
26 -0.156215933383464 -0.156215933383464 0
27 -0.151166121871938 -0.151947260187626 -.000781138315688
28 -0.156215933383464 -0.156215933383464 0
29 -0.156215933383464 -0.156215933383464 0
30 -0.156215933383464 -0.156215933383464 0
31 -0.156215933383464 -0.156215933383464 0
32 -0.151671888309011 -0.149546985912821 .002124902396190
33 -0.156215933383464 -0.156215933383464 0
34 -0.156215933383464 -0.156215933383464 0
35 -0.156215933383464 -0.156215933383464 0
36 -0.156215933383464 -0.156215933383464 0
37 -0.156431286200132 -0.150376085927746 .006055200272386
38 -0.154562502593762 -0.146359470893003 .008203031700759
39 -0.156215933383464 -0.156215933383464 0
40 -0.156215933383464 -0.156215933383464 0
41 -0.156215933383464 -0.156215933383464 0
42 -0.156215933383464 -0.156215933383464 0
43 -0.156215933383464 -0.156215933383464 0
44 -0.156215933383464 -0.156215933383464 0
45 -0.156215933383464 -0.156215933383464 0
46 -0.156215933383464 -0.156215933383464 0
47 -0.156215933383464 -0.156215933383464 0

Thus, there are 7 kernels which do not behave themselves when they are re-tuned:

12x24x24x16 N4quda11axpyCGNorm2I7double26float26float4EE vol=110592,stride=110592,precision=4 128 1 1 102 1 1 2048 # 80.62 Gflop/s, 161.23 GB/s, tuned Mon Dec 8 11:36:39 2014
12x24x24x16 N4quda30caxpbypzYmbwcDotProductUYNormYI7double37double2S2_EE vol=110592,stride=110592,precision=4,vol=110592,stride=110592,precision=4 32 1 1 134 1 1 1536 # 53.38 Gflop/s, 83.03 GB/s, tuned Mon Dec 8 11:39:40 2014
12x24x24x16 N4quda3DotId6float26float4EE vol=110592,stride=110592,precision=4 96 1 1 236 1 1 768 # 40.79 Gflop/s, 163.17 GB/s, tuned Mon Dec 8 11:36:32 2014
12x24x24x16 N4quda4CdotI7double26float26float4EE vol=110592,stride=110592,precision=4 96 1 1 134 1 1 1536 # 76.99 Gflop/s, 153.98 GB/s, tuned Mon Dec 8 11:39:05 2014
12x24x24x16 N4quda5Norm2Id7double2S1_EE vol=110592,stride=110592,precision=8 224 1 1 116 1 1 1792 # 39.33 Gflop/s, 157.33 GB/s, tuned Mon Dec 8 11:36:17 2014
12x24x24x16 N4quda8xmyNorm2Id7double2S1_EE vol=110592,stride=110592,precision=8 128 1 1 179 1 1 1024 # 22.19 Gflop/s, 177.53 GB/s, tuned Mon Dec 8 11:37:16 2014
12x24x24x16 N4quda9CdotNormAI7double36float26float4EE vol=110592,stride=110592,precision=4 704 1 1 30 1 1 16896 # 108.23 Gflop/s, 144.31 GB/s, tuned Mon Dec 8 11:39:15 2014

It looks to me like one of the following happens:

1) the data fields were not backed up/restored correctly before/after tuning
2) one of the results generated during tuning was used in the subsequent calculation, instead of running the payload kernel with the tuned geometry
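
For context, a much-simplified sketch of the backup/restore contract behind option (1) and the final payload launch behind option (2); these are stand-in types, not QUDA's actual Tunable/TuneParam classes:

#include <cstddef>
#include <cstring>
#include <vector>

// Stand-in types, only to show the contract the two hypotheses above refer to.
struct TuneParam { int block; };

class TunableAxpyNorm {
  float *y_;                  // the field this "kernel" overwrites
  std::size_t bytes_;
  std::vector<char> backup_;

public:
  TunableAxpyNorm(float *y, std::size_t bytes) : y_(y), bytes_(bytes) {}

  // Save everything the kernel writes before the trial launches...
  void preTune() { backup_.assign((char *)y_, (char *)y_ + bytes_); }
  // ...and restore it afterwards, so the trial launches leave no trace.
  void postTune() { std::memcpy(y_, backup_.data(), bytes_); }

  void apply(const TuneParam &tp) { (void)tp; /* launch with tp.block */ }
};

// The tuner's obligations: bracket the trial launches with preTune/postTune,
// and re-run the payload with the winning parameters so that only that
// result enters the subsequent calculation.
void tuneAndLaunch(TunableAxpyNorm &k, const std::vector<TuneParam> &candidates) {
  k.preTune();
  TuneParam best = candidates.front();  // timing/selection elided in this sketch
  for (const auto &tp : candidates) k.apply(tp);
  k.postTune();
  k.apply(best);  // the launch whose result is actually used
}

Missing a written field in preTune(), or skipping that final apply(best), would produce exactly the symptoms above.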

mathiaswagner commented 9 years ago

Thanks! Looks like as soon as any reduction is tuned in the active run it alters the result.

maddyscientist commented 9 years ago

I've just pushed a minor fix. I noticed that the TuneKey for the caxpbypzYmbwcDotProductUYNormY kernel listed single-precision twice (prec=4), instead of being a combination of both 4 and 8, since this is a double-single precision kernel. I don't think this affects this bug, but I mention it for completeness since the tune cache will now change slightly with this latest push.

At the same time, I made the backup and restoration cleaner: it now saves the entire field using the actual allocation size (before, it used a hack to work out this size, which had been put in as a workaround for Tesla compilation).

Anyway, it's probably worth retesting with respect to this bug: 979b7484d648d444e1ec3564e71070a612e9d2c3

One other thing that shouldn't affect reproducibility but should be mentioned: the auto-tuning is switched off by default in the library and is switched on when the inverter is first called. However, when the device interface is used, e.g., with QDPJIT, the prior loadGaugeQuda and loadCloverQuda interface functions also use kernels (which are not tuned by default). Thus if one does the following:

loadGaugeQuda(...);
loadCloverQuda(...);
invertQuda(...); // switch on tuning here
loadGaugeQuda(...);
loadCloverQuda(...);
invertQuda(...);

At the end of invertQuda, if there are any changes to the tune cache, it will be dumped to disk. Since the gauge and clover copy routines will not do auto-tuning until their second invocation, the tune cache will be updated after both invertQuda calls.

To rectify this without changing the interface, quda::setTuning(QUDA_TUNE_YES) should be called prior to loadGaugeQuda(). This will manually switch on tuning, ensuring that all kernels are tuned by the time the first invertQuda() is complete.
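
Concretely, the adjusted calling sequence is the same as the snippet above with one extra call up front (arguments elided as before):

quda::setTuning(QUDA_TUNE_YES); // enable tuning before the first load
loadGaugeQuda(...);             // gauge copy kernels are tuned here now
loadCloverQuda(...);            // clover copy kernels likewise
invertQuda(...);                // tune cache is complete after this call
loadGaugeQuda(...);
loadCloverQuda(...);
invertQuda(...);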

fwinter commented 9 years ago

As you already suspected, your changes didn't fix the issue.

With QUDA (979b):
rm tunecache
Delta H = -0.145473702239997
Delta H = -0.149591471757049

With QUDA (979b) and turning on QUDA tuning before loadGauge/Clover:
rm tunecache
Delta H = -0.15279128137945
Delta H = -0.15027522004948

mathiaswagner commented 9 years ago

As this issue seems to be tricky: do we have any idea whether this also appears in single-GPU runs? I tried on a small volume, but I assume the volume is too small to reliably trigger the issue.

mathiaswagner commented 9 years ago

I could read through the code, but one thing I just thought of: how is tuning handled in a multi-GPU setup? I assume only one MPI rank takes care of creating the tunecache? But do all ranks run the tuning? And if so, do they all use the same tune result?

So, the scenario that I think of is

Tuning with 2 GPUs: GPU0 finds best block size is 128. GPU1 finds best block size is 256. These sizes are then used for the final Kernel launch.

For the tunecache 128 is written to disk.

So, in the run without tuning we run with block size 128 on GPU0 and GPU1.

Mike, I guess you can answer that without digging through the code?

maddyscientist commented 9 years ago

I think this is it. Good deduction. Ron, can you comment on this, since it was you who wrote this?


mathiaswagner commented 9 years ago

I just looked through the tuneLaunch function and did not see any communication. Can we change the tuning so that it is only done on one GPU? Or even average over all ranks to increase the statistics?

maddyscientist commented 9 years ago

Having the different processes communicate to ensure the same block is used throughout is definitely something that should be done.

However, there is something we have to be careful about here. When doing domain decomposition, each GPU is solving a system independently of the others. The pathological case here is when one GPU doesn't even do any local solve (e.g., when doing DD on a point source, for the first few iterations some local domains will have zero support and so never enter the solver loop). So the GPUs can be executing different kernels simultaneously, and we cannot rely on being able to globally synchronize (which is why the global sums are switched off when doing tuning in tune.cpp, line 333).



mathiaswagner commented 9 years ago

Very important point. Maybe we need some kind of locking and should let only one GPU do the tuning? But can we, for now, force a communication over all GPUs just to make sure this is the error? A brute-force way would be to globally reduce elapsed_time, i.e. to average the elapsed time for a given launch configuration over all GPUs. That will break in cases like the one you described, but it would help to definitely nail down the issue.

maddyscientist commented 9 years ago

I'm pretty certain we (you) have nailed the issue; there's no question in my mind that this is a weakness that needs to be addressed.

In terms of the DD issue, we need a solution that is asynchronous and deterministic (these two things usually don't go hand in hand!). When a kernel is tuned for the first time (globally), we need to ensure that the result is broadcast everywhere once complete, so that when the same kernel is called elsewhere for the first time, the same value is used. Off the top of my head, a clear solution isn't coming to me, short of using one-sided communication.

Any ideas?
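
For reference, a rough sketch of what a one-sided exchange of tuned parameters could look like in plain MPI (not QUDA's comms layer): rank 0 owns a small registry, and any rank can fetch or publish a result under a passive-target lock without any collective call. The race where two ranks tune the same kernel at nearly the same time is ignored here, so this is illustrative only; a deterministic version would need something like MPI_Compare_and_swap on the valid flag.

#include <cstdio>
#include <mpi.h>

// One tuned launch configuration as a fixed-size record so it can live in an
// MPI window; "valid" flags whether any rank has published a result yet.
struct TuneEntry {
  int valid;
  int block_x, block_y, block_z;
  int grid_x;
  int shared_bytes;
};

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  // Rank 0 owns the registry (here just one entry); all ranks access it with
  // passive-target one-sided operations, so no global synchronization point
  // is required inside the solver.
  TuneEntry *registry = nullptr;
  MPI_Win win;
  MPI_Aint bytes = (rank == 0) ? sizeof(TuneEntry) : 0;
  MPI_Win_allocate(bytes, 1, MPI_INFO_NULL, MPI_COMM_WORLD, &registry, &win);
  if (rank == 0) registry->valid = 0;
  MPI_Barrier(MPI_COMM_WORLD);  // registry initialized before anyone reads it

  // Try to fetch an already-published result for this kernel.
  TuneEntry entry;
  MPI_Win_lock(MPI_LOCK_SHARED, 0, 0, win);
  MPI_Get(&entry, sizeof(TuneEntry), MPI_BYTE, 0, 0, sizeof(TuneEntry), MPI_BYTE, win);
  MPI_Win_unlock(0, win);

  if (!entry.valid) {
    // Nobody has published a result yet: tune locally (stand-in values here)
    // and publish it so later callers on any rank reuse the same parameters.
    entry = {1, 128, 1, 1, 102, 2048};
    MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 0, 0, win);
    MPI_Put(&entry, sizeof(TuneEntry), MPI_BYTE, 0, 0, sizeof(TuneEntry), MPI_BYTE, win);
    MPI_Win_unlock(0, win);
  }

  std::printf("rank %d launches with block %d\n", rank, entry.block_x);

  MPI_Win_free(&win);
  MPI_Finalize();
  return 0;
}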

mathiaswagner commented 9 years ago

Asynchronous really makes it complicated. Nothing I like comes to mind right now. Focusing only on reduction kernels also does not really help here.

maddyscientist commented 9 years ago

Unless an easy solution presents itself, I think we can be clear that this issue isn't going to be fixed for 0.7.0, as I want to release this very soon.

mathiaswagner commented 9 years ago

I am just testing a hack which will break down in the asynchronous case. Anyhow, if that works with MILC and also in Frank's case, we understand the issue, and since this has been around for a while (it should also be in the 0.6 release) we can go on. Maybe we should add it as a known issue?


maddyscientist commented 9 years ago

Sounds like a plan. As long as it doesn't cause a hang for DD (which is something Balint and I battled with for about a month when we first introduced the auto-tuner, before root-causing it to this divergence of execution between GPUs).

mathiaswagner commented 9 years ago

I don't want to get my hack into a release version, as I assume it will cause DD to hang. It seems to fix the issue in MILC. I will send the file to Frank for testing on his case. If the hack fixes the issue, we know what is going on and can think about ways to fix it properly.

mathiaswagner commented 9 years ago

So, I forced some communication after measuring the execution time in tune.cpp by using

  // average the measured time over all ranks so every rank settles on the same launch configuration
  double comm_time = elapsed_time;
  comm_allreduce(&comm_time);
  elapsed_time = float(comm_time / comm_size());

Frank just confirmed that this actually resolves the issues he found.

rm tunecache
Delta H = -0.147376751251613 After HMC trajectory call: time= 535.843235 secs
Delta H = -0.147376751251613 After HMC trajectory call: time= 388.536862 secs

But as Mike mentioned, we need a communication scheme that also works for DD. This is not an easy fix and we have to postpone it to beyond 0.7.

I suggest we add a comment to the README and create a new issue to discuss how to implement a non-blocking communication.

maddyscientist commented 9 years ago

> I suggest we add a comment to the README and create a new issue to discuss how to implement a non-blocking communication.

Mathias, can you take care of this? Thanks.

rbabich commented 9 years ago

There's a related problem that we should fix at the same time, described in this comment in tune.cpp:

//FIXME: We should really check to see if any nodes have tuned a kernel that was not also tuned on node 0, since as things
//       stand, the corresponding launch parameters would never get cached to disk in this situation.  This will come up if we
//       ever support different subvolumes per GPU (as might be convenient for lattice volumes that don't divide evenly).

mathiaswagner commented 9 years ago

Good one. I read the comment a while ago but forgot about it. I will include it in the follow-up issue.


mathiaswagner commented 9 years ago

Added comment in README. Further fixing of this issue in #199.