lattice / quda

QUDA is a library for performing calculations in lattice QCD on GPUs.
https://lattice.github.io/quda

Feature/tune rank #1359

Closed maddyscientist closed 1 year ago

maddyscientist commented 1 year ago

This branch adds the ability to do kernel tuning on a rank other than rank 0. There are two methods for setting the kernel tuning rank:

  1. Set the environment variable QUDA_TUNING_RANK; this sets the kernel tuning rank globally, for all kernels, to the rank specified. E.g., QUDA_TUNING_RANK=3 instructs QUDA to do all kernel tuning on rank 3.
  2. On a per-kernel basis, ~~using the new optional fourth argument to tuneLaunch~~ by specializing the virtual function int32_t Tunable::getTuningRank() (a standalone sketch follows below).
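
For illustration, a standalone sketch of the per-kernel mechanism; the stand-in base class and every name other than getTuningRank() are hypothetical, not QUDA's actual classes or helpers:

    // Standalone C++ sketch: a stand-in base class with a virtual getTuningRank()
    // (mirroring the int32_t Tunable::getTuningRank() described above) and a derived
    // "kernel" that tunes on the rank owning its time slice. All names apart from
    // getTuningRank() are hypothetical; the rank arithmetic assumes the t dimension
    // varies slowest in the rank ordering.
    #include <cstdint>
    #include <cstdio>

    struct TunableSketch {                          // stand-in for quda::Tunable
      virtual int32_t getTuningRank() { return 0; } // default: tune on rank 0
      virtual ~TunableSketch() = default;
    };

    struct TimeSliceSmearSketch : TunableSketch {
      int t0;          // global time coordinate of the source
      int local_t;     // local temporal extent per rank
      int ranks_per_t; // number of ranks in each t-partition (x*y*z grid size)

      TimeSliceSmearSketch(int t0, int local_t, int ranks_per_t) :
        t0(t0), local_t(local_t), ranks_per_t(ranks_per_t) { }

      // tune on the lowest rank whose local t range contains t0
      int32_t getTuningRank() override { return (t0 / local_t) * ranks_per_t; }
    };

    int main()
    {
      TimeSliceSmearSketch k(50, 32, 1); // 64 time slices split over 2 ranks, source at t = 50
      printf("tuning rank = %d\n", (int)k.getTuningRank()); // -> 1
    }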

~~Setting this PR as draft for now, until it is confirmed this functionality meets the requirements of time-slice-specific tuning.~~

It should also be noted that this PR fixes a bug in staggered_dslash_ctest.cpp, where the test_split_grid variable was uninitialized, leading to undefined behavior when different ranks ended up with different random values for this variable.

maddyscientist commented 1 year ago

@alexstrel / @hwancheolJeong can you test this branch with the gauge smearing kernel, and confirm that this works as expected? I expect what you will want to do is pass to tuneLaunch a rank value that corresponds to a rank containing the relevant time slice.

alexstrel commented 1 year ago

Thanks @maddyscientist for the feature. @hwancheolJeong, the line https://github.com/lattice/quda/blob/feature/tune-rank/lib/staggered_quark_smearing.cu#L33 now has to be corrected.

hwancheolJeong commented 1 year ago

Thanks, @maddyscientist. And sorry for the long silence. I was on vacation (in Korea) and was sick for a while. Although I couldn't respond, I read comments from you and @weinbe2. I will take care of them one by one. (I also appreciate you and @weinbe2 for taking care of and approving the merge.)

I ran some quick tests with this branch on my home server with two K80s. On a 24^3 x 64 lattice, I smeared two time-wall sources, at t=50 and t=22, in that order. With the node geometry (x,y,z,t)=(1,1,1,2), these two sources were handled by different ranks: rank 1 for t=50 and rank 0 for t=22. Note that the tuning was done with the t=50 source.

I tried the following three cases:

  1. Without setting QUDA_TUNING_RANK and without a change in code
  2. With QUDA_TUNING_RANK=1 and without a change in code
  3. Without setting QUDA_TUNING_RANK and with appending the input 1 to tuneLaunch() in the code

All of them gave the correct result (smeared source). I deleted the tunecache before each run. Although I couldn't observe a noticeable difference in running time, the saved tunecache indicated the fix worked (if I interpret it correctly). The following are the corresponding lines from their tunecache.tsv files.

  1. Without setting QUDA_TUNING_RANK and without a change in code:

     24x24x24x32        N4quda15StaggeredQSmearINS_18StaggeredQSmearArgIdLi1ELi3ELi4EL21QudaReconstructType_s18EEEEE    policy,commDim=0000,parity=2,staggered_qsmear=3,tslice_kernel=1,topo=1112,order=10,p2p=1,gdr=0,nvshmem=0,pol=110000111100110000  32      1       1       1       1       1       0       1       1       0       -1      8.29082e-06     # 710.31 Gflop/s, 2000.86 GB/s, tuning took 0.012458 seconds at Fri Feb 10 14:28:49 2023
     24x24x24x32        N4quda15StaggeredQSmearINS_18StaggeredQSmearArgIdLi1ELi3ELi4EL21QudaReconstructType_s18EEEEE    policy_kernel=interior,comm=0001,commDim=0000,parity=2,staggered_qsmear=3,tslice_kernel=912      1       1       8       1       2       24577   -1      -1      -1      -1      6.95583e-06     # 846.63 Gflop/s, 2384.88 GB/s, tuning took 0.036561 seconds at Fri Feb 10 14:28:49 2023
  2. With QUDA_TUNING_RANK=1 and without a change in code:

     24x24x24x32        N4quda15StaggeredQSmearINS_18StaggeredQSmearArgIdLi1ELi3ELi4EL21QudaReconstructType_s18EEEEE    policy,commDim=0000,parity=2,staggered_qsmear=3,tslice_kernel=1,topo=1112,order=10,p2p=1,gdr=0,nvshmem=0,pol=110000111100110000  32      1       1       1       1       1       0       6       1       0       -1      0.000100662     # 58.50 Gflop/s, 164.80 GB/s, tuning took 0.017047 seconds at Fri Feb 10 14:33:28 2023
     24x24x24x32        N4quda15StaggeredQSmearINS_18StaggeredQSmearArgIdLi1ELi3ELi4EL21QudaReconstructType_s18EEEEE    policy_kernel=interior,comm=0001,commDim=0000,parity=2,staggered_qsmear=3,tslice_kernel=64       1       1       108     1       2       8193    -1      -1      -1      -1      9.98368e-05     # 58.99 Gflop/s, 166.16 GB/s, tuning took 0.247901 seconds at Fri Feb 10 14:33:28 2023
  3. Without setting QUDA_TUNING_RANK and with appending the input 1 to tuneLaunch() in the code:

     24x24x24x32        N4quda15StaggeredQSmearINS_18StaggeredQSmearArgIdLi1ELi3ELi4EL21QudaReconstructType_s18EEEEE    policy,commDim=0000,parity=2,staggered_qsmear=3,tslice_kernel=1,topo=1112,order=10,p2p=1,gdr=0,nvshmem=0,pol=110000111100110000  32      1       1       1       1       1       0       0       1       0       -1      8.61888e-06     # 683.27 Gflop/s, 1924.70 GB/s, tuning took 0.012781 seconds at Fri Feb 10 14:46:14 2023
     24x24x24x32        N4quda15StaggeredQSmearINS_18StaggeredQSmearArgIdLi1ELi3ELi4EL21QudaReconstructType_s18EEEEE    policy_kernel=interior,comm=0001,commDim=0000,parity=2,staggered_qsmear=3,tslice_kernel=64       1       1       108     1       2       8193    -1      -1      -1      -1      0.000100778     # 58.44 Gflop/s, 164.61 GB/s, tuning took 0.248454 seconds at Fri Feb 10 14:46:14 2023

Note that only the second tuning entry was fixed for the third case when the fix was applied in the code. I guess this is because I modified tuneLaunch() only in StaggeredQSmear::apply(), and the first tuning was called somewhere else?

I will run these tests on a bigger machine (like Big Red 200) with a more complicated node geometry. In addition, I will think about a proper (general) way of setting the input rank value for tuneLaunch().

maddyscientist commented 1 year ago

Thanks for the update @hwancheolJeong.

hwancheolJeong commented 1 year ago

I modified the StaggeredQSmear kernel to accommodate this update: (https://github.com/hwancheolJeong/quda/commit/a51dc8736aaecd767c764efc41e8db9a1c1b2b7d). (It is in my fork.) Where should I push this commit: to this branch, to the develop branch after this branch is merged, or to a new branch with a separate PR after this branch is merged into develop?

I confirmed it worked correctly on two nodes of Summit. I split a 32^3 x 96 lattice into (x,y,z,t)=(1,2,2,3) and put two time-wall sources at t=90 and t=50. The smeared sources were correct, and I observed similar corrections in the tunecache as I reported above. (Let me know if you want to see the details.)

In the tests, I printed the tune_rank value for each rank (shown in square brackets). For t0=90,

[9] tune_rank = 8
[2] tune_rank = 8
[6] tune_rank = 8
[11] tune_rank = 8
[1] tune_rank = 8
[10] tune_rank = 8
[3] tune_rank = 8
[7] tune_rank = 8
[5] tune_rank = 8
[8] tune_rank = 8
[4] tune_rank = 8
[0] tune_rank = 8

For t0=50,

[6] tune_rank = 4
[0] tune_rank = 4
[10] tune_rank = 4
[4] tune_rank = 4
[7] tune_rank = 4
[3] tune_rank = 4
[8] tune_rank = 4
[5] tune_rank = 4
[11] tune_rank = 4
[2] tune_rank = 4
[9] tune_rank = 4
[1] tune_rank = 4
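
For reference, these tune_rank values are consistent with picking the lowest rank that owns the source time slice. A small standalone check of that arithmetic (it assumes an x-fastest, t-slowest rank ordering, which is an assumption here rather than something stated in this thread):

    // Standalone check (assumes an x-fastest, t-slowest rank ordering; an assumption
    // here, not something verified against QUDA) that tune_rank = 8 for t0 = 90 and
    // tune_rank = 4 for t0 = 50 correspond to the lowest rank owning each slice.
    #include <cstdio>

    int main()
    {
      const int grid[4] = {1, 2, 2, 3}; // process grid (x, y, z, t)
      const int Lt = 96;                // global temporal extent
      const int local_t = Lt / grid[3]; // 32 time slices per t-partition

      for (int t0 : {90, 50}) {
        int t_coord = t0 / local_t;                           // which t-partition owns t0
        int min_rank = grid[0] * grid[1] * grid[2] * t_coord; // lowest rank in that partition
        printf("t0 = %d -> t-partition %d, lowest owning rank %d\n", t0, t_coord, min_rank);
      }
    }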

Let me know if you have other ideas for me to test.

I have one question. In tuneLaunch() (https://github.com/lattice/quda/blob/feature/tune-rank/lib/tune.cpp#L796), shouldn't we check if tune_rank has the same value for all MPI ranks?

maddyscientist commented 1 year ago

> I modified the StaggeredQSmear kernel to accommodate this update: (hwancheolJeong@a51dc87). (It is in my fork.) Where should I push this commit: to this branch, to the develop branch after this branch is merged, or to a new branch with a separate PR after this branch is merged into develop?

Since this feature is being added with the quark smearing in mind, I think it's ok to add your changes to this branch. So please go ahead and do that.

> I have one question. In tuneLaunch() (https://github.com/lattice/quda/blob/feature/tune-rank/lib/tune.cpp#L796), shouldn't we check if tune_rank has the same value for all MPI ranks?

That's an excellent point. I have added such a check in 841850f.
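
For illustration (this is not the actual QUDA implementation in 841850f), a minimal MPI sketch of such a consistency check, using a hypothetical function name:

    // Minimal MPI sketch (not QUDA code): verify that every rank passed the same
    // tuning rank by comparing min and max reductions, aborting on a mismatch.
    #include <mpi.h>
    #include <cstdint>
    #include <cstdio>

    void check_tune_rank_consistent(int32_t tune_rank, MPI_Comm comm)
    {
      int32_t lo = tune_rank, hi = tune_rank;
      MPI_Allreduce(MPI_IN_PLACE, &lo, 1, MPI_INT32_T, MPI_MIN, comm);
      MPI_Allreduce(MPI_IN_PLACE, &hi, 1, MPI_INT32_T, MPI_MAX, comm);
      if (lo != hi) {
        fprintf(stderr, "tune_rank differs across ranks (min=%d, max=%d)\n", (int)lo, (int)hi);
        MPI_Abort(comm, 1);
      }
    }

    int main(int argc, char **argv)
    {
      MPI_Init(&argc, &argv);
      check_tune_rank_consistent(1, MPI_COMM_WORLD); // all ranks pass the same value -> passes
      MPI_Finalize();
    }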

hwancheolJeong commented 1 year ago

Thanks for adding the int32_t instantiation of comm_allreduce_min! I adjusted my code to use it and pushed the commit to this branch.

maddyscientist commented 1 year ago

Your code looks good @hwancheolJeong. One comment though: I think this will cause the tuning rank to be computed every time, even after tuning is done, which means additional synchronization and communication. This might be a chicken-and-egg situation, where we don't know if we've tuned until we've called tuneLaunch, but we can't call tuneLaunch until we compute the tuning rank.

Thinking out loud: we need to defer execution of the tuning-rank computation until we are in the body of the tuning, e.g., https://github.com/lattice/quda/blob/feature/tune-rank/lib/tune.cpp#L796. Maybe instead of passing an integer tuning rank, we should pass a functor / lambda that computes the tuning rank when called. Any thoughts?
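
A standalone illustration of the deferral idea, with made-up names rather than QUDA's actual tuneLaunch signature:

    // Standalone sketch of deferring the tuning-rank computation: the callable is
    // only invoked on the call that actually tunes, so already-tuned launches skip
    // the (potentially collective) rank computation entirely. Names are made up.
    #include <cstdint>
    #include <functional>
    #include <cstdio>

    static bool tuned = false; // stands in for "entry already in the tunecache"

    int32_t compute_tuning_rank()
    {
      printf("computing tuning rank (allreduce would happen here)\n");
      return 1;
    }

    void tune_launch(const std::function<int32_t()> &get_tuning_rank)
    {
      if (tuned) return;                     // fast path: no communication
      int32_t tune_rank = get_tuning_rank(); // deferred: only when tuning runs
      printf("tuning on rank %d\n", (int)tune_rank);
      tuned = true;
    }

    int main()
    {
      tune_launch(compute_tuning_rank); // first call: computes the rank and tunes
      tune_launch(compute_tuning_rank); // subsequent calls: no rank computation
    }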

maddyscientist commented 1 year ago

@hwancheolJeong I've changed the tuneLaunch interface to take a std::function that computes the tuning rank, rather than the raw tuning rank itself. This should avoid the overhead I mentioned above. I've updated the quark smearing kernel for this new interface. Let me know what you think.

One other question (also maybe for @alexstrel): how do I run the tests you are using? When I run tests/staggered_gsmear_test this kernel isn't being called. How can I test this kernel using the internal tests?

maddyscientist commented 1 year ago

Another update: I've now addressed the issue where the dslash policy tuning would be done on a different rank (default 0) than the kernels. The DslashArg class now declares a virtual getTuningRank() method that returns the default tuning rank, which derived classes can override. It is this function that is passed to the tuneLaunch call in dslash_policy.cuh, ensuring that the kernel tuning rank also drives the policy tuning.

alexstrel commented 1 year ago

@maddyscientist Application of the smearing operator for a fixed t0 is not supported in the test application; I missed that issue. (Working on enabling the option.)

hwancheolJeong commented 1 year ago

@maddyscientist Sorry for the slow response. I wanted to give it more thought. First of all, thanks for pointing this out! I was also concerned about those extra communications, since I observed a lot of output when I printed the tune_rank values in my previous test.

I am afraid I don't understand the advantage of your idea of using the function pointer. Doesn't it still need to run the allreduce routine (getTuningRank()) for every call? Could you give me some hints? I will try running tests with it anyway.

I use my own test code (https://github.com/hwancheolJeong/milc_qcd/tree/feature/gaussianSmearing/ks_gauss_smear), which makes use of my MILC code interface (https://github.com/hwancheolJeong/milc_qcd/blob/feature/gaussianSmearing/generic_ks/gauss_smear_ks_QUDA.c). I haven't used QUDA's internal test code.

Here is an "idea" to avoid the extra redundant MPI calls: https://github.com/hwancheolJeong/quda/commit/2cde966335114fdedaaaf53118ea13e723b705cb (in my fork). It adds a flag tuneRankReset and two functions handling it. The allreduce part (or getTuningRank() in your update) will be executed only when tuneRankReset is enabled (=1) or tune_rank has never been set. And the determined tune_rank is saved in a static member tune_rank_static for future reuse. It works correctly:

[0] check tune_rank.
[1] check tune_rank.
[0] check tune_rank.
[1] check tune_rank.
...
[0] check tune_rank.
[1] check tune_rank.
[0] comm_allreduce_min is called.
[0] tune_rank = 1
[1] comm_allreduce_min is called.
[1] tune_rank = 1
[1] check tune_rank.
[0] check tune_rank.
[1] tune_rank = 1
[0] tune_rank = 1
[0] tune_rank = 1
[1] tune_rank = 1
...
[0] tune_rank = 1
[1] tune_rank = 1
[GS_TIME] QUDA two-link Gaussian smearing: time = 0.103655 s, iters = 50
Wrote source for color 2 time slice 50
[1] comm_allreduce_min is called.
[1] tune_rank = 0
[0] comm_allreduce_min is called.
[0] tune_rank = 0
[1] tune_rank = 0
[0] tune_rank = 0
...

comm_allreduce_min is called only at the first iteration, and the tune_rank value is stored for the rest of the iterations (although they don't run tuning). The tune_rank is then re-computed and changes at the second source smearing (again, even though it does not run tuning). In the meantime, tuneLaunch()'s tune_rank consistency check is executed only when the actual tuning is run.

However, it is a little tricky, so I am not sure if it is acceptable to QUDA's policy.
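
A standalone sketch of the caching idea described above, with hypothetical names; the actual change is in the fork commit linked earlier:

    // Hypothetical standalone sketch of the cached-tuning-rank idea: recompute the
    // rank only when a reset was requested (e.g. for a new source) or it has never
    // been set; otherwise reuse the cached value and skip the allreduce entirely.
    #include <cstdint>
    #include <functional>

    struct TuneRankCache {
      static inline int32_t tune_rank_static = -1; // -1 means "never computed"
      static inline bool tune_rank_reset = true;

      static void request_reset() { tune_rank_reset = true; }

      static int32_t get(const std::function<int32_t()> &compute_rank)
      {
        if (tune_rank_reset || tune_rank_static < 0) {
          tune_rank_static = compute_rank(); // the collective call happens only here
          tune_rank_reset = false;
        }
        return tune_rank_static;
      }
    };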

hwancheolJeong commented 1 year ago

@maddyscientist I synced your commits, and now I see your idea. The getTuningRank() function is called only when an actual tuning is performed. That's why you changed the input to a function pointer. I think it should work fine and is better and clearer than my idea. I will give it more thought and leave my feedback again with the test result.

hwancheolJeong commented 1 year ago

Here is the test result using the function-pointer approach. I only added some output, similar to before.

[0] check tune_rank. tune_rank = 0
[1] check tune_rank. tune_rank = 0
...
[0] check tune_rank. tune_rank = 0
[1] check tune_rank. tune_rank = 0
[0] comm_allreduce_min is called. tune_rank = 1
[0] check tune_rank. tune_rank = 1
[1] comm_allreduce_min is called. tune_rank = 1
[1] check tune_rank. tune_rank = 1
[0] comm_allreduce_min is called. tune_rank = 1
[1] comm_allreduce_min is called. tune_rank = 1
[0] check tune_rank. tune_rank = 0
[1] check tune_rank. tune_rank = 0
[0] check tune_rank. tune_rank = 0
[1] check tune_rank. tune_rank = 0
...

It worked correctly. One difference from my idea is that comm_allreduce_min (from getTuningRank()) is called one more time here for the first source (because tuning was done twice). But it didn't call getTuningRank() for the second source at all, because it didn't need to run tuning again. I thought more about it but couldn't find any problem or disadvantage. Rather, it is safer than my idea, where developers would have to control the tune_rank carefully. I think it's good. :)

maddyscientist commented 1 year ago

Thanks @hwancheolJeong for the thoughts and tests. I realized that a cleaner solution is to make getTuningRank() a virtual function of the Tunable class, and then override it as needed in the child classes. Since the Tunable class instance is already being passed to the tuneLaunch function, this is a more natural solution.

@alexstrel I see you've pushed some changes to enable the t-slice option. When you've finished your edits, can you also add the necessary tests to the ctest list, please?

hwancheolJeong commented 1 year ago

@maddyscientist I like this update. xD It's much cleaner. I verified that it works correctly.

BTW, I had a chance to take a look at the internal staggered_gsmear_test. (You might have already figured this out.) If you just want to run a simple test of the t-slice feature, you only need to change the source type to a point source and set t0 to the source's time coordinate.

https://github.com/lattice/quda/blob/feature/tune-rank/tests/staggered_gsmear_test_utils.h#L270

      //spinor.Source(QUDA_RANDOM_SOURCE);                      // replaced by a point source
      int x[4] = { 0, 0, 0, 20 };                                // source position, t = 20
      int X[4] = { Z[0], Z[1], Z[2], Z[3] };
      spinor.Source(QUDA_POINT_SOURCE, linkIndex( x, X ), 0, 0);

https://github.com/lattice/quda/blob/feature/tune-rank/tests/staggered_gsmear_test_utils.h#L330

      //qsm_param.t0 = -1;
      qsm_param.t0 = 20;

(I had to include index_helper.cuh to use linkIndex().) An example execution command is

$ mpirun -np 2 ./staggered_gsmear_test --test GaussianSmear --laplace3D 3 --dim 8 8 8 16 --gridsize 1 1 1 2

maddyscientist commented 1 year ago

Marking this PR as ready for review. In addition to the description above, this PR also:

maddyscientist commented 1 year ago

And I have now updated the PR description to match the final design route taken for specifying the tuning rank, i.e., overriding the virtual method int32_t Tunable::getTuningRank().

weinbe2 commented 1 year ago

General note, please add QUDA_ENABLE_TUNING_SHARED and QUDA_TUNING_RANK to the environment variable wiki page.

maddyscientist commented 1 year ago

> General note, please add QUDA_ENABLE_TUNING_SHARED and QUDA_TUNING_RANK to the environment variable wiki page.

done