glotzerlab / hoomd-blue

Molecular dynamics and Monte Carlo soft matter simulation on GPUs.
http://glotzerlab.engin.umich.edu/hoomd-blue
BSD 3-Clause "New" or "Revised" License

CUDA 9+Volta compatibility #292

Closed: joaander closed this issue 6 years ago

joaander commented 6 years ago

Original report by Michael Howard (Bitbucket: mphoward, GitHub: mphoward).


Compiling HOOMD with CUDA 9 for the Volta architecture (sm_70) gives a slew of compile warnings about the deprecated sm_30-era warp intrinsics (__shfl, etc.), which are now replaced by "_sync" versions of the same functions that take a member thread mask. More generally, any implicit warp-synchronous programming should be avoided because of Volta's independent thread scheduling... that is harder to track down, but hopefully there isn't any to worry about.
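For reference, a minimal sketch of what the intrinsic migration looks like (an illustrative kernel, not actual HOOMD code):

```cuda
#include <cuda_runtime.h>

__global__ void shfl_migration_sketch(const float* d_in, float* d_out)
{
    float x = d_in[threadIdx.x];

    // Pre-CUDA 9 form, deprecated on sm_70:
    //   float y = __shfl_down(x, 1);

    // CUDA 9+ form: the caller supplies a mask naming the participating lanes
    // and must guarantee that every lane in the mask reaches this call.
    const unsigned int member_mask = 0xffffffffu; // full warp participates
    float y = __shfl_down_sync(member_mask, x, 1);

    d_out[threadIdx.x] = y;
}
```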

Some of the warnings are coming from CUB, some are coming from hoomd code, and some seem to be coming from moderngpu. Most of these intrinsics are used for warp scan / reduce, with various "my_*" versions implemented. CUB has templated functions that can do these operations, and also has wrappers around some of the lower-level shuffle instructions. I would propose to do the following:

  1. Bump CUB to version 1.7.4, which claims to support Volta and should silence any warnings from that library.

  2. Look for "easy to replace" operations (scan, reduce, up, down) that can be done using CUB, and update those parts of the code (see the sketch after this list).

  3. For remaining warnings, look to phase out moderngpu in favor of either CUB or thrust. moderngpu doesn't appear to have been significantly developed in quite a while, so it is unclear how reliable it will be on Volta. Plus, this should lop off an external dependency.
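For item 2, the kind of replacement I have in mind looks roughly like this (a sketch only, assuming one warp per block; actual block sizes and types would follow what each kernel already uses):

```cuda
#include <cub/cub.cuh>

__global__ void cub_warp_sum_sketch(const int* d_in, int* d_out)
{
    // Full-warp (32-thread) sum reduction using CUB instead of a handwritten
    // "my_*" shuffle loop.
    typedef cub::WarpReduce<int> WarpReduce;
    __shared__ typename WarpReduce::TempStorage temp_storage;

    int thread_data = d_in[threadIdx.x];

    // The aggregate is valid in lane 0 only.
    int warp_sum = WarpReduce(temp_storage).Sum(thread_data);

    if (threadIdx.x == 0)
        d_out[0] = warp_sum;
}
```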

Thoughts?

joaander commented 6 years ago

Use cub 1.7.4

refs #292

joaander commented 6 years ago

Original comment by Jens Glaser (Bitbucket: jens_glaser, GitHub: jglaser).


I am toying with the idea of testing the gunrock library for connected components labeling: https://github.com/gunrock/gunrock

If successful, I'd make it a (mostly header-only) dependency.

It seems to be quite well engineered and was recommended to me by an NVIDIA engineer. On closer inspection, however, it appears to require moderngpu 2.0 as a dependency (sigh): https://github.com/moderngpu/moderngpu

The last major development effort for moderngpu occurred in 2016, though it may not be that hard to make it Volta compatible if we needed to. At least it should compile with C++11.

Any thoughts on gunrock though?

joaander commented 6 years ago

Original comment by Michael Howard (Bitbucket: mphoward, GitHub: mphoward).


I'm mainly opposed to adding moderngpu because I don't think hoomd should depend on three libraries providing similar functionality. It looks like the author of gunrock added some commits to make moderngpu 1.1 compatible with CUDA 9, so I guess it is possible without too much effort.

Is it absolutely necessary to use that library for what you want to do, or are there other options?

joaander commented 6 years ago

Original comment by Jens Glaser (Bitbucket: jens_glaser, GitHub: jglaser).


I see, it indeed looks like they're using 1.1, not 2.0. I, too, would be happy if a library didn't introduce dependencies of its own and we didn't have to keep copies of libraries with similar functionality around.

I did evaluate nvgraph, which comes with the CUDA toolkit, and while I got connected component labeling (CCL) working with an almost naive BFS, it was terribly slow in the general case of many small connected components. And CCL isn't really an officially supported functionality.

There is also the groute library. It has fewer dependencies (only CMake 3.2, gcc 4.9, and CUDA 7.5), and gunrock and groute compare against each other in their papers and in their code. Since I initially failed to compile gunrock due to an internal nvcc error in conjunction with gcc 6, I'll give groute a try, too.

joaander commented 6 years ago

Original comment by Jens Glaser (Bitbucket: jens_glaser, GitHub: jglaser).


OK. I was lucky. I found a minimal and fast (!) implementation of CC. See here and for the code, see here.

No need for bloated libraries with a lot of dependencies, just a single .cuh file!

joaander commented 6 years ago

Nice!

joaander commented 6 years ago

Original comment by Michael Howard (Bitbucket: mphoward, GitHub: mphoward).


Cool!

I've been trying to get CUB to work in place of the handwritten warp scan / reduce code, but I'm running into issues. When I use the whole warp to process one particle, everything is fine, but CUB doesn't seem to like smaller "logical" warps. For example, the scan will sometimes fail even with 1 thread per particle, where the scan shouldn't do anything at all. The failures depend on the data type involved: int seems to be OK (usually), whereas char is not.
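For context, the sub-warp ("logical" warp) usage I mean is roughly the following (a sketch with an 8-thread logical warp and a 32-thread block; this is not the actual neighbor list kernel):

```cuda
#include <cub/cub.cuh>

__global__ void logical_warp_scan_sketch(const int* d_in, int* d_out)
{
    // 8-thread logical warps: each group of 8 consecutive lanes scans independently.
    const int logical_warp_threads = 8;
    typedef cub::WarpScan<int, logical_warp_threads> WarpScan;

    // One TempStorage per logical warp in the (32-thread) block.
    __shared__ typename WarpScan::TempStorage temp_storage[32 / logical_warp_threads];

    const int logical_warp_id = threadIdx.x / logical_warp_threads;
    int flag = d_in[threadIdx.x]; // e.g., a 0/1 "keep this neighbor" flag

    int exclusive_prefix;
    WarpScan(temp_storage[logical_warp_id]).ExclusiveSum(flag, exclusive_prefix);

    d_out[threadIdx.x] = exclusive_prefix;
}
```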

Thoughts?

joaander commented 6 years ago

You aren't the only one. Others have reported issues with warp scan and small logical warps: https://github.com/NVlabs/cub/issues/112

If cub's sub-warp primitives are not reliable, then perhaps we should fall back on plan b and consolidate our own sub-warp scan routines in one place in the code. Hopefully we can still rely on cub for block-wide primitives....

As far as int vs. char goes, I'm not sure there is anything to gain from keeping char values in registers. I haven't checked recent architectures, but it used to be the case that registers were at least 32 bits.

joaander commented 6 years ago

Original comment by Michael Howard (Bitbucket: mphoward, GitHub: mphoward).


I agree, it is probably easier to just consolidate the handful of sub-warp primitives that we need rather than try to debug the code paths CUB might be taking. (It's not like we were using it out-of-the-box anyway, since we were specializing for shuffle.) We could still take advantage of CUB's ShuffleUp / ShuffleDown wrappers, which hide the need for _sync and will work for arbitrary data types. We should probably try to set the bits in the member mask correctly, since I think early-exiting threads (e.g., in the neighbor list) could lead to deadlock if we use 0xffffffff. But we will need to test on Volta to know that.

The char / int thing is a mystery.

joaander commented 6 years ago

Original comment by Jens Glaser (Bitbucket: jens_glaser, GitHub: jglaser).


It seems unnecessary to use a wrapper for a function as lightweight as __shfl (EDIT: perhaps the templates you mentioned are an argument for it). And we have to face the migration to _sync anyway. In the neighbor list, what do you mean by early-exiting threads? I thought the purpose of _sync was to coerce the warp together? And I don't remember there being intra-warp divergence in the current neighbor list code... (except when the warp segment is smaller than 32 threads, of course)

joaander commented 6 years ago

Original comment by Michael Howard (Bitbucket: mphoward, GitHub: mphoward).


The reason to use the wrapper is the 64-bit cases like in PotentialPairGPU.cuh, where there is __my_shfl_down, which wraps __shfl_down for float and issues two shuffle instructions on an int2 for a double. We would also need preprocessor macros to select between __shfl_down and __shfl_down_sync for CUDA < 9. If we're already using wrappers like this, why not use one from a library we already have that supports all data types?
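For concreteness, the pattern I mean is roughly this, written here with the _sync intrinsic (the wrapper name is hypothetical; the current code uses the pre-CUDA 9 form):

```cuda
// Shuffle a double down the warp as two 32-bit halves, since the shuffle
// hardware moves 32-bit registers. Sketch only, not the actual code in
// PotentialPairGPU.cuh.
__device__ inline double shfl_down_double(double x, unsigned int delta,
                                          unsigned int member_mask = 0xffffffffu)
{
    int hi = __double2hiint(x);
    int lo = __double2loint(x);
    hi = __shfl_down_sync(member_mask, hi, delta);
    lo = __shfl_down_sync(member_mask, lo, delta);
    return __hiloint2double(hi, lo);
}
```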

And yes, the purpose of sync is to coerce threads together for the instruction, but don't you get deadlock if a thread never enters that synchronization point? In NeighborListGPUBinned.cu, there is a break statement for the first thread in the cooperative group, while the rest of the threads from the group proceed even if they have no work (ignoring any distance checks). Hence, when you go to do the scan with sync, there could be a problem since the first thread is never supposed to get there. I honestly don't quite get why it works now. (I guess the shuffle can still read whatever the last value in the thread0 register was, which is set to 0 before the break?) Groups within the warp can also presumably exit early from the while loop if you have independent thread scheduling, since each group may run out of particles to process at different times, meaning that synchronizing all threads could also stall.

Please correct me if I'm wrong about either of those!

joaander commented 6 years ago

Original comment by Jens Glaser (Bitbucket: jens_glaser, GitHub: jglaser).


I'll have to look at the code in NeighborListGPUBinned.cu again and let you know. My recollection is that it is warp synchronous programming, but not obviously so. (But I'll have to check)

joaander commented 6 years ago

Original comment by Michael Howard (Bitbucket: mphoward, GitHub: mphoward).


When I was debugging the cub WarpScan, I put some printf statements at the end of the while loop; not all threads printed anything, and the number of prints even on the first iteration was less than 32, which is why I thought there was an early exit. That was with CUDA 8 on a GTX 1080.

joaander commented 6 years ago

Original comment by Jens Glaser (Bitbucket: jens_glaser, GitHub: jglaser).


Hmm, I assume you noticed line 209 in NeighborListGPUBinned.cu which should ensure that the full warp segment makes it through the iteration? Also, only a subset of the warp (the segment) participates in the scan.

joaander commented 6 years ago

Original comment by Michael Howard (Bitbucket: mphoward, GitHub: mphoward).


Doesn't line 209 break thread0 from the loop and leave the others in the segment running through the iteration? If thread0 is done then all threads in the segment are done, but only thread0 has !(threadIdx.x % threads_per_particle) == true since the mod evaluates nonzero for all other threads in the segment, right? It seems like what you want is to broadcast done from thread0 to the rest of the segment. But, there's no reason to do this synchronization since when all threads have done == true, they will all skip the cell list check (211-264), the scan will evaluate all 0s so no writing gets done, and then it will terminate the loop since done is true.

If threads are independent, I think you now have to worry about segments terminating this loop at different times, so forcing all threads in the warp to sync is not good. Also, when the number of threads launched is not a multiple of 32, you also can get threads in the warp returning before they ever reach that sync point.

joaander commented 6 years ago

Original comment by Michael Howard (Bitbucket: mphoward, GitHub: mphoward).


To follow up on this, I think I ended up finding the bug in the CUB scan code that was causing the failures (it has to do with the implementation of a shuffle instruction). I commented on the issue that Josh linked above, but I'm not hopeful that this will be resolved anytime soon. AFAIK, CUB does not use sub-warp primitives for any of the device-wide operations that we use, so those should be safe.

As Jens points out, the types of operations that we are doing are pretty simple. We would likely get slightly better performance by using CUB's optimized instructions for special data types / operators, but we certainly won't hurt anything by continuing to use the generic code that we have. Also, shuffle seems to play nice with both float and double for the versions of CUDA we support, even if it is not documented in older versions, so we don't need to worry about that.

I propose to follow plan b and consolidate the warp code that we have into one header. I will write light wrappers around the shuffle functions to support the sync / non-sync versions. These wrappers will take a member mask, which we can compute by bit shifting based on the size of the sub-warp, and will do nothing with it for CUDA < 9. I think all we need for now are inclusive / exclusive prefix sum and sum reduction. How does that sound?
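Roughly, the shuffle wrapper I have in mind would look like this (names are placeholders, not the final header; T is assumed to be a type the shuffle intrinsics support):

```cuda
// Shuffle-down for a sub-warp of `width` threads (width a power of two, <= 32).
// The member mask is computed by bit shifting; on CUDA < 9 the mask is ignored
// and the legacy intrinsic is used. Sketch only.
template<typename T>
__device__ inline T warp_shfl_down(T val, unsigned int delta, unsigned int width)
{
    // Mask covering the lanes of the sub-warp this thread belongs to.
    const unsigned int lane = threadIdx.x % 32;
    const unsigned int start = (lane / width) * width;
    const unsigned int member_mask =
        (width == 32) ? 0xffffffffu : (((1u << width) - 1u) << start);

#if defined(__CUDACC_VER_MAJOR__) && __CUDACC_VER_MAJOR__ >= 9
    return __shfl_down_sync(member_mask, val, delta, width);
#else
    (void)member_mask; // no mask argument before CUDA 9
    return __shfl_down(val, delta, width);
#endif
}
```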

joaander commented 6 years ago

Your plan b sounds good, and is a fairly minimal change compared to what we have now. Thanks for taking the time and effort to work on this!

joaander commented 6 years ago

Original comment by Michael Howard (Bitbucket: mphoward, GitHub: mphoward).


I spoke too soon: CUB has issued a bug fix for the sub-warp primitives (v1.8.0). Could you bump CUB to this release so I can try again?

joaander commented 6 years ago

Bump cub to v1.8.0

refs #292

joaander commented 6 years ago

Excellent. I've bumped cub up to v1.8.0.

joaander commented 6 years ago

Original comment by Michael Howard (Bitbucket: mphoward, GitHub: mphoward).


Update: I successfully replaced the warp scans in the neighbor lists with the updated CUB, which was straightforward to do. There is an issue with the warp reduce for the pair potentials, though. Currently, the number of threads per particle is a runtime variable for the pair potentials, whereas it was a template parameter for the neighbor list. CUB needs the (sub-)warp size to be a template argument so that it can unroll the for loop of shuffles.

It wouldn't be too hard to add this template parameter to the pair potentials, but doing so risks blowing up the size of the pair potential binaries. There are currently 3 flags as template arguments: (1) shift mode, (2) whether to compute the virial, and (3) whether to read the neighbor list from global memory (if it exceeds the maximum size of a 1D texture). That's 8 kernel instantiations per potential. If we also add the threads per particle, that multiplies the number of instantiations by 6, which could get very large.

The third flag could be dropped: it is only necessary for compute capability < 3.5, which doesn't support __ldg. If we don't need to bind textures on the host, we could just use a few defines in the kernel to use __ldg when available and plain global memory reads otherwise. Do you think there is a significant performance benefit (e.g., a lower register count) from using templates for the other two flags that justifies keeping them at the expense of large binaries?
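Something like the following is what I mean by a define (the macro name is hypothetical, shown only to illustrate the idea):

```cuda
// Read through the read-only data cache where the architecture supports it
// (compute capability >= 3.5); otherwise fall back to a plain global load.
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 350
#define PAIR_READ(ptr) __ldg(ptr)
#else
#define PAIR_READ(ptr) (*(ptr))
#endif

// Usage inside a kernel body would then be, e.g.:
//   unsigned int n_neigh = PAIR_READ(d_n_neigh + idx);
```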

joaander commented 6 years ago

Anything needed for compute 2.x support can be dropped. I guess we still need the non-__ldg code path for compute 3.0.

At the time those two template flags were written (G200 era or so), they made a big difference. It is worth running some benchmarks to check whether they still provide a benefit on modern hardware. To keep all other variables controlled, we can run the same benchmark with the existing code and again with both flags forced to true, and compare the kernel timings in the profiler.

joaander commented 6 years ago

Original comment by Jens Glaser (Bitbucket: jens_glaser, GitHub: jglaser).


I am worried not only about binary size but also about compile time. If you find that using the templated CUB warp scans provides a benefit, not only in maintainability but perhaps also in performance, you might want to put every potential in a separate .cu file to make parallel compilation possible.

joaander commented 6 years ago

Original comment by Michael Howard (Bitbucket: mphoward, GitHub: mphoward).


OK, what would be the appropriate benchmark hardware, a P100? One possible issue with forcing the flags to true is that they trigger additional computations, so it wouldn't give a true measure of the performance difference between including the code path in the compilation and skipping / following it at runtime.

joaander commented 6 years ago

Original comment by Michael Howard (Bitbucket: mphoward, GitHub: mphoward).


(Another option is to just leave it alone and keep a non-CUB version of the warp tools that can be configured at runtime. But this again comes down to whether or not it's worth using CUB. Maybe there is a fundamental difference in design philosophy, configuring kernel launch parameters at compile time vs. at runtime, that is challenging to resolve.)

joaander commented 6 years ago

You are right: to test properly, we would need to move the template arguments to function call arguments and see how that performs. And yes, P100 is a reasonable minimum target for performance tuning now that K20 systems will soon be on their way out.

I can test this locally on P100, but I don't know when I will have the bandwidth to do so. April at the earliest. I'll bump this back to release in v2.4.

joaander commented 6 years ago

Original comment by Michael Howard (Bitbucket: mphoward, GitHub: mphoward).


OK, sounds good. I am in Europe for the next two weeks and am also finishing my thesis, so I have very limited time in the near future as well. I will try to finish the CUB / template changes (letting it just compile the large binary for now), and then performance testing can be done from there.

joaander commented 6 years ago

Original comment by Michael Howard (Bitbucket: mphoward, GitHub: mphoward).


Pasting this in here from pull request #458 to keep the discussion in one place:

This PR replaces the handwritten warp scans and reductions with the CUB equivalents. In order to force using only the shuffle-based versions in CUB, I have written wrappers around the warp operations, which are now in hoomd/WarpTools.cuh. The operators are all unit tested in isolation, so we can have a bit of confidence there. (This was helpful for tracking down a bug in CUB.) Unfortunately, I don't think any of the classes actually test with multiple threads, since the autotuner initially reports 1.

These are templated operators, and so the size of the sub-warp group (threads per X) is selected using template tricks. This has slowed down compilation, particularly of the pair potentials, and probably bloated the binary. As we discussed before, this could be sped up by eliminating some of the template parameters if performance is not affected. But I don't have time to do the benchmarks right now, so I wanted to submit this PR and then maybe someone else can work on that. I also don't have access to a V100 anymore, so this will need to be checked for correctness on that hardware too.

I had to modify the Tersoff potential kernel launch quite a bit, so it would be good if someone who knows more about that code could check it.

I also modified the cell-list neighbor lists so that threads do not exit early before the scan. It would be good to have a look at this as well.

The anisotropic pair potential is using a reduction, but it is not shuffle-based (I don't know why). This could be made shuffle-based by copying from PotentialPairGPU, but I left it alone for now.

HPMC is using __shfl_xor to do butterfly reduction. This could be replaced by a reduce and broadcast with CUB, or by implementing butterfly reduction using CUB operations.
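For reference, the butterfly pattern in its _sync form is roughly the following (a generic sketch, not the actual HPMC kernel code):

```cuda
// Warp-wide butterfly (xor) sum: after the loop, every participating lane
// holds the full sum, so no separate broadcast is needed.
__device__ inline float warp_butterfly_sum(float val,
                                           unsigned int member_mask = 0xffffffffu)
{
    for (int lane_mask = 16; lane_mask > 0; lane_mask >>= 1)
        val += __shfl_xor_sync(member_mask, val, lane_mask);
    return val;
}
```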

joaander commented 6 years ago

Use CUB for warp scan in neighbor lists.

Since the warp scan requires Kepler and above, Fermi support and workarounds are dropped from the neighbor lists.

refs #292

joaander commented 6 years ago

Add wrapper to CUB warp reduce

refs #292

joaander commented 6 years ago

Error out when sm70 build requested

refs #292

joaander commented 6 years ago

Original comment by Jens Glaser (Bitbucket: jens_glaser, GitHub: jglaser).


Just trying this branch... it seems about 15% faster than master in a single-precision rigid body simulation. Is this expected? In any case, this is good.

joaander commented 6 years ago

Merged in cuda9-v100 (pull request #459)

Use CUB for warp reductions and scans

fixes #204 fixes #292

Approved-by: Jens Glaser <jsglaser@umich.edu>
Approved-by: Michael Howard <mphoward@utexas.edu>