lattice / quda

QUDA is a library for performing calculations in lattice QCD on GPUs.
https://lattice.github.io/quda

Investigate how to do a parallel build #34

Closed: maddyscientist closed this issue 10 years ago

maddyscientist commented 12 years ago

The compile time of QUDA, especially in multi-GPU mode, hinders development. We should work out how to enable parallel building of QUDA, namely how to split dslash_quda.cu so that it can be compiled in parallel. Everything currently lives in a single file because textures and constants must have file scope.
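
As a reminder of why the file-scope requirement forces a single translation unit, here is a minimal sketch (hypothetical names and the legacy texture-reference API of that era, not QUDA's actual declarations):

```cuda
// Both __constant__ variables and texture references are scoped to the
// translation unit, so every kernel that reads them must be compiled in
// the same .cu file where they are declared.
__constant__ int X[4];                                  // local lattice dimensions
texture<float4, 1, cudaReadModeElementType> spinorTex;  // input spinor field

__global__ void toyDslash(float4 *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int volume = X[0] * X[1] * X[2] * X[3];
    if (i < volume) out[i] = tex1Dfetch(spinorTex, i);  // symbols visible only here
}
```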

This problem is only going to get worse as more and more kernels are incorporated into dslash_quda.cu.

rbabich commented 12 years ago

I strongly suspect that it's the compilation stage that's expensive, rather than the assembler.

I'm not sure, but it looks like it may be possible to look up a constant in a different file from where it was defined using cudaGetSymbolAddress(). This capability seems to have been added around CUDA 3.1.
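
A minimal host-side sketch of what I have in mind (X1 here is just a stand-in name for one of the dslash constants):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical constant, standing in for one of the dslash parameters.
__constant__ int X1;

int main()
{
    // Recover the device address of the __constant__ symbol on the host.
    void *devPtr = nullptr;
    cudaGetSymbolAddress(&devPtr, X1);

    // The address can then be written through like ordinary device memory.
    int value = 256;
    cudaMemcpy(devPtr, &value, sizeof(value), cudaMemcpyHostToDevice);

    printf("X1 lives at device address %p\n", devPtr);
    return 0;
}
```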

rbabich commented 12 years ago

I take it back. cudaGetSymbolAddress() will give us the address in host code, but I don't have any ideas for accomplishing this in a kernel.

rbabich commented 12 years ago

We could stop using explicit constants altogether and put them into a single structure that gets passed into the kernel. This is what we already do for DiracParam, and on Fermi the structure would be read through the constant cache anyway. I don't think this is really viable, though, since the kernel arguments would eat up a big chunk of shared memory on GT200.

Since most of the constants change infrequently, we could even put the structure in global memory and just pass a constant pointer into the kernel. On Fermi, reads from the structure would be recognized as "uniform accesses" and again go through the constant cache. Performance would be terrible on GT200, though.
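
Roughly, the two options look like this (a sketch only; the struct and field names are made up, not QUDA's actual parameter block):

```cuda
#include <cuda_runtime.h>

// Hypothetical parameter block standing in for the file-scope constants.
struct DslashConstants {
    int X[4];        // local lattice dimensions
    int volume;      // local volume
    float anisotropy;
};

// Option 1: pass the whole structure by value as a kernel argument.
// On sm_2.x kernel arguments are delivered through constant memory; on
// sm_1.x they are staged in shared memory, which is the cost noted above.
__global__ void dslashByValue(float *out, DslashConstants c)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < c.volume) out[i] = c.anisotropy * i;
}

// Option 2: keep the structure in global memory and pass a const pointer.
// On Fermi the uniform loads are served through the constant cache; on
// GT200 they are ordinary global loads, hence the performance worry.
__global__ void dslashByPointer(float *out, const DslashConstants *c)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < c->volume) out[i] = c->anisotropy * i;
}
```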

maddyscientist commented 12 years ago

I like this solution.

I guess at some point we have to decide when to cut support for older versions of the CUDA architecture going forward, e.g., keep sm_1.x support only in QUDA 0.4.x and require >= 2.x for QUDA 0.5.x. This isn't a bad option, since the GT200s work fine with the current version of QUDA and will continue to do so.

Note that this solution may also make multigrid much easier, since there we have multiple geometries to deal with. Do we realistically need GT200 support for multigrid? The only reason I can really think of to maintain sm_1.x support is for development purposes...

rbabich commented 12 years ago

Hmm. Guochun's been doing his fatlink testing on Longhorn (which consists of C1060s), and there are a couple hundred GTX 285s at JLab, and most NV laptops in the wild are G92b, GT216, etc. (which I guess speaks to the "developmental reasons" you mention). I guess my feeling is that we can't stop supporting sm_1.x completely, and having to maintain 0.4.x sounds like a pain. Also, at least at the beginning, multigrid will probably be most useful for exactly the sort of capacity jobs running on the 285s at JLab, if we can make it work.

I wonder if we can have our cake and eat it too, though, by going with option 1 (passing in the structure rather than a pointer). This will hurt performance a little on GT200, but it might not be too bad. I count roughly 292 bytes worth of constants at the moment, so less than 2% of the 16 KB of shared memory on GT200. Of course, there might be some overhead associated with copying the parameters into shared memory. (Does this happen for each thread block, I wonder, or just once per SM?) I think we can probably eliminate at least half the constants without hurting performance (e.g., many of the X4X3Xwhatever). That would be nice to do anyway.
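
For concreteness, the launch site under option 1 would look something like this, reusing the hypothetical DslashConstants and dslashByValue names from the sketch above:

```cuda
// Fill in the parameter block on the host and pass it by value at launch.
DslashConstants c = {};
c.X[0] = c.X[1] = c.X[2] = 24; c.X[3] = 64;
c.volume = c.X[0] * c.X[1] * c.X[2] * c.X[3];
c.anisotropy = 1.0f;

float *d_out = 0;
cudaMalloc((void **)&d_out, c.volume * sizeof(float));

// These kernel arguments (a few hundred bytes at most, cf. the ~292-byte
// count above) are what get staged in shared memory per block on GT200.
int threads = 128;
int blocks = (c.volume + threads - 1) / threads;
dslashByValue<<<blocks, threads>>>(d_out, c);
```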

maddyscientist commented 12 years ago

We might be able to reduce the footprint by using chars or shorts where possible; for X1 through X4, for example, we're never likely to require a local size of more than 256.
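
A sketch of what such a packed parameter block might look like (field names and sizes are illustrative; note that 256 does not fit in an unsigned char, so shorts are the safe choice for the dimensions):

```cuda
#include <cstdint>

struct DslashConstantsPacked {
    int16_t X[4];     // local lattice dimensions: 16 bits is plenty for sizes <= 256
    int32_t volume;   // X[0]*X[1]*X[2]*X[3] can exceed 16 bits, so keep it wide
    float   anisotropy;
};

// 4*2 + 4 + 4 = 16 bytes here, versus 24 bytes with int dimensions; applied
// to the full list of constants, this is where the footprint savings come from.
static_assert(sizeof(DslashConstantsPacked) == 16, "unexpected padding in packed block");
```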

maddyscientist commented 12 years ago

Also note that the shared memory footprint isn't that important on GT200, since the dslash doesn't need any shared memory there thanks to the large per-thread register limit. Thus, I suggest we go with option 1.

maddyscientist commented 10 years ago

Closing this issue, since this problem has now been essentially solved by splitting the dslash_quda.cu file into multiple files (issue #68).