lattice / quda

QUDA is a library for performing calculations in lattice QCD on GPUs.
https://lattice.github.io/quda

Investigate how to do a parallel build #34

Closed: maddyscientist closed this issue 10 years ago

maddyscientist commented 12 years ago

The compile time of QUDA, especially in multi-GPU mode, hinders development. We should work out how to enable parallel building of QUDA, namely how to split dslash_quda.cu so that it can be compiled in parallel. Everything currently lives in a single file because textures and constants must have file scope.
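
As a reminder of why the file-scope requirement forces a single translation unit, here is a minimal sketch (hypothetical names and the legacy texture-reference API of that era, not QUDA's actual declarations):

```cuda
// Both __constant__ variables and texture references are scoped to the
// translation unit, so every kernel that reads them must be compiled in
// the same .cu file where they are declared.
__constant__ int X[4];                                  // local lattice dimensions
texture<float4, 1, cudaReadModeElementType> spinorTex;  // input spinor field

__global__ void toyDslash(float4 *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int volume = X[0] * X[1] * X[2] * X[3];
    if (i < volume) out[i] = tex1Dfetch(spinorTex, i);  // symbols visible only here
}
```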

This problem is only going to get worse as more and more kernels are incorporated into dslash_quda.cu.

rbabich commented 12 years ago

I strongly suspect that it's the compilation stage that's expensive, rather than the assembler.

I'm not sure, but it looks like it may be possible to look up a constant in a different file from where it was defined using cudaGetSymbolAddress(). This capability seems to have been added around CUDA 3.1.
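
A minimal host-side sketch of what I have in mind (X1 here is just a stand-in name for one of the dslash constants):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical constant, standing in for one of the dslash parameters.
__constant__ int X1;

int main()
{
    // Recover the device address of the __constant__ symbol on the host.
    void *devPtr = nullptr;
    cudaGetSymbolAddress(&devPtr, X1);

    // The address can then be written through like ordinary device memory.
    int value = 256;
    cudaMemcpy(devPtr, &value, sizeof(value), cudaMemcpyHostToDevice);

    printf("X1 lives at device address %p\n", devPtr);
    return 0;
}
```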

rbabich commented 12 years ago

I take it back. cudaGetSymbolAddress() will give us the address in host code, but I don't have any ideas for accomplishing this in a kernel.

rbabich commented 12 years ago

We could stop using explicit constants altogether and put them into a single structure that gets passed into the kernel. This is what we already do for DiracParam, and on Fermi the structure would be read through the constant cache anyway. I don't think this is really viable, though, since the kernel arguments would eat up a big chunk of shared memory on GT200.

Since most of the constants change infrequently, we could even put the structure in global memory and just pass a constant pointer into the kernel. On Fermi, reads from the structure would be recognized as "uniform accesses" and again go through the constant cache. Performance would be terrible on GT200, though.
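
Roughly, the two options look like this (a sketch only; the struct and field names are made up, not QUDA's actual parameter block):

```cuda
#include <cuda_runtime.h>

// Hypothetical parameter block standing in for the file-scope constants.
struct DslashConstants {
    int X[4];        // local lattice dimensions
    int volume;      // local volume
    float anisotropy;
};

// Option 1: pass the whole structure by value as a kernel argument.
// On sm_2.x kernel arguments are delivered through constant memory; on
// sm_1.x they are staged in shared memory, which is the cost noted above.
__global__ void dslashByValue(float *out, DslashConstants c)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < c.volume) out[i] = c.anisotropy * i;
}

// Option 2: keep the structure in global memory and pass a const pointer.
// On Fermi the uniform loads are served through the constant cache; on
// GT200 they are ordinary global loads, hence the performance worry.
__global__ void dslashByPointer(float *out, const DslashConstants *c)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < c->volume) out[i] = c->anisotropy * i;
}
```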

maddyscientist commented 12 years ago

I like this solution.

I guess at some point we have to decide when to cut support for older versions of the CUDA architecture going forward, e.g., keep sm_1.x support only in QUDA 0.4.x and require >= 2.x for QUDA 0.5.x. This isn't a bad option, since the GT200s work fine with the current version of QUDA and will continue to do so.

Note that this solution may also make multigrid much easier, since there we have multiple geometries to deal with. Do we realistically need GT200 support for multigrid? The only reason I can really think of to maintain sm_1.x support is for development purposes...

rbabich commented 12 years ago

Hmm. Guochun's been doing his fatlink testing on Longhorn (which consists of C1060s), and there are a couple hundred GTX 285s at JLab, and most NV laptops in the wild are G92b, GT216, etc. (which I guess speaks to the "developmental reasons" you mention). I guess my feeling is that we can't stop supporting sm_1.x completely, and having to maintain 0.4.x sounds like a pain. Also, at least at the beginning, multigrid will probably be most useful for exactly the sort of capacity jobs running on the 285s at JLab, if we can make it work.

I wonder if we can have our cake and eat it too, though, by going with option 1 (passing in the structure rather than a pointer). This will hurt performance a little on GT200, but it might not be too bad. I count roughly 292 bytes worth of constants at the moment, so less than 2% of the 16 KB of shared memory on GT200. Of course, there might be some overhead associated with copying the parameters into shared memory. (Does this happen for each thread block, I wonder, or just once per SM?) I think we can probably eliminate at least half the constants without hurting performance (e.g., many of the X4X3Xwhatever). That would be nice to do anyway.
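
For concreteness, the launch site under option 1 would look something like this, reusing the hypothetical DslashConstants and dslashByValue names from the sketch above:

```cuda
// Fill in the parameter block on the host and pass it by value at launch.
DslashConstants c = {};
c.X[0] = c.X[1] = c.X[2] = 24; c.X[3] = 64;
c.volume = c.X[0] * c.X[1] * c.X[2] * c.X[3];
c.anisotropy = 1.0f;

float *d_out = 0;
cudaMalloc((void **)&d_out, c.volume * sizeof(float));

// These kernel arguments (a few hundred bytes at most, cf. the ~292-byte
// count above) are what get staged in shared memory per block on GT200.
int threads = 128;
int blocks = (c.volume + threads - 1) / threads;
dslashByValue<<<blocks, threads>>>(d_out, c);
```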

maddyscientist commented 12 years ago

We might be able to reduce the footprint by using chars or shorts where possible; for X1 through X4, for example, we're never likely to require a local size of more than 256.
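
A sketch of what such a packed parameter block might look like (field names and sizes are illustrative; note that 256 does not fit in an unsigned char, so shorts are the safe choice for the dimensions):

```cuda
#include <cstdint>

struct DslashConstantsPacked {
    int16_t X[4];     // local lattice dimensions: 16 bits is plenty for sizes <= 256
    int32_t volume;   // X[0]*X[1]*X[2]*X[3] can exceed 16 bits, so keep it wide
    float   anisotropy;
};

// 4*2 + 4 + 4 = 16 bytes here, versus 24 bytes with int dimensions; applied
// to the full list of constants, this is where the footprint savings come from.
static_assert(sizeof(DslashConstantsPacked) == 16, "unexpected padding in packed block");
```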

maddyscientist commented 12 years ago

Also note that the shared memory footprint isn't that important on GT200, since the dslash doesn't need any shared memory there thanks to the large per-thread register limit. Thus, I suggest we go with option 1.

maddyscientist commented 10 years ago

Closing this issue, since this problem has now been essentially solved by splitting the dslash_quda.cu file into multiple files (issue #68).