lattice / quda

QUDA is a library for performing calculations in lattice QCD on GPUs.
https://lattice.github.io/quda

Enable support for multiple right hand sides #36

Open maddyscientist opened 12 years ago

maddyscientist commented 12 years ago

A major optimization worth exploring is whether amortizing the gauge field loads by acting on multiple spinors simultaneously leads to a large speedup. This matters most for the staggered dslash kernel, since it is the kernel most bound by gauge field loads.
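To put rough numbers on this, here is a back-of-the-envelope traffic model. It is only a sketch under stated assumptions: 8 uncompressed links per site (18 reals each), 8 neighbour spinor loads plus 1 store, no cache reuse of spinors, and no Naik long links (which would add further links and neighbour loads). The names `Traffic` and `dslash_traffic` are made up for illustration and are not QUDA functions.

```cpp
#include <cassert>

// Real numbers moved per lattice site for one dslash application on n_rhs
// right-hand sides, assuming each gauge link is loaded once and shared.
struct Traffic {
  double gauge;   // reals loaded for gauge links
  double spinor;  // reals loaded and stored for spinors, summed over all rhs
  double gauge_fraction() const { return gauge / (gauge + spinor); }
};

// reals_per_spinor: 6 for a staggered colour vector, 24 for a Wilson spinor.
Traffic dslash_traffic(int reals_per_spinor, int n_rhs) {
  assert(n_rhs > 0);
  const int links = 8;                // forward and backward links in 4 dimensions
  const int reals_per_link = 18;      // uncompressed 3x3 complex matrix (assumption)
  const int spinor_accesses = 8 + 1;  // 8 neighbour loads + 1 store
  return {static_cast<double>(links * reals_per_link),
          static_cast<double>(spinor_accesses * reals_per_spinor * n_rhs)};
}
```

Under these assumptions, `dslash_traffic(6, 1).gauge_fraction()` is about 0.73 for staggered against 0.40 for Wilson with `dslash_traffic(24, 1)`. With 8 right-hand sides the staggered figure drops to 0.25, which is the amortization being proposed.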

A simple way to explore this is to introduce a y dimension to the thread blocks, where the y extent corresponds to the number of right hand sides.

The easiest way to do this is probably to create a 5-dimensional spinor field, where the length of the 5th dimension corresponds to the number of right hand sides. I believe the only change required to the dslash kernel indexing would be to add to the spinor field index an offset equal to the y thread index multiplied by the length of the 4-dimensional spinor field.

spinor_index += threadIdx.y * length_4d;  // length_4d = length of the 4-dimensional spinor field

By definition, the gauge field indexing is independent of threadIdx.y.
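The indexing above can be tried out on the host without a GPU. The following is a minimal sketch, not the QUDA kernel: `ThreadIdx` is a stand-in for CUDA's `threadIdx`, `length_4d` stands for the length of the 4-dimensional spinor field, and the "gauge field" is reduced to one number per site so that the reuse is easy to count. All names here are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side stand-in for CUDA's threadIdx (only x and y are used here).
struct ThreadIdx { unsigned x, y; };

// Index into a 5-d spinor field stored as n_rhs contiguous 4-d fields.
std::size_t spinor_index(std::size_t site_4d, ThreadIdx t, std::size_t length_4d) {
  assert(site_4d < length_4d);
  return site_4d + t.y * length_4d;
}

// Emulates one launch: every (x, y) thread scales its spinor element by the
// gauge value of its 4-d site. The gauge index depends on x only, never on y.
// Returns how often each gauge value was read.
std::vector<int> toy_launch(const std::vector<double>& gauge,
                            const std::vector<double>& in,
                            std::vector<double>& out, unsigned n_rhs) {
  const std::size_t length_4d = gauge.size();
  assert(in.size() == length_4d * n_rhs && out.size() == in.size());
  std::vector<int> gauge_reads(length_4d, 0);
  for (unsigned y = 0; y < n_rhs; ++y) {
    for (unsigned x = 0; x < length_4d; ++x) {
      const ThreadIdx t{x, y};
      const std::size_t s = spinor_index(x, t, length_4d);
      out[s] = gauge[x] * in[s];
      ++gauge_reads[x];
    }
  }
  return gauge_reads;
}
```

Each gauge value is read `n_rhs` times, once per y index. On the GPU those repeated reads are the ones that could be served from cache instead of device memory.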

Even with this simple change, kernel performance will likely improve significantly, since gauge field loads should be reused through the L1, texture, and L2 caches. Further improvement is likely possible by managing the reuse explicitly in shared memory.

Beyond the kernel, there are multiple changes and additions required:

Moving to a 5-d spinor field complicates the communications packing and unpacking logic, which currently expects a 4-d spinor field. How to solve this requires some thought. I think a simple solution would be to create the contiguous 5-dimensional spinor field, but also create an array of 4-dimensional spinor fields that are really just pointers to the corresponding parts of the 5-dimensional field. The communications routines can then be called on the 4-dimensional reference spinor fields, while the dslash kernel deals with the 5-dimensional field (aside: this may be a very quick and dirty way to get multi-GPU domain wall).
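The reference-field idea could look roughly like the sketch below. This is an illustration only: `SpinorField5d`, `SpinorView4d` and `pack_face` are hypothetical names, plain `double` stands in for the real spinor storage, and `pack_face` is a placeholder for whatever the existing 4-d packing routine does.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Non-owning view of one 4-d spinor field inside the 5-d allocation.
struct SpinorView4d {
  double* data;
  std::size_t length;
};

// Owns one contiguous 5-d buffer and exposes each right-hand side as a 4-d view.
class SpinorField5d {
 public:
  SpinorField5d(std::size_t length_4d, std::size_t n_rhs)
      : buffer_(length_4d * n_rhs, 0.0) {
    for (std::size_t r = 0; r < n_rhs; ++r)
      views_.push_back(SpinorView4d{buffer_.data() + r * length_4d, length_4d});
  }
  // Copying would leave the views pointing into the old buffer, so forbid it.
  SpinorField5d(const SpinorField5d&) = delete;
  SpinorField5d& operator=(const SpinorField5d&) = delete;

  double* data() { return buffer_.data(); }  // what the dslash kernel would see
  std::size_t n_rhs() const { return views_.size(); }
  SpinorView4d& view(std::size_t r) {
    assert(r < views_.size());
    return views_[r];
  }

 private:
  std::vector<double> buffer_;
  std::vector<SpinorView4d> views_;
};

// Stand-in for an existing 4-d communications routine: here it only copies
// the first `face` elements of the view into a send buffer.
std::vector<double> pack_face(const SpinorView4d& v, std::size_t face) {
  assert(face <= v.length);
  return std::vector<double>(v.data, v.data + face);
}
```

The kernel side would take `data()` and index across all right hand sides, while the communications side would loop over `view(r)` and never need to know about the 5th dimension.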

mathiaswagner commented 9 years ago

One more thing we could do: the mixed precision multi-shift inverter requires refinement for all higher shifts. If we support different masses for different sources (rhs), we could also use a multi-source solver for the refinement step...
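As a toy illustration of a per-rhs shift, the sketch below applies `A + shift[k]` to right hand side `k` in a single pass. It assumes a 1-d periodic stencil as a stand-in for the real operator and treats the mass as a plain diagonal shift; `apply_shifted_multi_rhs` is a hypothetical name, not an existing interface.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Computes y_k = (A + shift[k]) x_k for every right-hand side k, where A is a
// toy 1-d periodic stencil (A x)_i = 2 x_i - x_{i+1} - x_{i-1}.
// x and y hold shift.size() fields of length n stored back to back.
void apply_shifted_multi_rhs(const std::vector<double>& x, std::vector<double>& y,
                             const std::vector<double>& shift, std::size_t n) {
  const std::size_t n_rhs = shift.size();
  assert(n > 0 && x.size() == n * n_rhs && y.size() == x.size());
  for (std::size_t i = 0; i < n; ++i) {
    // Neighbour lookup is done once per site and shared by all rhs, in the
    // same way the gauge field loads would be shared in the real kernel.
    const std::size_t up = (i + 1) % n;
    const std::size_t down = (i + n - 1) % n;
    for (std::size_t k = 0; k < n_rhs; ++k) {
      const double* xk = x.data() + k * n;
      y[k * n + i] = 2.0 * xk[i] - xk[up] - xk[down] + shift[k] * xk[i];
    }
  }
}
```

The only difference from a single-mass multi-rhs operator is that the shift is read from an array indexed by the rhs, so the refinement solves for all shifts could share one operator application.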

maddyscientist commented 9 years ago

Yes, I've thought of that before. That would be very easy once we have a multi-src solver implemented, and it would give a nice little performance boost.

mathiaswagner commented 9 years ago

Yes, nothing spectacular. Just easier if we think of different masses right away. For HISQ one could even think about different Naik eps. But before doing that we need support for passing that from MILC to QUDA.