lattice / quda

QUDA is a library for performing calculations in lattice QCD on GPUs.
https://lattice.github.io/quda

Reduce the texture binding overhead through caching #89

Closed maddyscientist closed 11 years ago

maddyscientist commented 11 years ago

When we are strong scaling, one of the performance limiters is continually rebinding textures. We should work out a mechanism to cache texture bindings so that, within the course of a solve (or any other algorithm), textures are not continuously being rebound.

This should probably be designed with issues #34, #65 and #68 in mind. My first suggestion would be to use separate textures for each and every blas kernel. Each kernel launcher would keep track of whether the input field changed from the previous invocation of that kernel, in which case the texture needs to be bound to the new address. The one issue that comes into play here is that we may run out of texture references, so we would also need to keep some kind of FIFO stack tracking which texture should be unbound when we reach the hardware limit.
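To make the idea concrete, here is a minimal host-side sketch of such a bind cache with FIFO eviction. The `TextureBindCache` class and the `num_binds` counter are hypothetical stand-ins: in real code the marked lines would call `cudaBindTexture`/`cudaUnbindTexture`, and the slot count would reflect the hardware limit on bound texture references.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <unordered_map>

// Counts how often we actually (re)bind; in real code these sites would
// issue cudaBindTexture / cudaUnbindTexture calls.
static int num_binds = 0;

struct TextureBindCache {
  std::size_t max_slots;                       // hardware limit on bound references
  std::deque<const void*> fifo;                // bind order, oldest first
  std::unordered_map<const void*, int> bound;  // field pointer -> texture slot

  explicit TextureBindCache(std::size_t n) : max_slots(n) {}

  // Returns the slot for this field, rebinding only if the field address
  // changed since the previous invocation (the caching proposed above).
  int acquire(const void *field_ptr) {
    auto it = bound.find(field_ptr);
    if (it != bound.end()) return it->second;  // cache hit: no rebind needed

    int slot;
    if (bound.size() < max_slots) {
      slot = static_cast<int>(bound.size());   // free slot still available
    } else {
      const void *victim = fifo.front();       // evict the oldest binding (FIFO)
      fifo.pop_front();
      slot = bound[victim];
      bound.erase(victim);                     // cudaUnbindTexture would go here
    }
    ++num_binds;                               // cudaBindTexture would go here
    bound[field_ptr] = slot;
    fifo.push_back(field_ptr);
    return slot;
  }
};
```

With this, repeated blas calls on the same field pay the bind cost once, and only a change of input address (or slot exhaustion) triggers a rebind.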

Comments?

rbabich commented 11 years ago

> My first suggestion would be to use separate textures for each and every blas kernel.

Unfortunately, this brute force approach is likely to do more harm than good. It turns out that a kernel launch incurs some overhead for each texture bound, even if that texture isn't used by the kernel.

maddyscientist commented 11 years ago

The solution appears to be to use texture objects instead of texture references. This is a Kepler-only feature, so we will have to continue using texture references on Fermi and prior architectures.

maddyscientist commented 11 years ago

Regarding texture objects: this will require a non-trivial amount of work, since:

  1. Creating and destroying texture objects should only be done once, given the significant overhead of these operations. I would suggest that whenever a cudaGaugeField or cudaColorSpinorField is created, the associated texture object is created at the same time. It would then be destroyed when the destructor is called.
  2. Using texture objects requires that the textures are passed explicitly to the kernel, as opposed to through a static reference. This means that the dslash kernels will have to be rewritten. The blas kernels already use Texture-like objects, so the work there is likely less involved.
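Point 1 amounts to tying the texture object's lifetime to the field's. A heavily simplified, hypothetical field class illustrating the pattern (QUDA's actual cudaColorSpinorField is considerably more involved) might look like:

```cuda
#include <cuda_runtime.h>

// Hypothetical simplified field: the texture object is created once,
// alongside the device allocation, and destroyed with it -- never
// recreated per kernel launch.
struct DeviceField {
  float4 *v;                 // device allocation holding the field data
  size_t bytes;
  cudaTextureObject_t tex;   // created once, reused by every kernel

  explicit DeviceField(size_t nSites) : bytes(nSites * sizeof(float4)) {
    cudaMalloc(&v, bytes);

    cudaResourceDesc resDesc = {};
    resDesc.resType = cudaResourceTypeLinear;   // texture over linear memory
    resDesc.res.linear.devPtr = v;
    resDesc.res.linear.desc = cudaCreateChannelDesc<float4>();
    resDesc.res.linear.sizeInBytes = bytes;

    cudaTextureDesc texDesc = {};
    texDesc.readMode = cudaReadModeElementType; // return raw elements

    cudaCreateTextureObject(&tex, &resDesc, &texDesc, nullptr);
  }

  ~DeviceField() {
    cudaDestroyTextureObject(tex);
    cudaFree(v);
  }
};
```

The one-time `cudaCreateTextureObject` call replaces the per-launch `cudaBindTexture` entirely; any kernel can then take `tex` as an ordinary argument.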

Lastly, I note that the overhead of texture references appears to have a significant impact on multi-GPU performance in particular (observed from profiling). For large-scale running on Titan, fixing this should be fairly high priority.

maddyscientist commented 11 years ago

This has now been fully implemented for cudaColorSpinorField and cudaGaugeField: these classes now have cudaTextureObject_t members that are created and destroyed along with the fields, so they are only allocated once. These are employed in all blas and dslash kernels, with the texture objects passed explicitly to the kernels in a class. Compatibility with prior architectures has been retained, though it will be nice when we can deprecate the older architectures, since texture references are so cumbersome.

For Kepler architectures, texture objects are the default, though one can revert to using texture references by undefining the USE_TEXTURE_OBJECTS macro in include/quda_internal.h.
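As a sketch of what the two code paths look like side by side (using a hypothetical axpy kernel, not QUDA's actual blas code), the USE_TEXTURE_OBJECTS macro selects between a kernel argument and a file-scope reference:

```cuda
#include <cuda_runtime.h>

#ifdef USE_TEXTURE_OBJECTS
// Kepler path: the texture object created with the field is just a kernel
// argument, so no bind is needed at launch time.
__global__ void axpyKernel(float a, cudaTextureObject_t xTex, float4 *y, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    float4 x = tex1Dfetch<float4>(xTex, i);  // read x through the texture object
    y[i].x += a * x.x; y[i].y += a * x.y;
    y[i].z += a * x.z; y[i].w += a * x.w;
  }
}
#else
// Fermi and earlier: a static texture reference that must be rebound with
// cudaBindTexture whenever the input field's address changes.
texture<float4, 1, cudaReadModeElementType> xTexRef;

__global__ void axpyKernel(float a, float4 *y, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    float4 x = tex1Dfetch(xTexRef, i);       // read x through the bound reference
    y[i].x += a * x.x; y[i].y += a * x.y;
    y[i].z += a * x.z; y[i].w += a * x.w;
  }
}
#endif
```

The texture-object path also removes the global state of the reference, which is what makes passing textures "in a class" to the kernels possible.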

I consider this issue closed now (commit 4af1fa6880559bf1be5177ea63977fbf7bd936c6) since it is deployed completely for the linear solvers, though it still needs to be extended to cover the force and fat link kernels.