When tuning is active during multi-GPU runs, each GPU independently tunes each kernel. As a result, different GPUs can end up using different launch configurations for the final kernel launch, which makes binary reproducibility impossible. This was first discovered in #182.
A simple global reduction over the elapsed times during tuning would help in synchronous runs, but it causes hangs with asynchronous algorithms such as DD, where each GPU works on a local problem and may not even launch the tuning process for a specific kernel.
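For context, here is a minimal sketch of what such a synchronous reduction over elapsed time could look like and why it hangs. This is not QUDA code: the struct `TuneCandidate`, the function `pick_globally_consistent_config`, and the use of plain MPI are assumptions for illustration only.

```cpp
#include <mpi.h>

struct TuneCandidate {   // hypothetical per-rank tuning result
  float elapsed_ms;      // best measured time on this rank
  int block_x, grid_x;   // launch configuration that achieved it
};

TuneCandidate pick_globally_consistent_config(TuneCandidate local, MPI_Comm comm)
{
  int rank;
  MPI_Comm_rank(comm, &rank);

  // MINLOC reduction: find the rank that measured the smallest elapsed time.
  struct { float time; int rank; } in{local.elapsed_ms, rank}, out{};
  MPI_Allreduce(&in, &out, 1, MPI_FLOAT_INT, MPI_MINLOC, comm);

  // Broadcast that rank's launch configuration so every GPU uses the same one.
  MPI_Bcast(&local, sizeof(TuneCandidate), MPI_BYTE, out.rank, comm);

  // Both calls are collective: every rank must enter them for the same kernel,
  // in the same order. If one GPU never tunes this kernel, the others block
  // here forever -- the hang described above for asynchronous DD-style runs.
  return local;
}
```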
The problem also relates to the FIXME noted in tune.cpp:
```cpp
//FIXME: We should really check to see if any nodes have tuned a kernel that was not also tuned on node 0, since as things
// stand, the corresponding launch parameters would never get cached to disk in this situation. This will come up if we
// ever support different sub volumes per GPU (as might be convenient for lattice volumes that don't divide evenly).
```
We need a non-blocking solution to this.
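One possible direction, sketched below under the same assumptions (plain MPI, hypothetical names such as `PendingTuneSync`, `post_tune_sync`, and `try_resolve_tune_sync`): each rank keeps using its locally tuned parameters, posts a non-blocking reduction after tuning, and only adopts the global winner once every rank has contributed. Ranks that never tune a given kernel never post the collective, so nobody hangs; the trade-off is that those kernels simply stay on their local parameters.

```cpp
#include <mpi.h>

struct PendingTuneSync {
  struct Payload { float time; int rank; } in{}, out{};
  MPI_Request request = MPI_REQUEST_NULL;
  bool resolved = false;
};

// Called once, right after a rank finishes tuning a kernel.
void post_tune_sync(PendingTuneSync &s, float elapsed_ms, MPI_Comm comm)
{
  MPI_Comm_rank(comm, &s.in.rank);
  s.in.time = elapsed_ms;
  // Non-blocking collective: returns immediately and only completes once all
  // ranks have posted their own MPI_Iallreduce for this kernel.
  MPI_Iallreduce(&s.in, &s.out, 1, MPI_FLOAT_INT, MPI_MINLOC, comm, &s.request);
}

// Called opportunistically (e.g. before each launch): adopt the global winner
// if the reduction has completed, otherwise keep the local configuration.
bool try_resolve_tune_sync(PendingTuneSync &s)
{
  if (s.resolved || s.request == MPI_REQUEST_NULL) return s.resolved;
  int done = 0;
  MPI_Test(&s.request, &done, MPI_STATUS_IGNORE);
  if (done) s.resolved = true;  // s.out.rank now names the fastest rank
  return s.resolved;
}
```

A scheme along these lines would still need a reconciliation step when the tune cache is written to disk, so that parameters tuned only on some ranks (cf. the tune.cpp FIXME above) are not silently dropped.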