Added support for batch normalization to compute aggregated mean and variance across multiple GPUs. This can be useful when the per-GPU batch size is small (e.g. <= 16 examples per GPU), where per-device statistics are too noisy. This behavior is off by default, so existing behavior is unchanged. Also added support for terminating training when the loss becomes NaN.
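The aggregation above amounts to combining each GPU's local (count, mean, variance) into global statistics via the law of total variance. The sketch below is illustrative only (the function name and tuple layout are assumptions, not the PR's actual API):

```python
def aggregate_stats(per_device):
    """Combine per-device batch-norm statistics into global ones.

    per_device: list of (count, mean, biased_variance) tuples,
    one per GPU. Returns (global_mean, global_variance).
    """
    total = sum(n for n, _, _ in per_device)
    # Global mean is the count-weighted average of per-device means.
    g_mean = sum(n * m for n, m, _ in per_device) / total
    # Recover per-device E[x^2] as v + m^2, average it, then
    # subtract the squared global mean (law of total variance).
    g_var = sum(n * (v + m * m) for n, m, v in per_device) / total - g_mean ** 2
    return g_mean, g_var
```

For example, two GPUs holding [1, 2, 3] and [4, 5] contribute (3, 2.0, 2/3) and (2, 4.5, 0.25); aggregating yields mean 3.0 and variance 2.0, exactly the statistics of the full batch [1, 2, 3, 4, 5].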