To improve loss surface observability, `extra_grad_stats=True` results in the calculation of `grad_norm_var` and `grad_similarity`.
`grad_similarity` is somewhat expensive - it takes the sign of the gradient of every trainable parameter (1 bit per parameter) so it can be compared with the sign of the previous step's gradient. This helps us determine whether learning is stable. This metric increases runtime by about 10%.
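For reference, here is a minimal sketch of the sign-packing and comparison, assuming PyTorch. The bodies of `_pack_bit_tensor()` and `_bit_tensor_sum()` below are illustrative guesses at the internals discussed next, not the actual implementation.

```python
import torch

# Lookup table mapping each byte value to its number of set bits.
_POPCOUNT = torch.tensor([bin(i).count("1") for i in range(256)], dtype=torch.uint8)

def _pack_bit_tensor(signs: torch.Tensor) -> torch.Tensor:
    """Pack a flat bool tensor of gradient signs into uint8, 8 signs per byte."""
    pad = (-signs.numel()) % 8
    if pad:
        signs = torch.cat([signs, signs.new_zeros(pad)])
    weights = torch.tensor([1, 2, 4, 8, 16, 32, 64, 128],
                           dtype=torch.uint8, device=signs.device)
    return (signs.reshape(-1, 8).to(torch.uint8) * weights).sum(dim=1, dtype=torch.uint8)

def _bit_tensor_sum(packed: torch.Tensor) -> torch.Tensor:
    """Count set bits in a packed uint8 tensor."""
    return _POPCOUNT.to(packed.device)[packed.long()].sum(dtype=torch.int64)

def grad_similarity(model: torch.nn.Module, prev_packed):
    """Fraction of trainable parameters whose gradient sign matches the previous step.

    Returns (similarity, packed); similarity is None on the first call.
    """
    signs = torch.cat([(p.grad >= 0).flatten()
                       for p in model.parameters()
                       if p.requires_grad and p.grad is not None])
    packed = _pack_bit_tensor(signs)
    if prev_packed is None:
        return None, packed
    # Padding bits are zero in both tensors, so they never count as flips.
    flips = _bit_tensor_sum(torch.bitwise_xor(packed, prev_packed))
    return 1.0 - flips.item() / signs.numel(), packed
```

The caller would hold on to the packed tensor from the previous step and pass it back in after the next `backward()`.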
The most expensive operation is `_pack_bit_tensor()`; `_bit_tensor_sum()` is already quite performant.
I'm not confident `_pack_bit_tensor()` can be made much faster. If not, we might consider tracking the gradient signs of only a subset of the parameters, as sketched below.
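If we go that route, one cheap option is to fix a random index subset once and only pack signs for those entries. A rough sketch, again assuming PyTorch; names like `make_sign_subset` and the sampling fraction are invented for illustration and not part of the existing `extra_grad_stats` code:

```python
import torch

def make_sign_subset(model: torch.nn.Module, fraction: float = 0.01, seed: int = 0) -> torch.Tensor:
    """Pick a fixed random subset of trainable-parameter indices to track."""
    n = sum(p.numel() for p in model.parameters() if p.requires_grad)
    k = max(1, int(n * fraction))
    gen = torch.Generator().manual_seed(seed)
    return torch.randperm(n, generator=gen)[:k].sort().values

def subset_grad_signs(model: torch.nn.Module, idx: torch.Tensor) -> torch.Tensor:
    """Gradient signs (bool) at the tracked indices only.

    Assumes every trainable parameter has a gradient, so the flattened layout
    matches the one used when the subset was drawn.
    """
    flat = torch.cat([p.grad.flatten() for p in model.parameters() if p.requires_grad])
    return flat[idx.to(flat.device)] >= 0
```

The resulting bool tensor could then go through the same packing and comparison path as above, shrinking the cost of `_pack_bit_tensor()` roughly in proportion to the sampling fraction.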