To improve loss surface observability, `extra_grad_stats=True` results in the calculation of `grad_norm_var` and `grad_similarity`.
`grad_similarity` is somewhat expensive - it takes the sign of the gradient of every trainable parameter (1 bit per parameter) so it can be compared with the sign of the previous step's gradient. This helps us determine whether learning is stable. This metric increases runtime by about 10%.
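For reference, here is a minimal sketch of the sign-packing and comparison, assuming PyTorch. The bodies of `_pack_bit_tensor()` and `_bit_tensor_sum()` below are illustrative guesses at the internals discussed next, not the actual implementation.

```python
import torch

# Lookup table mapping each byte value to its number of set bits.
_POPCOUNT = torch.tensor([bin(i).count("1") for i in range(256)], dtype=torch.uint8)

def _pack_bit_tensor(signs: torch.Tensor) -> torch.Tensor:
    """Pack a flat bool tensor of gradient signs into uint8, 8 signs per byte."""
    pad = (-signs.numel()) % 8
    if pad:
        signs = torch.cat([signs, signs.new_zeros(pad)])
    weights = torch.tensor([1, 2, 4, 8, 16, 32, 64, 128],
                           dtype=torch.uint8, device=signs.device)
    return (signs.reshape(-1, 8).to(torch.uint8) * weights).sum(dim=1, dtype=torch.uint8)

def _bit_tensor_sum(packed: torch.Tensor) -> torch.Tensor:
    """Count set bits in a packed uint8 tensor."""
    return _POPCOUNT.to(packed.device)[packed.long()].sum(dtype=torch.int64)

def grad_similarity(model: torch.nn.Module, prev_packed):
    """Fraction of trainable parameters whose gradient sign matches the previous step.

    Returns (similarity, packed); similarity is None on the first call.
    """
    signs = torch.cat([(p.grad >= 0).flatten()
                       for p in model.parameters()
                       if p.requires_grad and p.grad is not None])
    packed = _pack_bit_tensor(signs)
    if prev_packed is None:
        return None, packed
    # Padding bits are zero in both tensors, so they never count as flips.
    flips = _bit_tensor_sum(torch.bitwise_xor(packed, prev_packed))
    return 1.0 - flips.item() / signs.numel(), packed
```

The caller would hold on to the packed tensor from the previous step and pass it back in after the next `backward()`.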
The most expensive operation is `_pack_bit_tensor()`; `_bit_tensor_sum()` is already quite performant.
I'm not confident `_pack_bit_tensor()` can be made much faster. If not, we might consider tracking the gradient signs of only a subset of the parameters, as sketched below.
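If we go that route, one cheap option is to fix a random index subset once and only pack signs for those entries. A rough sketch, again assuming PyTorch; names like `make_sign_subset` and the sampling fraction are invented for illustration and not part of the existing `extra_grad_stats` code:

```python
import torch

def make_sign_subset(model: torch.nn.Module, fraction: float = 0.01, seed: int = 0) -> torch.Tensor:
    """Pick a fixed random subset of trainable-parameter indices to track."""
    n = sum(p.numel() for p in model.parameters() if p.requires_grad)
    k = max(1, int(n * fraction))
    gen = torch.Generator().manual_seed(seed)
    return torch.randperm(n, generator=gen)[:k].sort().values

def subset_grad_signs(model: torch.nn.Module, idx: torch.Tensor) -> torch.Tensor:
    """Gradient signs (bool) at the tracked indices only.

    Assumes every trainable parameter has a gradient, so the flattened layout
    matches the one used when the subset was drawn.
    """
    flat = torch.cat([p.grad.flatten() for p in model.parameters() if p.requires_grad])
    return flat[idx.to(flat.device)] >= 0
```

The resulting bool tensor could then go through the same packing and comparison path as above, shrinking the cost of `_pack_bit_tensor()` roughly in proportion to the sampling fraction.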