lapp0 / distily

Distily: Language Model Distillation Toolkit and Library
GNU Affero General Public License v3.0

Improve `grad_similarity` performance #3

Closed lapp0 closed 2 months ago

lapp0 commented 2 months ago

To improve loss surface observability, setting `extra_grad_stats=True` enables the calculation of `grad_norm_var` and `grad_similarity`.

`grad_similarity` is somewhat expensive: it takes the sign of the gradient of every parameter in the network (1 bit per trainable parameter) so that it can be compared with the sign of the previous step's gradient. This helps us determine whether learning is stable. Computing this metric increases runtime by about 10%.
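For context, a minimal sketch of the idea (not Distily's implementation; `grad_sign_similarity` and `prev_signs` are hypothetical names):

```python
import torch

def grad_sign_similarity(model, prev_signs):
    """Fraction of parameters whose gradient sign matches the previous step.

    `prev_signs` maps parameter name -> bool tensor from the last step
    (None on the first call). Returns (similarity, new_signs).
    """
    matches, total = 0, 0
    new_signs = {}
    for name, p in model.named_parameters():
        if p.grad is None:
            continue
        signs = p.grad >= 0  # 1 bit of information per parameter
        new_signs[name] = signs
        if prev_signs is not None:
            matches += (signs == prev_signs[name]).sum().item()
            total += signs.numel()
    similarity = matches / total if total else float("nan")
    return similarity, new_signs
```

Note that storing bool tensors like this costs a full byte per parameter; packing down to 1 bit per parameter, as the issue describes, is presumably what `_pack_bit_tensor()` handles.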

The most expensive operation is `_pack_bit_tensor()`; `_bit_tensor_sum()` is already quite performant.
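A rough guess at what the bit-packing technique looks like (the actual `_pack_bit_tensor()` / `_bit_tensor_sum()` implementations may differ; `pack_sign_bits`, `count_matching_signs`, and the lookup-table popcount here are assumptions):

```python
import torch

# 256-entry popcount lookup table for uint8 bytes
_POPCOUNT = torch.tensor([bin(i).count("1") for i in range(256)], dtype=torch.long)

def pack_sign_bits(grad: torch.Tensor) -> torch.Tensor:
    """Pack gradient signs into a uint8 tensor: 8 parameters per byte."""
    bits = (grad.flatten() >= 0).to(torch.uint8)
    pad = (-bits.numel()) % 8  # pad to a multiple of 8 bits
    if pad:
        bits = torch.cat([bits, bits.new_zeros(pad)])
    weights = torch.tensor([1, 2, 4, 8, 16, 32, 64, 128],
                           dtype=torch.uint8, device=bits.device)
    return (bits.view(-1, 8) * weights).sum(dim=1).to(torch.uint8)

def count_matching_signs(a: torch.Tensor, b: torch.Tensor, n_params: int) -> int:
    """Count bit positions where two packed sign tensors agree."""
    agree = ~(a ^ b)  # bit is 1 wherever the sign bits match
    matches = _POPCOUNT.to(agree.device)[agree.long()].sum().item()
    return matches - (a.numel() * 8 - n_params)  # drop the padding bits
```

Comparing packed tensors this way touches an eighth of the memory of a bool-tensor comparison, which would explain why the summing side is already fast while the packing side dominates.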

I'm not confident `_pack_bit_tensor()` can be made much faster. If not, we might consider tracking the gradient signs of a subset of the parameters, as sketched below.
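One possible shape for the subset approach (hypothetical helpers, not existing code): fix a random sample of indices per parameter up front, so the same elements are compared step to step.

```python
import torch

def make_sign_sampler(model, fraction=0.01, seed=0):
    """Pre-select a fixed random subset of each parameter's elements."""
    gen = torch.Generator().manual_seed(seed)
    index = {}
    for name, p in model.named_parameters():
        k = max(1, int(p.numel() * fraction))
        index[name] = torch.randperm(p.numel(), generator=gen)[:k]
    return index

def sampled_grad_signs(model, index):
    """Gather sign bits only at the pre-selected indices."""
    return {
        name: (p.grad.flatten()[index[name].to(p.grad.device)] >= 0)
        for name, p in model.named_parameters()
        if p.grad is not None
    }
```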

lapp0 commented 2 months ago

Overhead reduced to ~1%