When quantizing the (input) activations to the bit-linear layer, NaNs may occur due to division by zero. This is a consequence of the formula in the original paper:
$Quant(x) = Clip(x \times \frac{Q_b}{\|x\|_\infty}, -Q_b + \epsilon, Q_b - \epsilon)$
In the extreme case where all activations are zero, the abs-max $\|x\|_\infty$ is zero, and the scaling term becomes a division by zero.
To fix this, I add `1e-10f` to all abs-max values in the preset kernels. In 99.99% of cases this is a negligible (or no) change, but in the problematic cases it avoids `NaN`s.
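To make the failure mode and the fix concrete, here is a minimal C sketch of per-tensor abs-max activation quantization with the `1e-10f` guard. This is not the repository's actual kernel code: the function name, signature, and the `Qb`/`eps` values are illustrative assumptions.

```c
#include <math.h>
#include <stdint.h>

/* Sketch only: per-tensor absmax quantization of activations to int8,
 * following the paper's Clip(x * Qb / ||x||_inf, -Qb + eps, Qb - eps).
 * Names and constants are assumptions, not the preset kernels. */
static float quantize_activations(const float *x, int8_t *q, int n) {
    float absmax = 0.0f;
    for (int i = 0; i < n; i++) {
        const float a = fabsf(x[i]);
        if (a > absmax) absmax = a;
    }

    /* The fix: without this term, an all-zero input gives absmax == 0,
     * Qb / absmax becomes inf, and 0.0f * inf produces NaN. */
    absmax += 1e-10f;

    const float Qb  = 127.0f;   /* assumed 8-bit range */
    const float eps = 1e-5f;    /* assumed clipping epsilon */
    const float s   = Qb / absmax;

    for (int i = 0; i < n; i++) {
        float v = x[i] * s;
        if (v >  Qb - eps) v =  Qb - eps;   /* clip to (-Qb, Qb) */
        if (v < -Qb + eps) v = -Qb + eps;
        q[i] = (int8_t)roundf(v);
    }

    /* Dequantization scale: x is approximately q * (absmax / Qb). */
    return absmax / Qb;
}
```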