Turns out that by estimating the weight ranges and overloading the weight buffer with a mini header used for dequantization (two f32s passed onwards), we can do much better than simply down-casting and re-casting. On reasonable offline data, the approximation quality appears to be at noise level.
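A minimal sketch of the idea, assuming the two header f32s are the per-tensor minimum and scale (the PR does not specify which two values are stored, so this pairing is an assumption; the function names are illustrative, not the PR's):

```rust
// Hedged sketch: per-tensor affine u8 quantization with a two-f32
// "mini header" (min and scale) prepended to the byte buffer, so the
// dequantizer needs no external metadata. Names are hypothetical.
fn quantize(weights: &[f32]) -> Vec<u8> {
    let min = weights.iter().cloned().fold(f32::INFINITY, f32::min);
    let max = weights.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    // guard against a zero range (all weights equal)
    let scale = ((max - min) / 255.0).max(f32::EPSILON);
    let mut buf = Vec::with_capacity(8 + weights.len());
    // mini header: the two f32s consumed later by the dequantizer
    buf.extend_from_slice(&min.to_le_bytes());
    buf.extend_from_slice(&scale.to_le_bytes());
    for &w in weights {
        buf.push(((w - min) / scale).round() as u8);
    }
    buf
}

fn dequantize(buf: &[u8]) -> Vec<f32> {
    let min = f32::from_le_bytes(buf[0..4].try_into().unwrap());
    let scale = f32::from_le_bytes(buf[4..8].try_into().unwrap());
    buf[8..].iter().map(|&q| min + q as f32 * scale).collect()
}
```

With this scheme the round-trip error per weight is bounded by half the quantization step, which is what makes the result indistinguishable from noise on well-behaved weight distributions.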
This immediately surfaced an interesting side result that's not part of this PR: it seems we might be able to pull this off with just one byte per weight too, with only a minor decay in quality (nothing too drastic).
Usage is via an additional, optional (hence backwards-compatible) parameter, --quantize_weights, which is passed during the weight conversion phase (pre-inference).
Also added a check prior to each quantization that panics if the weight distribution is too skewed (which can happen with corrupted models). This wasn't validated before, but should be, since it's a cheap pre-prod sanity check.
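One way such a check could look, as a sketch only: compute the sample skewness of the weights and panic past a threshold. The statistic choice and the threshold are assumptions for illustration, not the PR's actual criterion:

```rust
// Hedged sketch of a pre-quantization sanity check: panic if the
// weight distribution's sample skewness exceeds a threshold, which can
// indicate a corrupted model. Threshold and name are illustrative.
fn check_skew(weights: &[f32], max_abs_skew: f64) {
    let n = weights.len() as f64;
    let mean = weights.iter().map(|&w| w as f64).sum::<f64>() / n;
    let var = weights
        .iter()
        .map(|&w| (w as f64 - mean).powi(2))
        .sum::<f64>()
        / n;
    let m3 = weights
        .iter()
        .map(|&w| (w as f64 - mean).powi(3))
        .sum::<f64>()
        / n;
    // third standardized moment; NaN (degenerate all-equal case) also panics
    let skew = m3 / var.powf(1.5);
    assert!(
        skew.abs() <= max_abs_skew,
        "weight distribution too skewed ({skew:.2}); model may be corrupted"
    );
}
```

Running it before quantization keeps the cost to a single pass over the weights, which is cheap relative to the conversion itself.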