Capturing a stable state from iterating with the quantization simulator.
Primary changes:
Drastically simplified the way that linear and conv layers are configured: the primary knob is now the presence of a q_input quantizer. If present, it is used to quantize the activations before they are fed to the underlying op.
Added a qdq_input tensor to linear and conv, which can be used to simulate quantization for testing/comparison/goldens/etc. (both knobs are sketched in the first example after this list).
Implemented bias quantization (disabled for now while tracking down a numeric instability).
Added an option to the brevitas importer to bootstrap from a base safetensors file, using the primary file only for quantized layers. It is not clear this is meaningful long term, but it aided some debugging (sketched below).
Added an optimized quantized linear op override. It is used for TensorScaled inputs and accepts either an FP or a TensorScaled bias. For the former, the bias add is done in FP after dequant; for the latter, it is done prior to dequant, and the bias is responsible for providing the output scale (see the bias-handling sketch below).
Added a dedicated linear op.
Fixed saturate_cast to accept an option that disables saturation (for int32 and the like), sketched below.
Reworked StaticScaledQuantizer to infer the quantization axis when the scales/zero points arrive pre-broadcast (sketched below).
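
To make the new configuration concrete, here is a minimal sketch of the q_input / qdq_input dispatch. The Quantizer and LinearLayer classes are illustrative stand-ins, not the actual sharktank API, and the real q_input path hands the quantized tensor to a quantized op override rather than dequantizing inline:

```python
from typing import Optional

import torch


class Quantizer:
    """Stand-in symmetric per-tensor quantizer; not the real class."""

    def __init__(self, scale: float, dtype: torch.dtype = torch.int8):
        self.scale = scale
        self.dtype = dtype

    def quantize(self, x: torch.Tensor) -> torch.Tensor:
        info = torch.iinfo(self.dtype)
        return torch.clamp(torch.round(x / self.scale), info.min, info.max).to(self.dtype)

    def dequantize(self, q: torch.Tensor) -> torch.Tensor:
        return q.to(torch.float32) * self.scale


class LinearLayer(torch.nn.Module):
    def __init__(
        self,
        weight: torch.Tensor,
        bias: Optional[torch.Tensor] = None,
        q_input: Optional[Quantizer] = None,
        qdq_input: Optional[Quantizer] = None,
    ):
        super().__init__()
        self.weight = weight
        self.bias = bias
        self.q_input = q_input      # presence of this is the primary knob
        self.qdq_input = qdq_input  # fake-quant path for testing/goldens

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.q_input is not None:
            # Quantize activations before feeding the underlying op. The
            # real path passes the quantized tensor to a quantized op
            # override; we dequantize here only to keep the sketch runnable.
            x = self.q_input.dequantize(self.q_input.quantize(x))
        elif self.qdq_input is not None:
            # Simulation: a quantize-dequantize round trip stays in FP but
            # injects the quantization error, for testing/compare/goldens.
            x = self.qdq_input.dequantize(self.qdq_input.quantize(x))
        return torch.nn.functional.linear(x, self.weight, self.bias)
```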
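
The base-file bootstrap amounts to overlaying the primary (quantized) file on top of the base checkpoint. A hypothetical sketch using the safetensors package; load_with_base and the two-path interface are assumptions, not the importer's actual option:

```python
from safetensors.torch import load_file


def load_with_base(primary_path: str, base_path: str) -> dict:
    """Start from the full FP base checkpoint, then overlay the primary
    (quantized) file so its tensors take precedence for quantized layers."""
    tensors = load_file(base_path)           # unquantized baseline
    tensors.update(load_file(primary_path))  # quantized layers win
    return tensors
```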
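
The bias handling in the optimized override can be sketched as follows. TensorScaled here is a minimal stand-in for a per-tensor scaled quantized tensor, and quantized_linear is an assumed name, not the actual op signature:

```python
import math
from dataclasses import dataclass
from typing import Optional, Union

import torch


@dataclass
class TensorScaled:
    """Stand-in for a per-tensor scaled quantized tensor: real ~= q * scale."""
    q: torch.Tensor  # integer payload
    scale: float     # one scale for the whole tensor


def quantized_linear(
    x: TensorScaled,
    w: TensorScaled,
    bias: Optional[Union[torch.Tensor, TensorScaled]] = None,
) -> torch.Tensor:
    # Accumulate in int32; the accumulator's implicit scale is sx * sw.
    acc = torch.matmul(x.q.to(torch.int32), w.q.to(torch.int32).T)
    acc_scale = x.scale * w.scale

    if isinstance(bias, TensorScaled):
        # TensorScaled bias: added prior to dequant. Its scale must match
        # the accumulator scale, and it provides the output scale.
        assert math.isclose(bias.scale, acc_scale, rel_tol=1e-6)
        acc = acc + bias.q.to(torch.int32)
        return acc.to(torch.float32) * bias.scale

    # FP (or absent) bias: dequantize first, then add in floating point.
    y = acc.to(torch.float32) * acc_scale
    if bias is not None:
        y = y + bias
    return y
```

Keeping the TensorScaled bias at the accumulator scale lets the add happen in integer arithmetic and makes the bias the single source of truth for the output scale.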
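
The saturate_cast change reduces to making the clamp optional. A sketch under assumed naming (the real signature may differ):

```python
import torch


def saturate_cast(t: torch.Tensor, dtype: torch.dtype,
                  saturate: bool = True) -> torch.Tensor:
    """Cast to an integer dtype, clamping to its range unless disabled.

    Disabling the clamp makes sense for wide dtypes like int32, where the
    values are already known to fit (e.g. matmul accumulators).
    """
    if saturate:
        info = torch.iinfo(dtype)
        t = t.clamp(info.min, info.max)
    return t.to(dtype)
```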
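
Inferring the axis from pre-broadcast scales: a per-axis scale/zero point that arrives already expanded to the tensor's rank has exactly one non-unit dimension, which identifies the axis. A sketch with an assumed helper name:

```python
from typing import Optional

import torch


def infer_axis(scale: torch.Tensor) -> Optional[int]:
    """Recover the quantization axis from a pre-broadcast scale/zero point.

    E.g. a scale of shape [1, 128, 1, 1] implies axis=1; a scalar or
    all-ones shape implies per-tensor quantization (no axis).
    """
    non_unit = [i for i, d in enumerate(scale.shape) if d != 1]
    if not non_unit:
        return None  # per-tensor
    if len(non_unit) > 1:
        raise ValueError(f"ambiguous pre-broadcast shape {tuple(scale.shape)}")
    return non_unit[0]


# Example: infer_axis(torch.ones(1, 128, 1, 1)) -> 1
```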