I've recently read this TF-Lite blog post about faster inference with XNNPACK, and I'd like to try it out in ExecuTorch.
I can see that qd8-f16 (and qd8-f32) kernels are compiled when I build XNNPACK, but I'm not quite sure how to quantize my model so that it targets them.
My guess is that I need a QuantizationSpec for the activations that specifies the dtype. Is it enough to simply create a QuantizationSpec(dtype=torch.float16) and otherwise reuse the code from get_symmetric_quantization_config (torch.ao.quantization.quantizer.xnnpack_quantizer) with is_dynamic=True?
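
Concretely, this is roughly what I have in mind (just a sketch: the float16 activation spec is exactly the part I'm unsure about, and I'm assuming the torch.ao import paths below, which may differ between releases):

```python
import torch
from torch.ao.quantization.observer import PlaceholderObserver
from torch.ao.quantization.quantizer import QuantizationSpec
from torch.ao.quantization.quantizer.xnnpack_quantizer import (
    XNNPACKQuantizer,
    get_symmetric_quantization_config,
)
from torch.ao.quantization.quantizer.xnnpack_quantizer_utils import QuantizationConfig

# Start from the stock dynamic (int8) config ...
base = get_symmetric_quantization_config(is_per_channel=True, is_dynamic=True)

# ... and swap in a float16 activation spec. This is the part I'm guessing at:
# is dtype=torch.float16 here what would select the qd8-f16 kernels?
fp16_act_spec = QuantizationSpec(
    dtype=torch.float16,
    is_dynamic=True,
    observer_or_fake_quant_ctr=PlaceholderObserver,
)

my_config = QuantizationConfig(
    input_activation=fp16_act_spec,
    output_activation=base.output_activation,
    weight=base.weight,
    bias=base.bias,
    is_qat=False,
)

quantizer = XNNPACKQuantizer().set_global(my_config)
```

The idea would then be to pass this quantizer through prepare_pt2e/convert_pt2e as usual before lowering to the XNNPACK backend. Is that the right approach, or is there some other knob for the f16 variants?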