pytorch / executorch

On-device AI across mobile, embedded and edge for PyTorch
https://pytorch.org/executorch/

qd8-f16 quantization in xnnpack #6510

Open · mads-oestergaard opened this issue 1 month ago

mads-oestergaard commented 1 month ago

I recently read a TF-Lite blog post about faster inference with XNNPACK, and I would like to try it out in ExecuTorch.

I can see that qd8-f16 (and qd8-f32) kernels are built when I compile XNNPACK, but I'm not quite sure how to quantize my model so that they are actually used.

I assume I would need a QuantizationSpec for the activations that specifies the dtype. Is it enough to simply create a QuantizationSpec(dtype=torch.float16) and otherwise reuse the code from get_symmetric_quantization_config (torch.ao....xnnpack_quantizer) with is_dynamic=True? Something like the sketch below is what I have in mind.
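For reference, this is roughly the flow I'm thinking of. It is an untested sketch: the dynamic int8 config is what I understand maps to the qd8 kernels today, and whether a float16 activation spec on top of this is enough to reach the qd8-f16 kernels is exactly my question. API names may differ slightly between versions.

```python
# Rough sketch (untested): standard dynamic int8 quantization with the
# XNNPACKQuantizer, which I understand lowers to the qd8 XNNPACK kernels.
# The open question is how to swap the compute/output dtype to f16.
import torch
from torch.ao.quantization.quantize_pt2e import convert_pt2e, prepare_pt2e
from torch.ao.quantization.quantizer.xnnpack_quantizer import (
    XNNPACKQuantizer,
    get_symmetric_quantization_config,
)


class TinyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(64, 64)

    def forward(self, x):
        return self.linear(x)


model = TinyModel().eval()
example_inputs = (torch.randn(1, 64),)

# Per-channel int8 weights + dynamically quantized int8 activations.
quantizer = XNNPACKQuantizer()
quantizer.set_global(
    get_symmetric_quantization_config(is_per_channel=True, is_dynamic=True)
)

# Capture and quantize (capture API may differ depending on the PyTorch version).
captured = torch.export.export_for_training(model, example_inputs).module()
prepared = prepare_pt2e(captured, quantizer)
prepared(*example_inputs)  # run once; effectively a no-op for dynamic quantization
quantized = convert_pt2e(prepared)
```

From here the quantized module would go through the usual to_edge / XnnpackPartitioner lowering, but the part I'm unsure about is only the quantization config above.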

JacobSzwejbka commented 3 weeks ago

@mcr229 Can you take a look?