Hello,

I am a Ph.D. student currently working on efficient deep learning inference and have been reviewing the quantization strategies in Optimum Quanto, specifically in the context of the generation benchmark, and I have a few questions to confirm the implementation details:
Are weights always quantized per-channel (e.g., along the first dimension of a layer's weight tensor)?
Are activations quantized per-tensor, applying a single scale across the entire tensor?
Are these settings consistent with the benchmarks mentioned above, or are there exceptions or additional considerations (e.g., support for other granularity levels)? (See the sketch below for what I mean by these two granularities.)
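Here is a minimal plain-PyTorch sketch of my understanding: symmetric int8, per-channel along the first weight dimension for weights, and a single scale for activations. This is just my own illustration of the question, not Quanto's actual internals:

```python
import torch

# Toy linear weight of shape [out_features, in_features].
weight = torch.randn(64, 128)

# Per-channel (first dimension) symmetric int8 quantization:
# one scale per output channel, computed from that row's max magnitude.
w_scale = weight.abs().amax(dim=1, keepdim=True) / 127.0
w_q = torch.clamp(torch.round(weight / w_scale), -128, 127).to(torch.int8)

# Per-tensor symmetric int8 quantization of an activation:
# a single scale shared by every element of the tensor.
act = torch.randn(8, 128)
a_scale = act.abs().amax() / 127.0
a_q = torch.clamp(torch.round(act / a_scale), -128, 127).to(torch.int8)

print(w_scale.shape)  # torch.Size([64, 1]) -> one scale per output channel
print(a_scale.shape)  # torch.Size([])      -> one scale for the whole tensor
```

If the actual axis, symmetry, or bit-width used in the benchmark differs from this, that is exactly the kind of detail I am hoping to confirm.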
Additionally, I’d like to clarify how static linear quantization is applied within Optimum Quanto (in the context of the generation benchmark):
Does it statically determine the scales and zero-points for weights and activations during calibration?
Are there any dynamic adjustments post-calibration, or does the quantization remain static throughout inference? (My current understanding of the workflow is sketched below.)
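Concretely, this is the static calibration flow I have in mind, based on the top-level API described in the README (quantize / Calibration / freeze with qint8); the model id and calibration prompts are placeholders of mine, and the benchmark may drive this differently:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from optimum.quanto import Calibration, freeze, qint8, quantize

# Placeholder model; the benchmark presumably uses its own checkpoints.
model_id = "facebook/opt-125m"
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Mark weights and activations for int8 quantization.
quantize(model, weights=qint8, activations=qint8)

# Calibration pass: run a few representative prompts so that activation
# scales can be observed before inference.
calibration_texts = ["Hello, my name is", "The capital of France is"]
with torch.no_grad(), Calibration():
    for text in calibration_texts:
        inputs = tokenizer(text, return_tensors="pt")
        model(**inputs)

# Freeze the (now calibrated) quantized weights into their integer form.
freeze(model)
```

My question is essentially whether, after freeze(model), the activation scales stay fixed for the rest of generation, or whether anything is still adjusted on the fly during inference.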
Thank you for your work on this project! I appreciate any insights you can provide to better understand these implementations.