huggingface / optimum-quanto

A pytorch quantization backend for optimum
Apache License 2.0

Clarification on Per-Channel vs. Per-Tensor Quantization for Weights and Activations #356

Open kirkdort44 opened 4 days ago

kirkdort44 commented 4 days ago

Hello,

I am a Ph.D. student working on efficient deep learning inference and have been reviewing the quantization strategies in Optimum Quanto, specifically in the context of the generation benchmark. I have a few questions to confirm the implementation details:
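To make the terminology concrete, here is a minimal sketch of the distinction I have in mind, in plain Python. It is not taken from Quanto's code: the function names are hypothetical, and symmetric int8 with absmax scales and one scale per row are assumptions made only for illustration.

```python
# Illustrative only: symmetric int8 quantization of a 2-D weight matrix.
# None of these names come from Quanto; absmax scaling is an assumption.

def absmax_scale(values, qmax=127):
    """Scale that maps the largest magnitude in `values` to qmax."""
    peak = max(abs(v) for v in values)
    return peak / qmax if peak > 0 else 1.0

def quantize(values, scale, qmax=127):
    """Round to the nearest integer step and clamp to [-qmax, qmax]."""
    return [max(-qmax, min(qmax, round(v / scale))) for v in values]

def quantize_per_tensor(weight):
    """Per-tensor: one scale shared by every element of the matrix."""
    scale = absmax_scale([v for row in weight for v in row])
    return [quantize(row, scale) for row in weight], scale

def quantize_per_channel(weight):
    """Per-channel: one scale per output channel (assumed here to be a row)."""
    scales = [absmax_scale(row) for row in weight]
    return [quantize(row, s) for row, s in zip(weight, scales)], scales

# The second row is much smaller in magnitude than the first.
weight = [[0.4, -1.0], [0.005, 0.02]]
q_tensor, scale = quantize_per_tensor(weight)
q_channel, scales = quantize_per_channel(weight)
```

In this toy example the small second row collapses to `[1, 3]` under the shared per-tensor scale, while the per-channel variant maps it to `[32, 127]` and so uses the full int8 range. My questions are about which of these two schemes applies to weights and to activations.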

Additionally, I’d like to clarify how static linear quantization is applied within Optimum Quanto (in the context of the generation benchmark):

Thank you for your work on this project! I appreciate any insights you can provide to better understand these implementations.