Closed: vinx13 closed this 7 months ago
I don't have concerns about the embedding; without dequant it can actually be slightly faster. We can merge and try this if the deployment timeline allows.
In that case, let me pick this and use it for the deployment trial tomorrow. Thank you @vinx13.
This PR adds `PerTensorQuantization`, which supports quantizing both weights and activations using a single scale. Both `e4m3_e4m3` and `e4m3_e5m2` are migrated to the new quantization mode. Quantization of the embedding layer is disabled for now because of some issues scheduling the `dequantize-take` kernel. We can clean up `GroupQuantize` after this flow has been validated.
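For reference, here is a minimal sketch of the per-tensor scheme as described above: one scale shared by the whole tensor, with values clamped to the representable e4m3 range (max finite value 448). This is an illustration only, not the PR's actual kernels; the function names are hypothetical and FP8 storage is simulated with float32.

```python
import numpy as np

E4M3_MAX = 448.0  # largest finite value representable in float8 e4m3

def per_tensor_quantize(x: np.ndarray, scale: float) -> np.ndarray:
    # Divide by the single per-tensor scale, then clamp to the e4m3 range.
    # A real implementation would cast the result to a float8 storage dtype.
    return np.clip(x / scale, -E4M3_MAX, E4M3_MAX)

def per_tensor_dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    # Dequantization is a single multiply by the same scale.
    return q * scale

# Calibrate the scale from the tensor's max magnitude so all values fit in range.
w = np.random.randn(64, 64).astype(np.float32)
scale = float(np.max(np.abs(w))) / E4M3_MAX
w_q = per_tensor_quantize(w, scale)
w_dq = per_tensor_dequantize(w_q, scale)
```

Because the same single scale applies to the whole tensor (rather than one scale per group, as in `GroupQuantize`), both the weight and the activation side of a matmul can share this scheme.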