Closed: vinx13 closed this 7 months ago
I don't have concerns about the embedding; without dequant it can actually be slightly faster. We can merge and try this if the deployment timeline allows.
In that case, let me pick this and use it for the deployment trial tomorrow. Thank you @vinx13.
This PR adds `PerTensorQuantization`, which supports quantizing both weights and activations using a single scale. Both `e4m3_e4m3` and `e4m3_e5m2` are migrated to the new quantization mode. Quantization of the embedding layer is disabled for now because of some issues scheduling the `dequantize-take` kernel. We can clean up `GroupQuantize` after this flow has been validated.
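For reference, here is a minimal sketch of the per-tensor scheme as described above: one scale shared by the whole tensor, with values clamped to the representable e4m3 range (max finite value 448). This is an illustration only, not the PR's actual kernels; the function names are hypothetical and FP8 storage is simulated with float32.

```python
import numpy as np

E4M3_MAX = 448.0  # largest finite value representable in float8 e4m3

def per_tensor_quantize(x: np.ndarray, scale: float) -> np.ndarray:
    # Divide by the single per-tensor scale, then clamp to the e4m3 range.
    # A real implementation would cast the result to a float8 storage dtype.
    return np.clip(x / scale, -E4M3_MAX, E4M3_MAX)

def per_tensor_dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    # Dequantization is a single multiply by the same scale.
    return q * scale

# Calibrate the scale from the tensor's max magnitude so all values fit in range.
w = np.random.randn(64, 64).astype(np.float32)
scale = float(np.max(np.abs(w))) / E4M3_MAX
w_q = per_tensor_quantize(w, scale)
w_dq = per_tensor_dequantize(w_q, scale)
```

Because the same single scale applies to the whole tensor (rather than one scale per group, as in `GroupQuantize`), both the weight and the activation side of a matmul can share this scheme.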