pytorch / ao

PyTorch native quantization and sparsity for training and inference

AO dtype composability tracker #844

Open msaroufim opened 2 months ago

msaroufim commented 2 months ago

As we onboard more dtypes, we ideally want them to work in as many situations as possible, so I'm opening this tracker and will update the table as things change. If I should add more columns or rows, or if there are any cells you disagree with, please let me know!

The columns can also compose with each other, but to be explicit:

  1. Training with FSDP2 should compose with low-bit optimizers
  2. Inference quantization and KV cache quantization should compose

And sparsity, IIUC, only works with int8 inference quantization right now.

| Dtype | Training with FSDP2 | Inference | Optimizer | QAT | KV cache | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| Int8 | Experimental | Yes | Yes | LUT based | Yes | |
| Int4 | No | Yes | Yes | LUT based | No | |
| Fp8 | Yes | Yes | Yes | Not needed | No | |
| NF4 | Yes | Experimental | No | In progress | No | Does not use quantize api |
| fp6 | No | Yes | No | No | No | |
| UintX/Fpx | In progress | Yes | No | No | No | Still requires more performance work |
| MX: fp8/6/4 with scales | Emulation only | Emulation only | No | Not needed because we can compute in this dtype | No | Pending release of B100 GPUs for acceleration |
| Autoquant | N/A | Yes | N/A | N/A | N/A | Supports int8/4. Fp8 coming next |

TODO

gau-nernst commented 2 months ago

Small correction: the 8-bit and 4-bit optimizers are not exactly INT8 and INT4. They use LUT-based quantization, where the LUT values are defined by Tim Dettmers' "dynamic tree quantization" scheme. (To be even more specific, the 2nd buffer of the INT4 optimizer actually uses affine quantization.)
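
To make the LUT vs. affine distinction concrete, here is a rough pure-Python sketch of the two schemes. It is not torchao's implementation: the function names are made up for illustration, and the 8-entry `lut` is a hypothetical non-uniform table, not the actual dynamic tree quantization values.

```python
# Illustrative sketch only. The LUT values are hypothetical and are NOT
# the dynamic tree quantization table; function names are not torchao APIs.

def lut_quantize(values, lut):
    """Store, for each value, the index of the nearest LUT entry."""
    return [min(range(len(lut)), key=lambda i: abs(lut[i] - v)) for v in values]

def lut_dequantize(codes, lut):
    """Recover approximate values by looking the codes up in the table."""
    return [lut[c] for c in codes]

def affine_quantize(values, num_bits=4):
    """Unsigned affine quantization: q = clamp(round(x / scale) + zero_point)."""
    qmax = 2 ** num_bits - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / qmax
    if scale == 0:
        scale = 1.0  # all values identical; any positive scale works
    zero_point = round(-lo / scale)
    codes = [max(0, min(qmax, round(v / scale) + zero_point)) for v in values]
    return codes, scale, zero_point

def affine_dequantize(codes, scale, zero_point):
    """Invert the affine map: x ~= (q - zero_point) * scale."""
    return [(c - zero_point) * scale for c in codes]

# Hypothetical non-uniform table: denser near zero, coarser near +/-1.
lut = [-1.0, -0.5, -0.1, -0.01, 0.01, 0.1, 0.5, 1.0]
values = [-1.0, -0.1, 0.02, 0.5, 1.0]

lut_codes = lut_quantize(values, lut)
lut_recon = lut_dequantize(lut_codes, lut)

aff_codes, aff_scale, aff_zero_point = affine_quantize(values, num_bits=4)
aff_recon = affine_dequantize(aff_codes, aff_scale, aff_zero_point)
```

The point of the sketch is that LUT codes are indices into an arbitrary (here non-uniform) set of levels, whereas affine codes sit on an evenly spaced grid defined by a scale and zero point, so the stored integers are not interchangeable with plain INT8/INT4 values.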