huggingface / optimum-quanto

A pytorch quantization backend for optimum
Apache License 2.0

Feature Request/Int4 Cuda Kernels #142

Closed NicolasMejiaPetit closed 5 months ago

NicolasMejiaPetit commented 7 months ago

Feature request

There is a GitHub repo out with the necessary kernels and code (and a great paper) to train transformer-based models using int4.

The authors use a couple of algorithms to get around the difficulty of quantizing down to int4, including keeping non-linear operators in fp16 to avoid certain quantization issues. To deal with outliers, they "propose a Hadamard quantizer (HQ) to solve the outlier problem. Its main idea is to quantize the matrices in another linear space which has fewer outliers." The results they achieved: "We compare the training throughput of the FP16 PyTorch AMP and our INT4 training algorithm for training BERT [24] and GPT [37]-style language models on a system of 8 Nvidia A100 GPUs. We vary the hidden layer size, intermediate fully-connected layer size, and batch size, and plot the speedup of INT4 training in Fig. 5. Our INT4 training algorithm can achieve up to 35.1% speedup for BERT-style models and up to 26.5% speedup for GPT-style models."
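For anyone skimming, a minimal sketch of the Hadamard-quantizer idea (rotate by an orthonormal Hadamard matrix so outliers get spread across entries, then quantize) could look like this in plain PyTorch. The function names and the simple per-row symmetric scheme are my own illustration of the concept, not the paper's actual CUDA kernels:

```python
import torch

def hadamard_matrix(n: int) -> torch.Tensor:
    """Sylvester construction of an n x n Hadamard matrix (n must be a power of two)."""
    H = torch.ones(1, 1)
    while H.shape[0] < n:
        H = torch.cat([torch.cat([H, H], dim=1),
                       torch.cat([H, -H], dim=1)], dim=0)
    return H / H.shape[0] ** 0.5  # scale so that H @ H.T == I (orthonormal)

def hq_int4_quantize(x: torch.Tensor):
    """Quantize x in the Hadamard-rotated space: y = x @ H, then symmetric int4."""
    H = hadamard_matrix(x.shape[-1]).to(x.dtype)       # assumes last dim is a power of two
    y = x @ H                                           # rotation spreads outliers out
    scale = y.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 7.0  # int4 range [-7, 7]
    q = torch.clamp((y / scale).round(), -8, 7).to(torch.int8)
    return q, scale, H

def hq_int4_dequantize(q, scale, H):
    """Undo the rotation: x ≈ (q * scale) @ H.T, since H is orthonormal."""
    return (q.to(scale.dtype) * scale) @ H.T
```

Because the scaled Hadamard matrix is orthonormal the rotation is exactly invertible, so the only error introduced is the int4 rounding itself, which is the whole point of doing the quantization in that space.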

This code and paper are for full fine-tuning (FFT), but the same concept could apply directly to LoRA and QLoRA.

Links: Paper Code. Either way, these int4 CUDA kernels could help with what you mention in this repo's README: "The current implementation however falls back to float32 operations for a lot of operations because of a lack of dedicated kernels (only int8 matrix multiplication is available)." (Actually, checking through their code, it uses a mix of int32 and int4 throughout the kernels.)

I'd love to see the day we can get the performance numbers advertised by Nvidia for the 3090: int4 reaching over 1 peta-op/s across two 3090s using their native int4 support (too bad they never gave us any kernels for it, maybe so they could drop fp8 and instantly "double" performance). Source

Allegedly NVIDIA's CUTLASS can utilize int4; here is the blog post claiming a "59% increase on Titan RTX thanks to Turing architecture's new int4 precision", and the only related code I could find, which funnily enough packs the int4 values using int32.

PyTorch also has a feature request for this that has been open for the past couple of years, and some devs are working in the torchao repo to add int4 and nf4 as dtypes, though this time packed from int8, which is definitely better performance-wise than packing from int32 or fp16.
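For reference, the kind of packing mentioned above (two signed int4 values per int8 byte) can be sketched in a few lines of PyTorch. `pack_int4` / `unpack_int4` are just illustrative names here, not torchao's actual API:

```python
import torch

def pack_int4(q: torch.Tensor) -> torch.Tensor:
    """Pack pairs of int4 values (held in int8, range [-8, 7]) into single uint8 bytes."""
    assert q.shape[-1] % 2 == 0
    nibbles = (q & 0x0F).to(torch.uint8)        # keep the low 4 bits (two's complement)
    lo, hi = nibbles[..., 0::2], nibbles[..., 1::2]
    return (hi << 4) | lo                       # even element -> low nibble, odd -> high

def unpack_int4(packed: torch.Tensor) -> torch.Tensor:
    """Recover the signed int4 values from packed bytes."""
    lo = (packed & 0x0F).to(torch.int8)
    hi = (packed >> 4).to(torch.int8)
    q = torch.stack([lo, hi], dim=-1).flatten(-2)
    # sign-extend the 4-bit two's-complement nibbles back to signed values
    return torch.where(q >= 8, q - 16, q)
```

The upside of packing from int8 rather than int32 or fp16 is that the unpack step touches far less memory per element, which matters once the matmul kernel itself is the cheap part.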

This is essentially an accumulation of all my research into int4 for transformers. I hope some of it helps with getting an int4 dtype, both selfishly and unselfishly.

github-actions[bot] commented 5 months ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions[bot] commented 5 months ago

This issue was closed because it has been stalled for 5 days with no activity.