Open NicolasMejiaPetit opened 7 months ago
Thanks for the detailed feature request! Might be better for the nanotron library. Just FYI @younesbelkada
Awesome, thank you! I'll share the code with the folks over at the quanto repo and the nanotron repo.
Feature request
There is a GitHub repo out with the necessary kernels and code (and a great paper) to train transformer-based models using int4.
The authors use a couple of algorithms to get around the difficulty of quantizing down to int4, including keeping non-linear operators in fp16 to avoid certain quantization issues. For the outlier problem, they "propose a Hadamard quantizer (HQ) to solve the outlier problem. Its main idea is to quantize the matrices in another linear space which has fewer outliers."

The results they achieved: "We compare the training throughput of the FP16 PyTorch AMP and our INT4 training algorithm for training BERT [24] and GPT [37]-style language models on a system of 8 Nvidia A100 GPUs. We vary the hidden layer size, intermediate fully-connected layer size, and batch size, and plot the speedup of INT4 training in Fig. 5. Our INT4 training algorithm can achieve up to 35.1% speedup for BERT-style models and up to 26.5% speedup for GPT-style models."
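To make the Hadamard-quantizer idea concrete, here is a rough Python/PyTorch sketch of quantizing a matrix to int4 in a Hadamard-rotated space. This is not the authors' code: the function names are made up, it uses a single per-tensor scale and an unpacked int8 container for the 4-bit values, and the real implementation relies on custom CUDA kernels and per-group/learned step sizes.

```python
import torch

def hadamard_matrix(n: int, device=None, dtype=torch.float16) -> torch.Tensor:
    """Orthonormal n x n Hadamard matrix via Sylvester construction (n must be a power of 2)."""
    H = torch.ones(1, 1, device=device, dtype=dtype)
    while H.shape[0] < n:
        H = torch.cat([torch.cat([H, H], dim=1),
                       torch.cat([H, -H], dim=1)], dim=0)
    return H / (n ** 0.5)  # normalized so that H @ H.T = I

def hadamard_int4_quantize(x: torch.Tensor):
    """Quantize x to 4-bit integers in the Hadamard-rotated space.

    Rotating by H spreads outlier entries across all coordinates, so a single
    scale wastes fewer of the 16 available int4 levels on a handful of outliers.
    """
    n = x.shape[-1]
    H = hadamard_matrix(n, device=x.device, dtype=x.dtype)
    x_rot = x @ H                           # rotate into the Hadamard space
    scale = x_rot.abs().max() / 7           # symmetric int4 range is [-8, 7]
    q = torch.clamp(torch.round(x_rot / scale), -8, 7).to(torch.int8)
    return q, scale

def hadamard_int4_dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Undo the quantization and rotate back with H^T (H is orthonormal)."""
    n = q.shape[-1]
    H = hadamard_matrix(n, device=q.device)
    return (q.to(H.dtype) * scale) @ H.T
```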
These results are without Flash Attention, which would increase the gains further, and you could combine this with the GaLore 8-bit optimizer, or better yet DeepSpeed's 1-bit Adam optimizer fully offloaded to the CPU, for optimized full fine-tuning of large 7B models on consumer hardware.
This code and paper are for full fine-tuning (FFT), but the same concept could apply directly to LoRA and QLoRA; a rough sketch of that idea is below.
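A speculative sketch of how this could look for LoRA: a frozen int4-quantized base weight with small trainable fp16 adapters on top. The class and argument names are hypothetical; the int4 values are stored unpacked in int8, and the base path dequantizes for clarity where a real kernel would fuse dequantization into the matmul.

```python
import torch
import torch.nn as nn

class LoRAOverInt4Linear(nn.Module):
    """Hypothetical LoRA adapter over a frozen int4-quantized base weight.

    q_weight is an (out_features, in_features) tensor of int4 values stored in
    int8 (a real implementation would pack two values per byte); only the small
    fp16 LoRA factors A and B receive gradients.
    """

    def __init__(self, q_weight: torch.Tensor, scale: torch.Tensor,
                 in_features: int, out_features: int, r: int = 16, alpha: int = 32):
        super().__init__()
        # frozen quantized base weight and its scale: buffers, so no gradients
        self.register_buffer("q_weight", q_weight)
        self.register_buffer("scale", scale)
        # trainable low-rank adapters kept in fp16
        self.lora_A = nn.Parameter(torch.randn(r, in_features, dtype=torch.float16) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, r, dtype=torch.float16))
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x.to(self.lora_A.dtype)
        # dequantize for clarity; a fused int4 kernel would avoid materializing w
        w = self.q_weight.to(x.dtype) * self.scale.to(x.dtype)
        base = x @ w.T                                # frozen base path
        lora = (x @ self.lora_A.T) @ self.lora_B.T    # trainable low-rank update
        return base + self.scaling * lora
```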
Links: Paper Code
Motivation
Having int4 as a trainable dtype would provide a ton of utility. Two consumer 3090s give on the order of a peta-OP/s of int4 tensor throughput according to Nvidia's documentation. It would expand the training possibilities for the GPU poor, and significantly increase training speed for server applications.
Your contribution
Gathered information. I'm not very good at coding, at least not good enough to contribute to the transformers repo. This might be too long of an endeavor; if it is, sorry for wasting your time, and we can close this feature request.