johnsmith0031 / alpaca_lora_4bit

MIT License
534 stars 84 forks source link

Can 30b model be trained from 0 on a single 4090 GPU? #27

Open OlegJakushkin opened 1 year ago

OlegJakushkin commented 1 year ago

So I would like to build a helper based on the 30b params model that would fit into 4090, yet train it from 0 to be free from weights licensing legacy. Can this be done, and if yes, how much time could it take on a single 4090 GPU?

johnsmith0031 commented 1 year ago

I think It's impossible to train from 0 because the pretrained weight costs millions of dollars...

s4rduk4r commented 1 year ago

@OlegJakushkin maybe if you'll be able to adapt the techniques from this paper it could be done. If you'll manage to do so - you can publish a paper https://arxiv.org/pdf/2212.14034.pdf

OlegJakushkin commented 1 year ago

@s4rduk4r: Would it not be just like finetuning from all weights initialised at 0/1/random with a larger initial step size? (then the question would be how to initialise all weights to 0 and keep them all 4 bits?)

Or finetune only tunes specific layers?

s4rduk4r commented 1 year ago

@s4rduk4r: Would it not be just like finetuning from all weights initialised at 0/1/random with a larger initial step size? (then the question would be how to initialise all weights to 0 and keep them all 4 bits?)

Or finetune only tunes specific layers?

Finetune is a process of all weights update to produce a special-tailored model to better perform on specific task(s). Training from scratch can be seen as a general case of finetuning, but actually poses greater challenge because of vanishing gradients problem. There are different techniques proposed on how to deal with it, but as I understand it - the deeper the model, the less the actual change of the weights. My wildest guess is that GPTQ's success is somehow related to this phenomena, but I can't prove it so take my words with a grain of salt. What I'm trying to say is that you could make a hybrid approach - uncompressed first layers (don't know how many - you'll need to search in papers or experiment) and quantized subsequent layers. Not sure if the model size still be slim enough to fit into VRAM, probably you'll need to employ some sort of offloading. Maybe others here (or elsewhere) will give better advice.

Also I wouldn't recommend to initialize all weights into zeros, because it won't train. Y = WX, where W - is the weights matrix. If all elements are zero, then Y = 0 regardless of X and gradients are also always 0. You should consult the papers on vanishing gradient problem, they usually consider the best ways of weight initialization

SpaceCowboy850 commented 1 year ago

To add to this, I'm reading the LLama paper now, so this is worth noting:

When training a 65B-parameter model, our code processes around 380 tokens/sec/GPU on **2048** A100 GPU with 80GB of RAM. This means that training over our dataset containing 1.4T tokens takes approximately 21 days

So if you have a few thousand $15000 GPUs lying around and 3 weeks to kill, it's doable. :D