johnsmith0031 / alpaca_lora_4bit

Request: Support writing models with multiple loras applied (at 4-bit) #90

Closed tensiondriven closed 1 year ago

tensiondriven commented 1 year ago

Related to #60, would it be feasible to add the ability to take a base model and a list of loras, and output a merged model with the loras "baked in"? #60 refers to "multi adapter", which I think is closely related and references the project that should enable this; however, that ticket is about switching the active lora on the fly (i.e. keeping multiple loras loaded against the same base model in memory at once).

I'm currently trying to work around the memory limitation imposed by applying loras at inference time. My proposed solution requires being able to write out a model, in safetensors or another format, at 4-bit, that can then be used for inference later.

This capability really expanded what was possible in the Stable Diffusion community and created a whole ecosystem of sharing - I would be quite excited to see it come to alpaca_lora_4bit first!

I may be able to scrape together a PR, but Python is not my first language. If anyone smarter has a bit of time/energy to invest in whipping up a separate python file, or if anyone can provide guidance on what it would take specifically, please let me know!

johnsmith0031 commented 1 year ago

I think to bake in a lora we need the original weights, not the 4-bit ones, because the 4-bit model has already lost some quality to quantization; if you bake the lora into the 4-bit model and quantize it again, the result might be even worse. Peft already has a feature for baking in a lora, via the peft.tuners.lora.Linear.eval() method; after that you can just save the whole checkpoint. Also, if you want to bake in a lora starting from a 4-bit model, you can use quant_cuda.vecquant4recons_v2 to reconstruct the 4-bit weights in fp16, then bake in the lora, save, and quantize it again.
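Roughly something like this in spirit (just a sketch of the merge step; the dequantize helper and layer/attribute names below are placeholders, not this repo's actual API, and the reconstruction call you'd really use is quant_cuda.vecquant4recons_v2 with whatever arguments the quant layer stores):

```python
import torch

def merge_lora_into_fp16(w_fp16: torch.Tensor,
                         lora_A: torch.Tensor,
                         lora_B: torch.Tensor,
                         scaling: float) -> torch.Tensor:
    """Return the merged weight W' = W + scaling * (B @ A), kept in fp16."""
    delta = (lora_B.float() @ lora_A.float()) * scaling  # (out, r) @ (r, in) -> (out, in)
    return (w_fp16.float() + delta).to(torch.float16)

# Usage idea (names are illustrative only):
# w_fp16 = dequantize_to_fp16(quant_layer)               # 4-bit -> fp16 reconstruction
# merged = merge_lora_into_fp16(w_fp16, A, B, alpha / r) # bake in the lora
# state_dict[f"{name}.weight"] = merged                  # collect into an fp16 checkpoint
# ...then re-quantize the merged fp16 checkpoint with the usual GPTQ script.
```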

tensiondriven commented 1 year ago

Fascinating. I didn't realize it would be necessary to "rehydrate" the model back to FP16; given that, I agree, it probably makes sense to reconstruct a 16-bit version, bake the LoRA into it, and quantize again.

I'm also wondering if I can use state dicts for this, or if the adapter files (which are created at the end of the process) are somehow better.

I've been chatting with GPT-3 a lot and learning a lot today, but there are a lot of concepts, and the information on these topics exists but is spread out.

The other thing I am wondering is how large the training datasets need to be; I had heard/read that 10MB of text was a good guideline. I recently tried with less than a megabyte, and the training seems to have worked fine. Is there an upper or lower limit on the size of training files? (I'm training in text mode.) Anyhoo, thanks for taking the time to reply. What you've given me might be enough to get me unblocked.

johnsmith0031 commented 1 year ago

Any size should work. You can start with a small dataset, just like the sample dataset in the repo. Try teaching the model something new and test whether it learned it from the training.

tensiondriven commented 1 year ago

Thanks :) I've run three datasets now, from about 800K all the way up to 10MB. At first I trained for 3 epochs, but found the training wasn't "strong" enough; now I'm running training for 12 epochs. My loss on the 10MB dataset is down to 0.4 with about 5 hours to go (90% complete) - I suspect this will be drastically overfit, but I wanted to get a full-on positive result so I could then "bisect" to get back to a reasonable value.

Regarding the original intent of this ticket, I think it can be closed since you filled me in on a strategy to get this done.

Something else I'm wondering is whether it's possible to use the periodic checkpoints for inference. Since we already save checkpoints to resume training, those same checkpoints could be used to measure perplexity on a held-out set, or just to ad-hoc test response quality, partway through a run. If that worked, one could pick a high number of epochs and then evaluate each checkpoint separately, without having to re-run the whole training over and over. Do you happen to know if that's possible? From the research I've done so far I haven't found a way, but I expect it's simple.
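For reference, this is roughly what I have in mind: a minimal sketch using stock transformers + peft in fp16 (not this repo's 4-bit loading path), assuming the checkpoint directory contains the usual PEFT adapter files (adapter_config.json plus adapter_model.bin); the paths are made up, and if the trainer saved the adapter under a different filename it may need to be renamed first.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base model once (placeholder path), then attach a mid-training
# checkpoint as a LoRA adapter for a quick quality check.
base = AutoModelForCausalLM.from_pretrained(
    "path/to/llama-7b-hf", torch_dtype=torch.float16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("path/to/llama-7b-hf")

# Point at any saved checkpoint, e.g. output/checkpoint-1000 (placeholder).
model = PeftModel.from_pretrained(base, "output/checkpoint-1000")
model.eval()

prompt = "### Instruction:\nSay hello.\n### Response:\n"
inputs = tokenizer(prompt, return_tensors="pt").to(base.device)
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```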

peft.tuners.lora.Linear.eval() and quant_cuda.vecquant4recons_v2

I will definitely be taking a look at these functions, thanks so much for the reply.