princeton-nlp / LLM-Shearing

[ICLR 2024] Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning
https://arxiv.org/abs/2310.06694
MIT License

Finetuning using LoRA #25

Closed: Nimisha-Pabbichetty closed this issue 8 months ago

Nimisha-Pabbichetty commented 9 months ago

Is it possible to finetune one of your checkpoints using LoRA on a dataset of our choice? If yes, how might we go about doing it?

xiamengzhou commented 9 months ago

Yes! The Sheared-LLaMA checkpoints are standard Hugging Face models that use the Llama architecture, so you can load them exactly as you would any other Hugging Face model:

from transformers import LlamaForCausalLM
model = LlamaForCausalLM.from_pretrained("princeton-nlp/Sheared-LLaMA-1.3B")

You can then use the peft library to wrap it as a LoRA model; see the sketch below.
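For example, a minimal sketch using peft's LoraConfig and get_peft_model (the rank, alpha, dropout, and target modules here are illustrative assumptions, not values used in this repo):

from transformers import LlamaForCausalLM
from peft import LoraConfig, TaskType, get_peft_model

# Load the sheared checkpoint as a regular Hugging Face causal LM.
model = LlamaForCausalLM.from_pretrained("princeton-nlp/Sheared-LLaMA-1.3B")

# Illustrative LoRA hyperparameters; tune these for your dataset.
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                                   # low-rank dimension of the adapters
    lora_alpha=16,                         # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections in the Llama architecture
)

# Wrap the base model; only the LoRA adapter weights remain trainable.
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

The wrapped model can then be trained on your dataset with your usual training loop or the Hugging Face Trainer.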

Nimisha-Pabbichetty commented 9 months ago

Ah, thank you so much! What kind of compute do you use to load and train these models? I'm using 2 A100s with 80 GB of memory each but keep running into cgroup out-of-memory errors.

xiamengzhou commented 9 months ago

Two A100s should be more than sufficient to train LoRA adapters for the 1.3B and 2.7B models!
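If you do hit memory limits, one common option (a minimal sketch, not something tested in this repo) is to load the base model in bfloat16 and let accelerate place it across your available GPUs:

import torch
from transformers import LlamaForCausalLM

# Loading in bfloat16 roughly halves the memory footprint compared to fp32.
# device_map="auto" requires the accelerate package and shards the model
# across the visible GPUs automatically.
model = LlamaForCausalLM.from_pretrained(
    "princeton-nlp/Sheared-LLaMA-1.3B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)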

Nimisha-Pabbichetty commented 9 months ago

Generally, how much GPU compute do you use for training?

xiamengzhou commented 9 months ago

I didn't run any LoRA experiments on the Sheared-LLaMA models. To obtain the Sheared-LLaMA models themselves, we used full-parameter training: 8 GPUs for pruning and 16 GPUs for continued pre-training.