ehartford opened this issue 1 year ago
Is it currently possible to make LoRAs against 4-bit quantizations of anything? And would this be specific to the 4090, or work on any card with tensor cores and enough RAM?
Yes; I do it every day, using this very repo. I think if you get it working on a 4090 it will also work on a 3090, just slower. I think that should be the target, and if lower cards are enabled, that's a bonus.
What size models are you training? Can you paste a link to the base model you're fine-tuning?
I'd like to hear about your workflow; when you say you do it every day, are you updating a personal assistant / companion model with experiences from each day, or is it for something else?
Vicuña and Alpaca are both fine-tunes of Llama, so as I understand it, if you're able to make a LoRA of an Alpaca model, you should be able to make a LoRA of a Vicuña model. Curious if you've tried this.
I've been able to make a LoRA of Llama 8-bit, and had planned to add a step that applies the LoRA to produce an 8-bit model with the LoRA merged into the weights, and then quantize it. But if I can take a 4-bit Llama/Alpaca/Vicuña, make a LoRA for it, and then apply that LoRA while preserving the base model, I would much rather do that.
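For illustration, here is a minimal sketch of that merge-then-quantize step using the PEFT API. The base model is loaded in fp16 for the merge, and the model and adapter paths are placeholders, not the exact checkpoints discussed in this thread.

```python
# Minimal sketch: merge a trained LoRA into its base model with PEFT,
# then save the merged weights so they can be quantized afterwards.
# Paths are placeholders, not the actual checkpoints from this thread.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(
    "path/to/llama-7b-hf",          # assumption: any fp16 Llama base
    torch_dtype=torch.float16,
)
model = PeftModel.from_pretrained(base, "path/to/my-lora-adapter")  # hypothetical adapter dir

merged = model.merge_and_unload()   # folds the LoRA deltas into the base weights
merged.save_pretrained("llama-7b-merged")

tokenizer = AutoTokenizer.from_pretrained("path/to/llama-7b-hf")
tokenizer.save_pretrained("llama-7b-merged")
# The merged folder can then be fed to a GPTQ quantizer to get a 4-bit model.
```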
> What size models are you training? Can you paste a link to the base model you're fine-tuning?
> I'd like to hear about your workflow; when you say you do it every day, are you updating a personal assistant / companion model with experiences from each day, or is it for something else?
The base models I am using are here:
https://huggingface.co/Neko-Institute-of-Science/LLaMA-7B-4bit-128g
https://huggingface.co/Neko-Institute-of-Science/LLaMA-13B-4bit-128g
https://huggingface.co/Neko-Institute-of-Science/LLaMA-30B-4bit-128g
Guru made a better one here, but it's not 4-bit and I haven't had time to learn how to quantize models:
https://huggingface.co/yahma/llama-7b-hf
https://huggingface.co/yahma/llama-13b-hf
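For anyone curious about that quantization step, here is a rough sketch of 4-bit GPTQ quantization of one of those fp16 checkpoints. Note this is an assumption: it uses the AutoGPTQ library rather than the GPTQ-for-LLaMa scripts that produced the checkpoints linked above, and the calibration data and output path are illustrative only.

```python
# Rough sketch (assumption: AutoGPTQ, not the exact GPTQ-for-LLaMa scripts
# that produced the checkpoints linked above).
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

pretrained = "yahma/llama-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(pretrained, use_fast=True)

quantize_config = BaseQuantizeConfig(bits=4, group_size=128, desc_act=False)
model = AutoGPTQForCausalLM.from_pretrained(pretrained, quantize_config)

# A real run would use a few hundred calibration samples (e.g. from C4);
# this single example is illustrative only.
examples = [tokenizer("LoRA fine-tuning of 4-bit quantized Llama models.", return_tensors="pt")]
model.quantize(examples)

model.save_quantized("llama-7b-4bit-128g")  # hypothetical output directory
```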
I can't fine-tune 65B yet; I hope I can after I get my dual-3090 system set up. If I can, I absolutely will do it, at least once.
Every day means: it takes 30-70 hours to fine-tune a model on my 4090, so I work on my dataset while it's training. Then, after it finishes, I load up another job, usually with an improved dataset.
Use this for supporting other 4-bit models.
```python
def load_model_4bit_gptq(model_name):
    config_path = str(Path(f'{shared.args.model_dir}/{model_name}'))
    model_path = str(find_quantized_model_file(model_name))
    model, tokenizer = load_model_4bit_low_ram(config_path, model_path, groupsize=shared.args.groupsize, is_v1_model=shared.args.is_v1_model)

    # check if model is llama
    if 'llama' in str(type(model)).lower():
        print(Style.BRIGHT + Fore.CYAN + "Model is Llama. Applying Llama specific patches ...")
        try:
            tokenizer.eos_token_id = 2
            tokenizer.bos_token_id = 1
            tokenizer.pad_token_id = 0
        except:
            pass

    return model, tokenizer


def load_model_4bit_low_ram(config_path, model_path, groupsize=-1, device_map="auto", is_v1_model=False):
    print(Style.BRIGHT + Fore.CYAN + "Loading Model ...")
    t0 = time.time()

    with accelerate.init_empty_weights():
        config = AutoConfig.from_pretrained(config_path)
        model = AutoModelForCausalLM.from_config(config)
        model = model.eval()
        layers = find_layers(model)
        for name in ['lm_head']:
            if name in layers:
                del layers[name]
        make_quant_for_4bit_autograd(model, layers, groupsize=groupsize, is_v1_model=is_v1_model)

    model = accelerate.load_checkpoint_and_dispatch(
        model=model,
        checkpoint=model_path,
        device_map=device_map
    )

    for n, m in model.named_modules():
        if isinstance(m, Autograd4bitQuantLinear):
            if m.is_v1_model:
                m.zeros = m.zeros.half()
            m.scales = m.scales.half()
            m.bias = m.bias.half()

    try:
        tokenizer = AutoTokenizer.from_pretrained(config_path)
    except HFValidationError as e:
        tokenizer = AutoTokenizer.from_pretrained(model)
    tokenizer.truncation_side = 'left'

    print(Style.BRIGHT + Fore.GREEN + f"Loaded the model in {(time.time()-t0):.2f} seconds.")
    return model, tokenizer
```
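A rough usage sketch for the loader above, assuming the patched functions are importable; the paths and generation prompt are placeholders, not the actual setup used in this thread.

```python
# Rough usage sketch; paths are placeholders, and shared.args is assumed to be
# configured as in text-generation-webui if you go through load_model_4bit_gptq.
model, tokenizer = load_model_4bit_low_ram(
    config_path="models/llama-13b-4bit-128g",  # folder with config.json + tokenizer files
    model_path="models/llama-13b-4bit-128g/llama-13b-4bit-128g.safetensors",
    groupsize=128,
    device_map="auto",
    is_v1_model=False,
)

prompt = "The quick brown fox"
inputs = tokenizer(prompt, return_tensors="pt").to(0)
output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```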
The offloading works too if you genericize it, and the training probably would as well; then any model could be fine-tuned.
AFAIK, these are the non-splittable layers:
`no_split_module_classes=["LlamaDecoderLayer", "GPTJBlock", "OPTDecoderLayer", "GPTNeoXLayer"]`
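One way to genericize the offloading (a sketch, not the repo's actual code) is to build the device map with accelerate's infer_auto_device_map and pass the same no_split_module_classes list to load_checkpoint_and_dispatch, reusing the model and model_path from load_model_4bit_low_ram above; the max_memory values are placeholders for a 24 GB GPU plus CPU RAM.

```python
# Sketch: genericized offloading with accelerate; max_memory values are
# placeholders, adjust to your hardware.
import accelerate

no_split = ["LlamaDecoderLayer", "GPTJBlock", "OPTDecoderLayer", "GPTNeoXLayer"]

device_map = accelerate.infer_auto_device_map(
    model,
    max_memory={0: "20GiB", "cpu": "48GiB"},
    no_split_module_classes=no_split,
)

model = accelerate.load_checkpoint_and_dispatch(
    model=model,
    checkpoint=model_path,
    device_map=device_map,
    no_split_module_classes=no_split,
)
```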
The world needs a johnsmith0031 of Vicuna to make 4-bit LoRA fine-tuning work and to make flash attention work on the 4090. Right now training Vicuna takes 4x A100 40GB: https://github.com/lm-sys/FastChat#fine-tuning-vicuna-7b-with-local-gpus. Just wondering if there are any efforts out there? Or ambitions, even?