johnsmith0031 / alpaca_lora_4bit

MIT License
533 stars, 84 forks

off topic but.. we need a johnsmith0031 of Vicuna #84

Open ehartford opened 1 year ago

ehartford commented 1 year ago

The world needs a johnsmith0031 of Vicuna: someone to make 4-bit LoRA fine-tuning and flash attention work on a 4090. Right now, training Vicuna takes 4x A100 40GB. Just wondering if there are any efforts out there? Or ambitions, even?

tensiondriven commented 1 year ago

Is it currently possible to make LoRAs against 4-bit quantizations of anything? And would this be specific to the 4090, or would it work on any card with tensor cores and enough RAM?

ehartford commented 1 year ago

Yes; I do it every day, using this very repo. I think if you get it working on a 4090 it will also work on a 3090, just slower. I think that should be the target, and if lower-end cards are enabled, that's a bonus.

tensiondriven commented 1 year ago

What size models are you training? Can you paste a link to the base model you're fine-tuning?

I'd like to hear about your workflow; when you say you do it every day, are you updating a personal assistant / companion model with experiences from each day, or is it for something else?

tensiondriven commented 1 year ago

Vicuña and Alpaca are both fine-tunes of Llama, so as I understand it, if you're able to make a LoRA of an Alpaca model, you should be able to make a LoRA of a Vicuña model. Curious if you've tried this.

I've been able to make a LoRA of 8-bit Llama, and had planned to add a step that applies the LoRA to produce an 8-bit model with the LoRA embedded in it, and then quantize that. But if I can take a 4-bit Llama/Alpaca/Vicuña, make a LoRA for it, and then apply that LoRA while preserving the base model, I would much rather do that.
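For reference, "applying" a LoRA so that it is embedded in the model usually means adding the low-rank update to each adapted weight matrix, W' = W + (alpha / r) * B @ A. The sketch below shows that rule on a toy matrix in plain Python. The function names and the numbers are made up for illustration; this is not code from this repo, and merging into a 4-bit base would additionally require dequantizing the weights first, which is why keeping the adapter separate preserves the base model.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain-Python matrix product, enough for a toy example."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(w: Matrix, lora_a: Matrix, lora_b: Matrix, alpha: float, r: int) -> Matrix:
    """Return W + (alpha / r) * B @ A, the standard LoRA merge rule."""
    scale = alpha / r
    delta = matmul(lora_b, lora_a)  # shape (out_features, in_features)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


# Toy 2x2 base weight with a rank-1 adapter (hypothetical values).
w = [[1.0, 0.0], [0.0, 1.0]]
lora_a = [[1.0, 2.0]]    # shape (r, in_features) = (1, 2)
lora_b = [[0.5], [0.0]]  # shape (out_features, r) = (2, 1)

merged = merge_lora(w, lora_a, lora_b, alpha=2.0, r=1)
print(merged)
```

Only the first output row changes here, because the second row of `lora_b` is zero.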

ehartford commented 1 year ago

> What size models are you training? Can you paste a link to the base model you're fine-tuning?
>
> I'd like to hear about your workflow; when you say you do it every day, are you updating a personal assistant / companion model with experiences from each day, or is it for something else?

The base models I am using are here:

Guru made a better one here, though it's not 4-bit, and I haven't had time to learn how to quantize models.

I can't fine-tune 65B yet; I hope I can after I get my dual-3090 system set up. If I can, I absolutely will do it, at least once.

"Every day" means, well, it takes 30-70 hours to fine-tune a model on my 4090. So I work on my dataset while it's training. Then, after it finishes, I load up another job, usually with an improved dataset.

johnsmith0031 commented 1 year ago

Use this for supporting other 4-bit models.

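import time
from pathlib import Path

import accelerate
from colorama import Fore, Style
from huggingface_hub.utils import HFValidationError
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

# Assumed import locations (not shown in the original post); adjust to your checkout.
# The first is this repo's module, the other two are text-generation-webui's.
from autograd_4bit import Autograd4bitQuantLinear, find_layers, make_quant_for_4bit_autograd
from modules import shared
from modules.GPTQ_loader import find_quantized_model_file

def load_model_4bit_gptq(model_name):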
    config_path = str(Path(f'{shared.args.model_dir}/{model_name}'))
    model_path = str(find_quantized_model_file(model_name))

    model, tokenizer = load_model_4bit_low_ram(config_path, model_path, groupsize=shared.args.groupsize, is_v1_model=shared.args.is_v1_model)

    # Check whether the model is Llama and, if so, set its special-token ids explicitly.
    if 'llama' in str(type(model)).lower():
        print(Style.BRIGHT + Fore.CYAN + "Model is Llama. Applying Llama specific patches ...")
        tokenizer.eos_token_id = 2
        tokenizer.bos_token_id = 1
        tokenizer.pad_token_id = 0

    return model, tokenizer

def load_model_4bit_low_ram(config_path, model_path, groupsize=-1, device_map="auto", is_v1_model=False):

    print(Style.BRIGHT + Fore.CYAN + "Loading Model ...")
    t0 = time.time()

    with accelerate.init_empty_weights():
        config = AutoConfig.from_pretrained(config_path)
        model = AutoModelForCausalLM.from_config(config)
        model = model.eval()
        layers = find_layers(model)
        for name in ['lm_head']:
            if name in layers:
                del layers[name]
        make_quant_for_4bit_autograd(model, layers, groupsize=groupsize, is_v1_model=is_v1_model)
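    # The arguments of this call were cut off in the post. They are reconstructed
    # here from the function's parameters; the no-split list is an assumption.
    model = accelerate.load_checkpoint_and_dispatch(
        model=model,
        checkpoint=model_path,
        device_map=device_map,
        no_split_module_classes=["LlamaDecoderLayer", "GPTJBlock", "OPTDecoderLayer", "GPTNeoXLayer"],
    )
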
    for n, m in model.named_modules():
        if isinstance(m, Autograd4bitQuantLinear):
            if m.is_v1_model:
                m.zeros = m.zeros.half()
            m.scales = m.scales.half()
            m.bias = m.bias.half()

    try:
        tokenizer = AutoTokenizer.from_pretrained(config_path)
    except HFValidationError as e:
        # Fallback kept as posted; note that it passes the model object rather than a path.
        tokenizer = AutoTokenizer.from_pretrained(model)
    tokenizer.truncation_side = 'left'

    print(Style.BRIGHT + Fore.GREEN + f"Loaded the model in {(time.time()-t0):.2f} seconds.")

    return model, tokenizer

Ph0rk0z commented 1 year ago

The offloading works too if you genericize it. Training probably would as well, and then any model could be fine-tuned.

AFAIK, these are the non-splittable layers:

no_split_module_classes=["LlamaDecoderLayer", "GPTJBlock", "OPTDecoderLayer", "GPTNeoXLayer"]
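One way to genericize this is to pick the block class from the Hugging Face `config.model_type` string instead of hard-coding a single architecture. The sketch below is only an illustration: the helper name `no_split_classes` and the mapping are not from this repo, and the model-type keys are the usual Transformers identifiers for these four architectures.

```python
from typing import List, Optional

# Decoder-block class names from the list above, keyed by the
# `config.model_type` string of the architecture they belong to.
NO_SPLIT_BY_MODEL_TYPE = {
    "llama": "LlamaDecoderLayer",
    "gptj": "GPTJBlock",
    "opt": "OPTDecoderLayer",
    "gpt_neox": "GPTNeoXLayer",
}


def no_split_classes(model_type: Optional[str] = None) -> List[str]:
    """Return class names suitable for `no_split_module_classes`.

    A known model_type yields just that architecture's block class.
    Anything else falls back to the full list; names that do not occur
    in the loaded model should simply never match.
    """
    if model_type in NO_SPLIT_BY_MODEL_TYPE:
        return [NO_SPLIT_BY_MODEL_TYPE[model_type]]
    return list(NO_SPLIT_BY_MODEL_TYPE.values())


print(no_split_classes("gpt_neox"))
print(no_split_classes())
```

The result could then be passed as `no_split_module_classes=no_split_classes(config.model_type)` when dispatching the checkpoint.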