johnsmith0031 / alpaca_lora_4bit

MIT License

off topic but.. we need a johnsmith0031 of Vicuna #84

Open ehartford opened 1 year ago

ehartford commented 1 year ago

The world needs a johnsmith0031 of Vicuna to make 4-bit LoRA finetuning and flash attention work on a 4090. Right now, training Vicuna takes 4x A100 40GB (https://github.com/lm-sys/FastChat#fine-tuning-vicuna-7b-with-local-gpus). Just wondering if there are any efforts out there? Or ambitions, even?

tensiondriven commented 1 year ago

Is it currently possible to make LoRAs against 4-bit quantizations of anything? And would this be specific to the 4090, or work on any card with tensor cores and enough RAM?

ehartford commented 1 year ago

Yes; I do it every day, using this very repo. I think if you get it working on a 4090 it will also work on a 3090, just slower. I think that should be the target, and if lower cards are enabled, that's a bonus.

tensiondriven commented 1 year ago

What size models are you training? Can you paste a link to the base model you're fine-tuning?

I'd like to hear about your workflow; when you say you do it every day, are you updating a personal assistant / companion model with experiences from each day, or is it for something else?

tensiondriven commented 1 year ago

Vicuna and Alpaca are both fine-tunes of LLaMA, so as I understand it, if you're able to make a LoRA of an Alpaca model, you should be able to make a LoRA of a Vicuna model. Curious if you've tried this.

I've been able to make a LoRA of 8-bit LLaMA, and had planned to add a step that applies the LoRA to generate an 8-bit model with the LoRA embedded in the model, and then quantize it. But if I can take a 4-bit LLaMA/Alpaca/Vicuna, make a LoRA for it, and then apply that LoRA while preserving the base model, I would much rather do that.
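(For reference, the merge step in that 8-bit path can be done with peft's merge_and_unload; a minimal sketch, with placeholder paths and the yahma/llama-7b-hf base linked further down as an example:)

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# load the fp16 base model and attach the trained LoRA adapter
base = AutoModelForCausalLM.from_pretrained("yahma/llama-7b-hf", torch_dtype=torch.float16)
model = PeftModel.from_pretrained(base, "./my-lora-adapter")  # placeholder adapter directory

# fold the LoRA deltas into the base weights and save a plain HF checkpoint,
# which can then be quantized like any other model
model = model.merge_and_unload()
model.save_pretrained("./llama-7b-merged")
AutoTokenizer.from_pretrained("yahma/llama-7b-hf").save_pretrained("./llama-7b-merged")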

ehartford commented 1 year ago

> What size models are you training? Can you paste a link to the base model you're fine-tuning?
>
> I'd like to hear about your workflow; when you say you do it every day, are you updating a personal assistant / companion model with experiences from each day, or is it for something else?

The base models I am using are here:

https://huggingface.co/Neko-Institute-of-Science/LLaMA-7B-4bit-128g
https://huggingface.co/Neko-Institute-of-Science/LLaMA-13B-4bit-128g
https://huggingface.co/Neko-Institute-of-Science/LLaMA-30B-4bit-128g

Guru made a better one here, but it's not 4-bit and I haven't had time to learn how to quantize models:
https://huggingface.co/yahma/llama-7b-hf
https://huggingface.co/yahma/llama-13b-hf
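(For anyone wanting to 4-bit quantize those: one route is the AutoGPTQ library, a different tool from the GPTQ-for-LLaMa kernels this repo builds on. A rough sketch of its basic usage, with placeholder paths and a toy calibration example; a real run needs a proper calibration set, and the exact API and resulting checkpoint format may differ by version:)

from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

base_dir = "yahma/llama-7b-hf"       # fp16 base model to quantize
out_dir = "./llama-7b-4bit-128g"     # placeholder output directory

tokenizer = AutoTokenizer.from_pretrained(base_dir, use_fast=True)
# a real run would use a few hundred calibration samples, not one toy sentence
examples = [tokenizer("The quick brown fox jumps over the lazy dog.")]

quantize_config = BaseQuantizeConfig(bits=4, group_size=128)
model = AutoGPTQForCausalLM.from_pretrained(base_dir, quantize_config)
model.quantize(examples)
model.save_quantized(out_dir)
tokenizer.save_pretrained(out_dir)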

I can't finetune 65B yet; I hope I can after I get my dual-3090 system set up. If I can, I absolutely will do it, at least once.

"Every day" means: well, it takes 30-70 hours to fine-tune a model on my 4090. So I work on my dataset while it's training; then after it finishes I load up another job, usually with an improved dataset.

johnsmith0031 commented 1 year ago

Use this for supporting other 4-bit models.

# Imports for the snippets below. The exact module paths for shared,
# find_quantized_model_file, find_layers, make_quant_for_4bit_autograd and
# Autograd4bitQuantLinear depend on your setup (text-generation-webui plus this
# repo's autograd_4bit); the paths here are a best guess.
import time
from pathlib import Path

import accelerate
from colorama import Fore, Style
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer
from huggingface_hub.utils import HFValidationError

from modules import shared                                  # text-generation-webui globals (assumed location)
from modules.GPTQ_loader import find_quantized_model_file   # assumed location
from autograd_4bit import (                                 # from this repo (assumed location)
    Autograd4bitQuantLinear,
    make_quant_for_4bit_autograd,
    find_layers,
)


def load_model_4bit_gptq(model_name):
    config_path = str(Path(f'{shared.args.model_dir}/{model_name}'))
    model_path = str(find_quantized_model_file(model_name))

    model, tokenizer = load_model_4bit_low_ram(config_path, model_path, groupsize=shared.args.groupsize, is_v1_model=shared.args.is_v1_model)

    # check if model is llama
    if 'llama' in str(type(model)).lower():
        print(Style.BRIGHT + Fore.CYAN + "Model is Llama. Applying Llama specific patches ...")
        try:
            # hard-code LLaMA's special token ids (</s> = 2, <s> = 1; 0 is used
            # as pad by convention, since LLaMA has no dedicated pad token)
            tokenizer.eos_token_id = 2
            tokenizer.bos_token_id = 1
            tokenizer.pad_token_id = 0
        except Exception:
            pass

    return model, tokenizer

def load_model_4bit_low_ram(config_path, model_path, groupsize=-1, device_map="auto", is_v1_model=False):

    print(Style.BRIGHT + Fore.CYAN + "Loading Model ...")
    t0 = time.time()

    # build the model skeleton on the meta device (no real weight allocation),
    # then swap its linear layers for 4-bit quantized ones before loading weights
    with accelerate.init_empty_weights():
        config = AutoConfig.from_pretrained(config_path)
        model = AutoModelForCausalLM.from_config(config)
        model = model.eval()
        layers = find_layers(model)
        # quantize every linear layer except lm_head, which stays in full precision
        for name in ['lm_head']:
            if name in layers:
                del layers[name]
        make_quant_for_4bit_autograd(model, layers, groupsize=groupsize, is_v1_model=is_v1_model)
    model = accelerate.load_checkpoint_and_dispatch(
        model=model,
        checkpoint=model_path,
        device_map=device_map
    )

    # cast the quantization parameters (and bias) to fp16 so they match fp16 activations
    for n, m in model.named_modules():
        if isinstance(m, Autograd4bitQuantLinear):
            if m.is_v1_model:
                m.zeros = m.zeros.half()
            m.scales = m.scales.half()
            m.bias = m.bias.half()

    try:
        tokenizer = AutoTokenizer.from_pretrained(config_path)
    except HFValidationError:
        # fall back to the directory holding the quantized checkpoint
        tokenizer = AutoTokenizer.from_pretrained(str(Path(model_path).parent))
    tokenizer.truncation_side = 'left'

    print(Style.BRIGHT + Fore.GREEN + f"Loaded the model in {(time.time()-t0):.2f} seconds.")

    return model, tokenizer
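
A quick usage sketch of the loader above (the model folder name is a placeholder; config.json and the quantized checkpoint are assumed to live under shared.args.model_dir/<model_name>, with shared.args.groupsize and shared.args.is_v1_model set to match, e.g. 128 and False for the 128g models linked earlier):

model, tokenizer = load_model_4bit_gptq('llama-13b-4bit-128g')  # placeholder folder name

prompt = "The quick brown fox"
input_ids = tokenizer(prompt, return_tensors='pt').input_ids.to(model.device)
output = model.generate(input_ids, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
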
Ph0rk0z commented 1 year ago

The offloading works too if you genericize it. The training probably would as well, and then any model could be finetuned.

AFAIK, these are the non-splittable layers:

no_split_module_classes=["LlamaDecoderLayer", "GPTJBlock", "OPTDecoderLayer", "GPTNeoXLayer"]
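
(A generic version of the offloading would pass that same list to accelerate when building the device map and dispatching the checkpoint; a rough sketch, reusing model and model_path from load_model_4bit_low_ram above, with an example memory budget:)

import accelerate

no_split = ["LlamaDecoderLayer", "GPTJBlock", "OPTDecoderLayer", "GPTNeoXLayer"]

# decide which layers live on GPU 0 and which get offloaded to CPU,
# without ever splitting a single decoder block across devices
device_map = accelerate.infer_auto_device_map(
    model,                                      # the empty-weights, quantized-layout model
    max_memory={0: "20GiB", "cpu": "48GiB"},    # example budget
    no_split_module_classes=no_split,
)

model = accelerate.load_checkpoint_and_dispatch(
    model=model,
    checkpoint=model_path,
    device_map=device_map,
    no_split_module_classes=no_split,
)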