tloen / alpaca-lora

Instruct-tune LLaMA on consumer hardware
Apache License 2.0

Releasing Alpaca 30B adapters #77

Open aspctu opened 1 year ago

aspctu commented 1 year ago

Thanks to this repo, I wanted to share the LoRA adapters for the 30B model. I used the cleaned dataset. Maybe we can add all of the OSS adapters to the README @tloen?

https://huggingface.co/baseten/alpaca-30b

zsc commented 1 year ago

@aspctu How many and what GPUs did you use to run model inference? For the smaller alpaca-13b I kept getting CUDA OOM despite much effort to tweak device_map.

aspctu commented 1 year ago

@zsc I used 1xA100 80GB to train the model. For inference, I used the same instance but it can work on a 40GB A100.

MetaIX commented 1 year ago

Awesome. Do you plan on doing 65B too?

baleksey commented 1 year ago

@aspctu Could you please give more information about the training? How many steps / final loss?

And a 65B Alpaca adapter would be great to play with, since most of us don't have GPUs powerful enough to train it ourselves.

aspctu commented 1 year ago

@baleksey Yes, definitely.

eval_loss: .8007
train_loss: .778
epochs: 3
steps: 1180

As you can see here, and similar to other LoRA runs on this dataset, the train loss plateaus pretty early but we continue to see improvements in eval loss.

[Screenshots: training loss and eval loss curves]

Re: 65B, I'll see if I get a chance to do it this week but I'm more interested in scaling the dataset up and seeing how 30B and smaller variants perform. There are performance gains to be had with a better dataset on the smaller models.

dnhkng commented 1 year ago

Are the weights available?

baleksey commented 1 year ago

@aspctu Thank you for the information! I agree, if we can achieve the best possible result on a smaller model, that would be the ideal scenario.

Please keep us updated on your results!

P.S. Thank you for the 30B LoRA, it works great and is definitely more capable than 7B.

aspctu commented 1 year ago

@baleksey Yes, absolutely will keep you updated :)

@dnhkng Yes, you can use them from here https://huggingface.co/baseten/alpaca-30b

Qubitium commented 1 year ago

@aspctu I am trying to run generate on 30B on 2x 3090 24G but it is not working. The model loads and instructions/prompts are accepted, but when it comes to eval, it fails as follows:

EDIT: See https://github.com/tloen/alpaca-lora/issues/69. There are issues when the model is split across multiple GPUs; 30B cannot run correctly on multiple GPUs for now.

CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching /usr/local/cuda/lib64...
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 8.6
CUDA SETUP: Detected CUDA version 118
CUDA SETUP: Loading binary /root/anaconda3/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda118.so...
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization.
The tokenizer class you load from this checkpoint is 'LLaMATokenizer'.
The class this function is called from is 'LlamaTokenizer'.
Loading checkpoint shards: 100%|████████████████████████████████████████████████████| 61/61 [00:53<00:00,  1.14it/s]
Type quit or exit to exit this loop
Instruction: Why is the sky blue?
Input (optional):
Traceback (most recent call last):
  File "/root/llama/generatev1.py", line 94, in <module>
    print(evaluate(model, tokenizer, instruction_str, input_str))
  File "/root/llama/generatev1.py", line 66, in evaluate
    generation_output = model.generate(
  File "/root/anaconda3/lib/python3.9/site-packages/peft/peft_model.py", line 581, in generate
    outputs = self.base_model.generate(**kwargs)
  File "/root/anaconda3/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/root/anaconda3/lib/python3.9/site-packages/transformers/generation/utils.py", line 1490, in generate
    return self.beam_search(
  File "/root/anaconda3/lib/python3.9/site-packages/transformers/generation/utils.py", line 2749, in beam_search
    outputs = self(
  File "/root/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/anaconda3/lib/python3.9/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/root/anaconda3/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 765, in forward
    outputs = self.model(
  File "/root/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/anaconda3/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 614, in forward
    layer_outputs = decoder_layer(
  File "/root/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/anaconda3/lib/python3.9/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/root/anaconda3/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 309, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/root/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/anaconda3/lib/python3.9/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/root/anaconda3/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 209, in forward
    query_states = self.q_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
  File "/root/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/anaconda3/lib/python3.9/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/root/anaconda3/lib/python3.9/site-packages/peft/tuners/lora.py", line 522, in forward
    result = super().forward(x)
  File "/root/anaconda3/lib/python3.9/site-packages/bitsandbytes/nn/modules.py", line 242, in forward
    out = bnb.matmul(x, self.weight, bias=self.bias, state=self.state)
  File "/root/anaconda3/lib/python3.9/site-packages/bitsandbytes/autograd/_functions.py", line 488, in matmul
    return MatMul8bitLt.apply(A, B, out, bias, state)
  File "/root/anaconda3/lib/python3.9/site-packages/torch/autograd/function.py", line 506, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/root/anaconda3/lib/python3.9/site-packages/bitsandbytes/autograd/_functions.py", line 317, in forward
    state.CxB, state.SB = F.transform(state.CB, to_order=formatB)
  File "/root/anaconda3/lib/python3.9/site-packages/bitsandbytes/functional.py", line 1698, in transform
    prev_device = pre_call(A.device)
AttributeError: 'NoneType' object has no attribute 'device'

zsc commented 1 year ago

I also encountered this mysterious "'NoneType' object has no attribute 'device'" bug. My solution is to use export_hf_checkpoint.py to convert the base+LoRA model to a vanilla model, and then use standard hugging face accelerate library for multi-GPU deployment, either in fp16 or int8.
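
A minimal sketch of that loading step, assuming the merged weights were exported to ./hf_ckpt (the path, tokenizer source, and dtype are illustrative):

import torch
from transformers import LlamaForCausalLM, LlamaTokenizer

# Merged base+LoRA checkpoint written by export_hf_checkpoint.py (path is an assumption).
merged_path = "./hf_ckpt"

# If the export directory doesn't include a tokenizer, load the base model's tokenizer.
tokenizer = LlamaTokenizer.from_pretrained("decapoda-research/llama-30b-hf")

# device_map="auto" lets accelerate shard the layers across all visible GPUs.
model = LlamaForCausalLM.from_pretrained(
    merged_path,
    torch_dtype=torch.float16,  # or load_in_8bit=True for int8
    device_map="auto",
)
model.eval()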

nenkoru commented 1 year ago

I also encountered this mysterious "'NoneType' object has no attribute 'device'" bug. My solution is to use export_hf_checkpoint.py to convert the base+LoRA model to a vanilla model, and then use standard hugging face accelerate library for multi-GPU deployment, either in fp16 or int8.

@zsc Could you please share a snippet of how you do it?

zsc commented 1 year ago

@nenkoru It's really as straightforward as replacing the two occurrences of "7" with "30" in export_hf_checkpoint.py. And when you get the ./hf_ckpt, just point your model.from_pretrained to that directory to load the shiny new all-in-one alpaca-30b.
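
In rough outline it does something like this (a paraphrase, not the exact script: here peft's merge_and_unload helper stands in for the manual weight merging the repo's script performs, and the model names are the 30B variants):

import torch
from peft import PeftModel
from transformers import LlamaForCausalLM

# Load the fp16 base model (30B instead of the script's default 7B).
base_model = LlamaForCausalLM.from_pretrained(
    "decapoda-research/llama-30b-hf",
    torch_dtype=torch.float16,
)

# Apply the 30B LoRA adapter on top of it.
lora_model = PeftModel.from_pretrained(
    base_model,
    "baseten/alpaca-30b",
    torch_dtype=torch.float16,
)

# Fold the LoRA weights into the base weights and save a standalone checkpoint.
merged = lora_model.merge_and_unload()
merged.save_pretrained("./hf_ckpt", max_shard_size="400MB")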

jgsch commented 1 year ago

I also encountered this mysterious "'NoneType' object has no attribute 'device'" bug. My solution is to use export_hf_checkpoint.py to convert the base+LoRA model to a vanilla model, and then use standard hugging face accelerate library for multi-GPU deployment, either in fp16 or int8.

@zsc Indeed, no more errors with your solution, thanks! But did you manage to make it work in int8? I have the impression that the options load_in_8bit=True + device_map="auto" are incompatible with multi-GPU due to a bug: only GPU 0 gets filled, and then we hit an OOM if there is not enough VRAM (e.g. a 3090 with the 30B model).

zsc commented 1 year ago

@jgsch I made alpaca-30B int8 work for 8x 2080Ti.

All you need is to tweak the max memory per GPU, like https://github.com/zsc/llama_infer/blob/main/test_llam.py#L15 . For alpaca-30B int8 on 8x 2080Ti, I tried limiting GPU 0 to "2GiB" and the other GPUs to "6GiB", and it is up and running (no problems even for long prompt input, though it may need further tweaking when you have batched input).

A little note: I later learned that the max_memory option in from_pretrained is meant for this, so maybe you can further simplify the above into from_pretrained(..., max_memory={0: "2GiB", 1: "6GiB", ...}). Let me know the results should you try it.

yfliao commented 1 year ago

@zsc could you give us more details about how to do it? I tried the following code but it didn't work:


BASE_MODEL = "decapoda-research/llama-30b-hf"
model = LlamaForCausalLM.from_pretrained(
    BASE_MODEL,
    load_in_8bit=True,
    device_map="auto",
    max_memory={0: "20GiB", 1: "20GiB", 2: "20GiB", 3: "20GiB", 4: "20GiB",
                5: "20GiB", 6: "20GiB", 7: "20GiB", 8: "20GiB", 9: "20GiB"},
)

nenkoru commented 1 year ago

Okay, I actually managed to load it properly across all the GPUs I have.

1) Export to an HF checkpoint (so that you have the base model with the PEFT adapter merged into it, and make sure you export the size you want).
2) Use the following to load the model in 8-bit (use the same base model as in the previous step):

import torch
from transformers import LlamaForCausalLM, LlamaTokenizer

path_to_alpaca_hf = "/home/rig-nenkoru/alpaca-lora/hf_ckpt"
tokenizer = LlamaTokenizer.from_pretrained("decapoda-research/llama-7b-hf")

model = LlamaForCausalLM.from_pretrained(
    path_to_alpaca_hf,
    load_in_8bit=True,
    torch_dtype=torch.int8,  # quantization itself comes from load_in_8bit; torch.float16 is the more usual choice here
    device_map="auto",
)

model.eval()
model = torch.compile(model)

3) Use the evaluate and generate_prompt functions from the repository (a rough sketch is shown below).

I can release my jupyter notebook with everything from step 2 done, except that you need to export a checkpoint yourself and provide a path.
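
For step 3, the prompt template and generation helpers from the repo's generate.py look roughly like this (a paraphrased sketch; the generation parameters are illustrative):

import torch
from transformers import GenerationConfig

def generate_prompt(instruction, input=None):
    # Standard Alpaca prompt template.
    if input:
        return (
            "Below is an instruction that describes a task, paired with an input that "
            "provides further context. Write a response that appropriately completes "
            f"the request.\n\n### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
        )
    return (
        "Below is an instruction that describes a task. Write a response that "
        f"appropriately completes the request.\n\n### Instruction:\n{instruction}\n\n### Response:\n"
    )

def evaluate(model, tokenizer, instruction, input=None, max_new_tokens=256):
    prompt = generate_prompt(instruction, input)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output = model.generate(
            **inputs,
            generation_config=GenerationConfig(temperature=0.1, top_p=0.75, num_beams=4),
            max_new_tokens=max_new_tokens,
        )
    # Return only the text after "### Response:".
    return tokenizer.decode(output[0], skip_special_tokens=True).split("### Response:")[-1].strip()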

Besides that, I also managed to convert this checkpoint to int4: https://huggingface.co/nenkoru/alpaca-lora-7b-hf-int4

@zsc, thank you for the fast reply. I went down a rabbit hole and solved this the way I wanted.

On top of everything outlined above, I am working on adding support in the optimum library to export the model into ONNX format and use its optimizations for fast inference. That's exactly what I need right now: inference that is as fast as possible. https://github.com/huggingface/optimum/issues/918

zsc commented 1 year ago

@yfliao Is the error message related to CUDA OOM? If that's the case, you may further tweak your max_memory settings.

I saw that you used a uniform 20GiB limit per GPU, but that misses the point of manually specifying a per-GPU limit: we specifically want the first GPU to have a lower cap when loading weights so that there is more room for the activations that will later land on GPU 0. So maybe try lowering the cap on GPU 0 to, say, 10GiB. When I do this kind of trial-and-error GPU memory tweaking, I keep an eye on watch nvidia-smi|grep Mi to see which GPU got blown up and lower the cap on that GPU.
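
Applied to the snippet above, that might look like the following (the limits are illustrative; tune them while watching nvidia-smi):

from transformers import LlamaForCausalLM

BASE_MODEL = "decapoda-research/llama-30b-hf"

# Cap GPU 0 lower than the rest so the activations that land there have headroom.
max_memory = {0: "10GiB"}
max_memory.update({i: "20GiB" for i in range(1, 10)})

model = LlamaForCausalLM.from_pretrained(
    BASE_MODEL,
    load_in_8bit=True,
    device_map="auto",
    max_memory=max_memory,
)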

DarioSucic commented 1 year ago

For anyone else getting AttributeError: 'NoneType' object has no attribute 'device' when running generate.py on the big models, try initializing the model like so:

import torch
from peft import PeftModel
from transformers import LlamaForCausalLM

model = LlamaForCausalLM.from_pretrained(
    model_path,
    load_in_8bit=True,
    torch_dtype=torch.float16,
    device_map="auto",
    max_memory={0: "14GiB", 1: "20GiB"},
    low_cpu_mem_usage=True,
)

# Reuse the base model's device map for the PEFT wrapper, prefixing the keys
# with "base_model.model." to match the wrapped module names.
device_map = {f"base_model.model.{k}": v for k, v in model.hf_device_map.items()}

model = PeftModel.from_pretrained(
    model,
    trained_path,
    device_map=device_map,
    torch_dtype=torch.float16,
)

I could run inference on 33b using this on two 4090s, though it's incredibly slow for some reason.

Qubitium commented 1 year ago

@DarioSucic It's slow because accelerate (device_map="auto") only spreads the VRAM load across the cards. It doesn't actually use both GPUs for compute in parallel. So with 2x GPUs you get negative returns, since the two GPUs now have to shuttle data around via the CPU instead of doing real compute work.