shashank140195 / finetune_Llama2_LORA

This repository contains the code to fine-tune the Llama-2 7B HF model using LoRA on a single A100 40GB GPU.

4bit or 8bit quantization? #1

Open Tizzzzy opened 4 months ago

Tizzzzy commented 4 months ago

Hi, huge fan of your work. I was wondering, in your code are you using 4-bit or 8-bit quantization for LoRA?

shashank140195 commented 4 months ago

Hi, huge fan of your work. I was wondering, in your code are you using 4-bit or 8-bit quantization for LoRA?

Hi, thank you. It is 8-bit quantization.
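For reference, a minimal sketch of how 8-bit loading is typically set up with `BitsAndBytesConfig` (illustrative only; the model name and device map here are assumptions, not necessarily the repo's exact values):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "meta-llama/Llama-2-7b-hf"  # assumed base model

# 8-bit quantization via bitsandbytes; 4-bit would use load_in_4bit=True instead
bnb_config = BitsAndBytesConfig(load_in_8bit=True)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map={"": 0},  # place the whole model on GPU 0
)
```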

Tizzzzy commented 4 months ago

Thank you for responding. I'm curious how long it took to train on your data. Additionally, can your code be adapted for multi-GPU training? If feasible, could you kindly provide the code?

shashank140195 commented 4 months ago

Thank you for responding. I'm curious how long it took to train on your data. Additionally, can your code be adapted for multi-GPU training? If feasible, could you kindly provide the code?

Since I fine-tuned Llama-2 7B on a single A100 (40GB), I did not add multi-GPU support. My code can be found here: https://github.com/shashank140195/finetune_Llama2_LORA/blob/main/scripts/finetune.py
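For readers who don't open the script, a single-GPU LoRA setup on top of an 8-bit model generally looks like the sketch below. This is an illustration of the approach, not a copy of finetune.py; the rank, alpha, and target modules are assumed values.

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# load Llama-2 7B in 8-bit on a single GPU (assumed model name)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map={"": 0},
)

# successor to the deprecated prepare_model_for_int8_training
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,                                 # assumed LoRA rank
    lora_alpha=32,                        # assumed scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # assumed target modules
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()        # reports trainable vs. total parameter counts
```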

Tizzzzy commented 4 months ago

Hi, after running your exact code, only with a different dataset, I keep getting this error. I don't know what went wrong, since I followed the same process as you and ran it on a single A100 with 80GB of GPU RAM.

Can you please take a look? Thank you so much.

The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:02<00:00,  1.16s/it]
/opt/conda/lib/python3.10/site-packages/peft/utils/other.py:143: FutureWarning: prepare_model_for_int8_training is deprecated and will be removed in a future version. Use prepare_model_for_kbit_training instead.
  warnings.warn(
trainable params: 1,415,577,600 || all params: 8,153,993,216 || trainable%: 17.360544245024794
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
  0%|          | 0/233 [00:00<?, ?it/s]
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.
Traceback (most recent call last):
  File "//finetune_bloomberg.py", line 149, in <module>
    trainer.train()
  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 1606, in train
    return inner_training_loop(
  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 1942, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 2873, in training_step
    loss = self.compute_loss(model, inputs)
  File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 2896, in compute_loss
    outputs = model(**inputs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/peft/peft_model.py", line 1083, in forward
    return self.base_model(
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/peft/tuners/tuners_utils.py", line 161, in forward
    return self.model.forward(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/accelerate/hooks.py", line 166, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 1148, in forward
    outputs = self.model(
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/accelerate/hooks.py", line 166, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 964, in forward
    causal_mask = self._update_causal_mask(attention_mask, inputs_embeds)
  File "/opt/conda/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 1055, in _update_causal_mask
    padding_mask = causal_mask[..., :mask_length].eq(0.0) * attention_mask[:, None, None, :].eq(0.0)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
  0%|          | 0/233 [00:01<?, ?it/s]
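One hedged note on the traceback (not a confirmed diagnosis): the "cuda:0 and cpu" error typically means accelerate placed part of the model or its buffers on a different device than the rest, for example when `device_map="auto"` offloads layers to CPU. Pinning the entire quantized model to a single GPU, and checking that the installed transformers/peft versions match the ones the repo was tested with, is a common workaround:

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Pin every module to GPU 0 so no layer or buffer lands on the CPU
# (a possible workaround, not a verified fix for this exact issue).
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",  # assumed base model
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map={"": 0},
)
```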