echo-yi opened this issue 4 weeks ago
@echo-yi Does it work for you with a smaller model, like the example from the PEFT docs?
@matthewdouglas @Titus-von-Koeller Could you please take a look? Could this be an issue specific to Llama 405B?
More context: https://github.com/huggingface/transformers/pull/29587
> Disable zero.init when using DeepSpeed with QLoRA.
I wonder if this is still needed?
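For context, zero.init can be toggled from the Accelerate side of the DeepSpeed integration via `zero3_init_flag`. A minimal sketch of such a config (the surrounding values here are illustrative, not taken from the reporter's setup):

```yaml
# Sketch of an accelerate config that disables DeepSpeed zero.init,
# as the linked PR recommends for QLoRA. Values are illustrative.
compute_environment: LOCAL_MACHINE
distributed_type: DEEPSPEED
deepspeed_config:
  zero_stage: 3
  zero3_init_flag: false  # skip zero.init so bnb can quantize weights on load
mixed_precision: bf16
num_processes: 8
```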
@BenjaminBossan I tried with `meta-llama/Meta-Llama-3.1-8B-Instruct` and `meta-llama/Meta-Llama-3.1-70B-Instruct`, and neither worked.
Thanks for testing those. Since this error already occurs at the stage of loading the base model, it is not directly a PEFT error, though of course PEFT is affected, and I'd be ready to update the docs if it is confirmed that DS ZeRO3 doesn't work with bnb. I hope the bnb authors can enlighten us.
@tjruwase from DeepSpeed shared this line, indicating that applying both quantization and DS ZeRO3 doesn't work.
> shared this line, indicating applying both quantization and DS ZeRO3 doesn't work
Yeah, that was added in the PR I mentioned earlier.
I can confirm that even for smaller models, partitioning does not appear to work. But when I remove quantization and use `device_map="auto"`, the same picture emerges. So I'm actually unsure whether the issue lies with bitsandbytes usage under DeepSpeed ZeRO3 or whether something else is amiss.
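A quick way to check whether ZeRO3 actually partitioned the weights is to compare each parameter's local element count with the full count that DeepSpeed records on parameters it manages (ZeRO3 attaches `ds_*` attributes such as `ds_numel` to converted parameters). The helper below is hypothetical, and the `SimpleNamespace` objects are mocks standing in for `torch.nn.Parameter`, so the sketch runs without GPUs:

```python
# Sketch: detect whether a parameter looks ZeRO3-partitioned.
# DeepSpeed ZeRO-3 attaches ds_* metadata (e.g. ds_numel, the full element
# count) to parameters it manages; a partitioned parameter's local storage
# holds only a shard, so its local numel is smaller than ds_numel.
from types import SimpleNamespace

def is_partitioned(param) -> bool:
    """True if param carries ZeRO3 metadata and holds only a shard locally."""
    full_numel = getattr(param, "ds_numel", None)
    return full_numel is not None and param.numel() < full_numel

# Mock of a parameter that ZeRO3 partitioned across 8 ranks.
sharded = SimpleNamespace(ds_numel=8192 * 8192, numel=lambda: 8192 * 8192 // 8)
# Mock of a parameter that was simply replicated on every GPU.
replicated = SimpleNamespace(numel=lambda: 8192 * 8192)

print(is_partitioned(sharded))     # True
print(is_partitioned(replicated))  # False
```

Running this over `model.parameters()` after `from_pretrained` should make it obvious whether anything was sharded or every rank holds a full copy.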
@BenjaminBossan When I remove ZeRO3 and use quantization with `device_map="auto"`, partitioning does appear to work.
Also pinging @muellerzr in case he knows something about this.
I tested in axolotl against the latest transformers, and this seems to work with this qlora+peft+zero3 YAML:
```yaml
base_model: NousResearch/Meta-Llama-3-8B
load_in_4bit: true
datasets:
  - path: tatsu-lab/alpaca
    type: alpaca
val_set_size: 0.0
output_dir: ./outputs/lora-out
sequence_len: 4096
sample_packing: true
pad_to_sequence_len: true
adapter: qlora
lora_r: 32
lora_alpha: 64
lora_dropout: 0.05
lora_target_linear: true
gradient_accumulation_steps: 4
micro_batch_size: 2
num_epochs: 2
optimizer: adamw_bnb_8bit
lr_scheduler: cosine
learning_rate: 0.0002
bf16: auto
tf32: false
gradient_checkpointing: true
logging_steps: 1
flash_attention: true
warmup_steps: 10
evals_per_epoch: 4
saves_per_epoch: 1
deepspeed: deepspeed_configs/zero3_bf16.json
weight_decay: 0.1
special_tokens:
  pad_token: <|end_of_text|>
```
System Info
Who can help?
@stevhliu
Reproduction
This line

```python
model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3.1-405B-Instruct", ...)
```

throws CUDA OOM, because the parameters are not partitioned but copied across the GPUs.

Command:

```shell
accelerate launch --config_file zero3_config.yaml pretrain.py --num_processes=8 --multi_gpu
```
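The OOM described above is consistent with back-of-the-envelope arithmetic: 405B parameters at 4-bit take roughly 0.5 bytes each, which only fits per GPU if ZeRO3 shards them. A rough sketch (the 80 GB card size is an assumption, e.g. A100/H100-80GB):

```python
# Back-of-envelope memory check: 4-bit quantized 405B weights,
# replicated on every rank vs. partitioned across 8 GPUs.
PARAMS = 405e9          # Llama 3.1 405B parameter count
BYTES_PER_PARAM = 0.5   # 4-bit quantization ~= 0.5 bytes per parameter
GPUS = 8
GPU_MEM_GB = 80         # assumed 80 GB cards

total_gb = PARAMS * BYTES_PER_PARAM / 1e9
replicated_gb = total_gb            # every rank holds a full copy
partitioned_gb = total_gb / GPUS    # ZeRO3 shards weights across ranks

print(f"total weights: ~{total_gb:.0f} GB")
print(f"per-GPU if replicated:  ~{replicated_gb:.0f} GB")   # OOMs on 80 GB
print(f"per-GPU if partitioned: ~{partitioned_gb:.1f} GB")  # fits easily
```

So even before activations and optimizer state, a non-partitioned copy (~200 GB) cannot fit a single card, while the sharded slice (~25 GB) would.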
`pretrain.py`
`zero3_config.yaml`
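The attached files aren't reproduced here, but for readers following along, a typical minimal ZeRO3 + bf16 DeepSpeed config (a sketch of the common shape, not the reporter's actual file) looks like:

```json
{
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "bf16": { "enabled": true },
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto"
}
```

The `"auto"` values are resolved by the Hugging Face Trainer/Accelerate integration from the training arguments.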
Expected behavior
PEFT QLoRA (with bitsandbytes) and DeepSpeed ZeRO3 are both applied, so that model parameters are quantized and partitioned. I thought this should work according to this post, but https://github.com/microsoft/DeepSpeed/issues/5819 says bitsandbytes quantization and ZeRO3 are not compatible. If that is the case, I find the above post quite misleading.