huggingface / peft

šŸ¤— PEFT: State-of-the-art Parameter-Efficient Fine-Tuning.
https://huggingface.co/docs/peft
Apache License 2.0

Failing to use zero_init to construct llama2 with DeepSpeed ZeRO-3 and QLoRA #1844

Open CHNRyan opened 4 weeks ago

CHNRyan commented 4 weeks ago

System Info

bitsandbytes==0.43.1
sentencepiece==0.1.97
huggingface_hub==0.23.2
accelerate==0.30.1
tokenizers==0.19.1
transformers==4.41.1
trl==0.8.6
peft==0.11.1
datasets==2.14.6

Who can help?

@pacman100 @younesbelkada @BenjaminBossan

Information

Tasks

Reproduction

When I run the code, the parameters are first fully loaded onto each GPU and only then sharded. But when I switch to zero3+lora instead of zero3+qlora (i.e. just remove bnb_config = BitsAndBytesConfig(...)), it magically works: the parameters are sharded first and then loaded onto each GPU. So I am confused whether bitsandbytes doesn't support zero3_init, or whether there is an error in my code. I'd really appreciate it if someone could help me!

Here is my code, which follows https://huggingface.co/docs/peft/accelerate/deepspeed#use-peft-qlora-and-deepspeed-with-zero3-for-finetuning-large-models-on-multiple-gpus and https://huggingface.co/docs/transformers/v4.18.0/en/main_classes/deepspeed#constructing-massive-models:~:text=If%20you%20want%20to%20use%20a,is%20how%20example%20scripts%20are%20written.:

import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0,1,2,3,4,5,6,7'

from datasets import load_dataset
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, AutoTokenizer, TrainingArguments
import bitsandbytes as bnb
from peft import LoraConfig
from trl import SFTTrainer

from accelerate import Accelerator
accelerator = Accelerator()

base_model_name = "/home/yangtong/data/llama2-hf/llama2-13b-chat_hf"

dataset = load_dataset("json", data_files="Belle_open_source_0.5M_changed.json", split="train")

result_dir = "tmp"
training_args = TrainingArguments(
    report_to="wandb",
    output_dir=result_dir, 
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    learning_rate=2e-4, 
    logging_steps=10, 
    # max_steps=520,
    num_train_epochs=0.037,
    save_steps=500, 
    bf16=True, 
    gradient_checkpointing=True
)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True, 
    bnb_4bit_use_double_quant=True, 
    bnb_4bit_quant_type="nf4", 
    bnb_4bit_compute_dtype=torch.bfloat16,  
    bnb_4bit_quant_storage=torch.bfloat16
)

base_model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    quantization_config=bnb_config, 
    torch_dtype=torch.bfloat16
)
base_model.config.use_cache = False
base_model.config.pretraining_tp = 1

# LoRA target modules: the LLaMA attention and MLP projections
lora_target_modules = ['v_proj', 'gate_proj', 'down_proj', 'k_proj', 'q_proj', 'o_proj', 'up_proj']

peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    r=64,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=lora_target_modules
)

tokenizer = AutoTokenizer.from_pretrained(base_model_name, trust_remote_code=True)
tokenizer.deprecation_warnings["Asking-to-pad-a-fast-tokenizer"] = True
tokenizer.pad_token = tokenizer.eos_token

max_seq_length = 512  
trainer = SFTTrainer(
    model=base_model,
    train_dataset=dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_args
)

trainer.train()

output_dir = os.path.join(result_dir, "final_checkpoint")
trainer.model.save_pretrained(output_dir)
# trainer.save_model(output_dir)  # Stage-3
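
A rough way to tell whether the weights were actually sharded at load time is to add a probe like the one below right after from_pretrained(). The ds_* attributes it checks are DeepSpeed ZeRO-3 internals rather than a public API, so this is only a sketch and not part of the script above.

# Rough probe (sketch only): DeepSpeed ZeRO-3 attaches internal attributes such
# as `ds_id` to parameters it has partitioned; these names are internals, not a
# public API, and may change between versions.
import torch
import torch.distributed as dist

def report_footprint(model, tag):
    partitioned = sum(1 for p in model.parameters() if hasattr(p, "ds_id"))
    total = sum(1 for _ in model.parameters())
    rank = dist.get_rank() if dist.is_initialized() else 0
    print(f"[rank {rank}] {tag}: {partitioned}/{total} parameters partitioned, "
          f"{torch.cuda.memory_allocated() / 1024**3:.2f} GiB allocated on GPU")

report_footprint(base_model, "after from_pretrained")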

Here is my accelerate config:

compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  deepspeed_config_file: /home/yangtong/ft_dis/ds_config/3.json
  zero3_init_flag: true
distributed_type: DEEPSPEED
downcast_bf16: 'no'
enable_cpu_affinity: false
machine_rank: 0
main_training_function: main
num_machines: 1
num_processes: 4
rdzv_backend: 'c10d'
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

Here is my DeepSpeed ZeRO-3 config:

{
  "optimizer": {
    "type": "AdamW",
    "params": {
        "lr": 2e-4,
        "betas": [
          0.9,
          0.999
        ],
        "eps": "auto",
        "weight_decay": "auto",
        "adam_w_mode": true,
        "torch_adam": true
    }
  },

  "scheduler": {
    "type": "WarmupDecayLR",
    "params": {
        "warmup_min_lr": "auto",
        "warmup_max_lr": "auto",
        "warmup_num_steps": "auto",
        "total_num_steps": "auto"
    }
  },

  "zero_optimization": {
    "stage": 3,
    "allgather_partitions": true,
    "allgather_bucket_size": 2e8,
    "reduce_scatter": true,
    "reduce_bucket_size": 2e8,
    "contiguous_gradients": true,
    "overlap_comm": true,
    "offload_optimizer": {
      "device": "none",
      "pin_memory": true
    },
    "offload_param": {
      "device": "cpu",
      "pin_memory": true
    },
    "sub_group_size": 1e9,
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "bf16": {
    "enabled": true
  },
  "gradient_clipping": "auto",
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": 1,
  "gradient_accumulation_steps": 16,
  "wall_clock_breakdown": false
}

Here is my launch command:

accelerate launch \
--config_file "config/z3_3.yaml" \
--num_processes 1 \
ft_acc.py

Expected behavior

Successfully use zero3_init to construct llama2: the parameters are sharded first, and then the shards are loaded onto the GPUs.

BenjaminBossan commented 4 weeks ago

Thanks for reporting. I don't have much experience with DeepSpeed, so I haven't encountered the error you mentioned yet. Just a question: Did you try loading the model without LoRA? If not, could you check if this still produces the same error (i.e. bnb preventing zero init)? Probably you won't be able to train without LoRA, but just loading and checking memory should still work.

CHNRyan commented 3 weeks ago

@BenjaminBossan Thank you for your prompt reply! I tried your suggestion and ran bnb without LoRA. Unfortunately, I got an error during trainer.train(), not because of OOM but: "ValueError: You cannot perform fine-tuning on purely quantized models. Please attach trainable adapters on top of the quantized model to correctly perform fine-tuning.". It seems bnb can't be used without an adapter. One thing I want to point out: in the QLoRA case, the parameters are only loaded into CPU memory during from_pretrained(), and GPU memory rises by less than 1 GB (maybe the quantization constants). Then, when trainer.train() runs, the parameters are fully loaded onto each GPU and only then sharded. With bnb but without LoRA, the same thing happens during from_pretrained(), but because of the ValueError above, trainer.train() doesn't run, so we can't see whether it would work.

BenjaminBossan commented 3 weeks ago

Unfortunately, I got an error during trainer.train(), not because of OOM but [...]

Ah yes, that makes sense, the quantized weights cannot be used for training. But as I mentioned, this was just for testing the initialization, so the training call can be removed completely. Maybe you can add a time.sleep instead to prevent the script from exiting instantly, so that you can read the memory usage.
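
Something along these lines should be enough (just a sketch, reusing the model path and quantization config from your script, with training removed):

# Sketch only: load the quantized base model under the same accelerate/DeepSpeed
# launch, skip training, and keep the process alive so per-GPU memory can be
# inspected with nvidia-smi. Path and config values are copied from the script above.
import time
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from accelerate import Accelerator

accelerator = Accelerator()

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_storage=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "/home/yangtong/data/llama2-hf/llama2-13b-chat_hf",
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
)

print(f"memory_reserved after load: {torch.cuda.memory_reserved() / 1024**3:.2f} GiB")
time.sleep(300)  # keep the process alive long enough to read the memory usage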

One thing I want to point out: in the QLoRA case, the parameters are only loaded into CPU memory during from_pretrained(), and GPU memory rises by less than 1 GB (maybe the quantization constants). Then, when trainer.train() runs, the parameters are fully loaded onto each GPU and only then sharded.

Thanks for giving more details. What I imagine is happening is that the LoRA weights are not initialized through DeepSpeed's zero.Init, which is why you see a little bit of extra memory. Maybe you could check the total number of LoRA parameters that are loaded onto your model and calculate the extra memory required by those parameters. If this extra memory matches up with the 1 GB that you observed, it makes it quite likely that it's the LoRA adapter.

What you could also try is to increase or decrease the rank r in the LoraConfig and check whether that affects the amount of extra memory.
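
As a rough sketch of that check (assuming the PEFT-wrapped model is reachable as trainer.model once the SFTTrainer has been built; the helper below is only illustrative):

# Count the LoRA parameters PEFT attached and convert that to an expected
# memory footprint, to compare against the extra memory you observed.
def lora_memory(model, bytes_per_param=2):  # 2 bytes per parameter in bf16
    # getattr falls back to numel(); ds_numel is a DeepSpeed ZeRO-3 internal
    # holding the full (unpartitioned) size if a parameter has been sharded.
    n = sum(
        getattr(p, "ds_numel", p.numel())
        for name, p in model.named_parameters()
        if "lora_" in name
    )
    return n, n * bytes_per_param / 1024**3

n_params, gib = lora_memory(trainer.model)
print(f"LoRA parameters: {n_params:,} (~{gib:.2f} GiB in bf16)")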

CHNRyan commented 3 weeks ago

Maybe you could check the total number of LoRA parameters that are loaded onto your model and calculate the extra memory required by those parameters. If this extra memory matches up with the 1 GB that you observed, it makes it quite likely that it's the LoRA adapter. What you could also try is to increase or decrease the rank r in the LoraConfig and check whether that affects the amount of extra memory.

I can confirm that the extra memory is not the LoRA adapter, because it appears during from_pretrained(), before the LoRA adapter has been loaded, and I get the same result after removing peft_config. I used the following code, with torch.cuda.memory_reserved() calls added in the source code, to measure memory usage. The extra memory scales with the model size (llama2_7B takes about 0.58 GB and llama2_13B about 0.80 GB).

import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0,1,2,3,4,5,6,7'

from datasets import load_dataset
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, AutoTokenizer, TrainingArguments
import bitsandbytes as bnb
from peft import LoraConfig
from trl import SFTTrainer

from accelerate import Accelerator
accelerator = Accelerator()

base_model_name = "/home/yangtong/data/llama2-hf/llama2-7b-chat_hf"

dataset = load_dataset("json", data_files="Belle_open_source_0.5M_changed.json", split="train")

result_dir = "tmp"
training_args = TrainingArguments(
    report_to="wandb",
    output_dir=result_dir, 
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    learning_rate=2e-4, 
    logging_steps=10, 
    # max_steps=520,
    num_train_epochs=0.037,
    save_steps=500, 
    bf16=True, 
    gradient_checkpointing=True
)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True, 
    bnb_4bit_use_double_quant=True, 
    bnb_4bit_quant_type="nf4", 
    bnb_4bit_compute_dtype=torch.bfloat16,  
    bnb_4bit_quant_storage=torch.bfloat16
)

base_model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    quantization_config=bnb_config, 
    torch_dtype=torch.bfloat16
)
base_model.config.use_cache = False
base_model.config.pretraining_tp = 1
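
The measurement itself is essentially just the calls below; they are shown inline after from_pretrained() for illustration, whereas in my runs the equivalent calls were added inside the library source:

# Illustration only: read the CUDA caching-allocator counters right after loading.
GiB = 1024 ** 3
print(f"memory_reserved:  {torch.cuda.memory_reserved() / GiB:.2f} GiB")
print(f"memory_allocated: {torch.cuda.memory_allocated() / GiB:.2f} GiB")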

Ah yes, that makes sense, the quantized weights cannot be used for training. But as I mentioned, this was just for testing the initialization, so the training call can be removed completely.

I also used the code above to measure memory and check whether zero_init ran successfully. When I use the entire code above, only the "extra memory" ends up on each GPU, as in the picture below:

[image]

When I run the code with bnb_config removed, I get the correct result (the 7B model sharded across 8 GPUs):

[image]

So, with or without LoRA, bnb leads to the incorrect result under zero_init, while without bnb it succeeds. So it really does look like bnb prevents zero_init.

BenjaminBossan commented 3 weeks ago

Thanks for investigating further. So this is not a PEFT problem, instead it looks like a bitsandbytes or accelerate or DeepSpeed issue (or a combination of them not working together correctly). You may want to search their issues and create a new one if you don't find a solution there (make sure to remove the PEFT portion to show it's not PEFT-related).

I checked the PEFT docs for DeepSpeed and there my colleague has successfully trained QLoRA with zero3_init_flag: true. However, you mention that this is just about requiring a bit of extra memory, not that it doesn't work at all. So maybe that's the same thing that happened for them.

CHNRyan commented 3 weeks ago

@BenjaminBossan Thanks for your patient response! That's right, and I will open another issue under bnb. In fact, I can also successfully train QLoRA with zero3_init_flag: true. The key problem is that the parameters are fully loaded onto each GPU and then sharded, whereas the correct process is to shard first and then load the sharded parameters onto the GPUs.