microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

[HELP] ZeRO3 partition parameters after fully load to each GPU! #5617

Closed CHNRyan closed 1 week ago

CHNRyan commented 3 weeks ago

Describe the bug
I'm fine-tuning Llama2 using DeepSpeed ZeRO-3. I found that the parameters are loaded into CPU memory during from_pretrained, and at the beginning of trainer.train() they are fully replicated onto every GPU WITHOUT ANY PARTITIONING. Only after that are they partitioned across the GPUs.

To Reproduce
Here is my code:

import os

from datasets import load_dataset
import torch
from transformers import AutoModelForCausalLM, LlamaForCausalLM, BitsAndBytesConfig, AutoTokenizer, TrainingArguments
import bitsandbytes as bnb
from peft import LoraConfig
from trl import SFTTrainer

base_model_name ="/home/yangtong/data/llama2-hf/llama2-13b-chat_hf"

dataset = load_dataset("json",data_files="Belle_open_source_0.5M_changed.json",split="train")

result_dir = "tmp"
training_args = TrainingArguments(
    report_to="none",
    output_dir=result_dir, 
    # per_device_train_batch_size * gradient_accumulation_steps = batch_size
    per_device_train_batch_size=1, 
    gradient_accumulation_steps=16, 
    learning_rate=2e-4, 
    logging_steps=10, 
    # max_steps=520, 
    num_train_epochs=0.016, 
    save_steps=500, 
    bf16 = True,  # set bf16 to True with an A100
    # optim='paged_adamw_32bit',
    gradient_checkpointing=True
)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True, 
    bnb_4bit_use_double_quant=True, 
    bnb_4bit_quant_type="nf4", 
    bnb_4bit_compute_dtype=torch.bfloat16, 
)

base_model = LlamaForCausalLM.from_pretrained(
    base_model_name, 
    quantization_config=bnb_config, 
    torch_dtype=torch.bfloat16,
    trust_remote_code=True
)
base_model.config.use_cache = False
base_model.config.pretraining_tp = 1

def find_all_linear_names(model):
    cls = bnb.nn.Linear4bit
    lora_module_names = set()
    for name, module in model.named_modules():
        if isinstance(module, cls):
            names = name.split('.')
            lora_module_names.add(names[0] if len(names) == 1 else names[-1])
    if 'lm_head' in lora_module_names: # needed for 16-bit
        lora_module_names.remove('lm_head')
    return list(lora_module_names)
models=find_all_linear_names(base_model)

peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    r=64,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=models
)

tokenizer = AutoTokenizer.from_pretrained(base_model_name, trust_remote_code=True)
tokenizer.deprecation_warnings["Asking-to-pad-a-fast-tokenizer"] = True
tokenizer.pad_token = tokenizer.eos_token

max_seq_length = 512  
trainer = SFTTrainer(
    model=base_model,
    train_dataset=dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_args
)

trainer.train()

output_dir = os.path.join(result_dir, "final_checkpoint")
trainer.model.save_pretrained(output_dir)

Here is my accelerate config:

compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  deepspeed_config_file: /home/yangtong/ft_dis/ds_config/3.json
  zero3_init_flag: true
distributed_type: DEEPSPEED
downcast_bf16: 'no'
enable_cpu_affinity: false
machine_rank: 0
main_training_function: main
num_machines: 1
num_processes: 4
rdzv_backend: 'c10d'
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

Here is my deepspeed config:

{
  "optimizer": {
    "type": "AdamW",
    "params": {
        "lr": 2e-4,
        "betas": [
          0.9,
          0.999
        ],
        "eps": "auto",
        "weight_decay": "auto",
        "adam_w_mode": true,
        "torch_adam": true
    }
  },

  "scheduler": {
    "type": "WarmupDecayLR",
    "params": {
        "warmup_min_lr": "auto",
        "warmup_max_lr": "auto",
        "warmup_num_steps": "auto",
        "total_num_steps": "auto"
    }
  },

  "zero_optimization": {
    "stage": 3,
    "allgather_partitions": true,
    "allgather_bucket_size": 2e8,
    "reduce_scatter": true,
    "reduce_bucket_size": 2e8,
    "contiguous_gradients": true,
    "overlap_comm": true,
    "offload_optimizer": {
      "device": "none",
      "pin_memory": true
    },
    "offload_param": {
      "device": "cpu",
      "pin_memory": true
    },
    "sub_group_size": 1e9,
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "bf16": {
    "enabled": true
  },
  "gradient_clipping": "auto",
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": 1,
  "gradient_accumulation_steps": 16,
  "wall_clock_breakdown": false
}

Expected behavior
Parameters should be partitioned first and only then loaded onto the GPUs.

System info (please complete the following information):

Launcher context

accelerate launch \
--config_file "config/z3_3.yaml" \
--num_processes 1 \
ft_acc.py

I would truly appreciate it if anyone can help me solve this! @loadams @tjruwase @deepcharm

jomayeri commented 3 weeks ago

You can try to use zero_init, but I believe for HF the correct method is to place HfDeepSpeedConfig before the from_pretrained method. See this comment and issue thread https://github.com/microsoft/DeepSpeed/issues/3168#issuecomment-1546151533
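
For reference, a minimal sketch of that pattern (the config path and model name below are placeholders, not files from this issue). The key point is that the HfDeepSpeedConfig object is created, and kept alive, before from_pretrained, so the weights are loaded under zero.Init and partitioned immediately:

# Hedged sketch; in older transformers versions the import is transformers.deepspeed.HfDeepSpeedConfig.
from transformers import AutoModelForCausalLM
from transformers.integrations import HfDeepSpeedConfig

ds_config_path = "/path/to/ds_config.json"   # placeholder: the same ZeRO-3 JSON passed to the launcher
dschf = HfDeepSpeedConfig(ds_config_path)    # must exist (and stay referenced) before the model is built
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-13b-chat-hf")  # now loads partitioned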

CHNRyan commented 3 weeks ago

> You can try to use zero_init, but I believe for HF the correct method is to place HfDeepSpeedConfig before the from_pretrained method. See this comment and issue thread #3168 (comment)

@jomayeri Thanks for your reply! I use DeepSpeed with Accelerate, and zero_init is set in my Accelerate config ("zero3_init_flag: true"). I also put the TrainingArguments before from_pretrained to make sure I'm using zero_init. But I didn't use HfDeepSpeedConfig; should I use it together with the HF Trainer?

CHNRyan commented 3 weeks ago

And I also found that in deepspeed/runtime/engine.py, when _configure_distributed_model runs, is_zero_init_model is evaluated to False even though I use "zero3_init_flag: true". After that, all parameters are loaded onto the GPUs.
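
As a quick user-side check (a hedged sketch that relies on DeepSpeed internals such as the ds_id attribute, which may change between versions): when zero.Init was actually used, the parameters returned by from_pretrained already carry ZeRO-3 attributes and an empty local shape, so partitioning can be verified before the engine is built:

# Hedged diagnostic: does the freshly loaded model look ZeRO-3 partitioned?
def looks_zero_init(model):
    for name, p in model.named_parameters():
        if hasattr(p, "ds_id"):  # attribute attached by deepspeed.zero.Init to partitioned params
            print(f"{name}: partitioned, local shape {tuple(p.shape)}")
            return True
    print("no ds_id attributes found: parameters were fully materialized")
    return False

looks_zero_init(base_model)  # call right after from_pretrained, before trainer.train()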

jomayeri commented 2 weeks ago

Hmm, the config might not be passed properly from the Trainer. I'll investigate that. Can you check if adding HfDeepSpeedConfig before from_pretrained works for you?

CHNRyan commented 2 weeks ago

@jomayeri According to "However, if you want to use DeepSpeed without the [Trainer](https://huggingface.co/docs/transformers/v4.41.3/en/main_classes/trainer#transformers.Trainer), Transformers provides a HfDeepSpeedConfig class." in https://huggingface.co/docs/transformers/main_classes/deepspeed, I think I don't need to add HfDeepSpeedConfig: once I set up ZeRO-3 before from_pretrained(), the Trainer (or trl) should do the right thing automatically. I also found that it may be bitsandbytes that prevents zero_init, because the script runs successfully after I remove bnb_config.
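
For completeness, this is essentially the load call that worked, i.e. the same from_pretrained as in the script above but without quantization_config (a sketch of the working path, not a recommendation to drop quantization):

base_model = LlamaForCausalLM.from_pretrained(
    base_model_name,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True
)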

CHNRyan commented 2 weeks ago

@jomayeri Maybe you can see https://github.com/microsoft/DeepSpeed/issues/5660 for more details. Thanks!

jomayeri commented 1 week ago

Closing in favor of https://github.com/microsoft/DeepSpeed/issues/5660