Aillian commented 1 year ago

The bug: script just hangs when it starts training the model after loading cpu_adam op...

i have noticed that the same issue is happening to many people, i tried many solutions...

2176 suggests to `console NCCL_P2P_DISABLE=1`

3416 suggests to `rm -rf /home/ga2530/.cache/torch_extensions/py310_cu116`

4285 suggests to change `TORCH_EXTENSIONS_DIR` environment variable

but nothing works...

Here is the script output:

Loading extension module cpu_adam...
Time to load cpu_adam op: 39.9292848110199 seconds
Loading extension module cpu_adam...
Time to load cpu_adam op: 39.927759408950806 seconds
Loading extension module cpu_adam...
Time to load cpu_adam op: 39.955281257629395 seconds
Parameter Offload: Total persistent parameters: 11800576 in 417 params
  0%|                                                                                               | 0/6 [00:00<?, ?it/s]
You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
  0%|                                                                                               | 0/6 [00:02<?, ?it/s]
  0%|                                                                                               | 0/12 [00:00<?, ?it/s]

Here is the deepspeed_config.yaml file:

compute_environment: LOCAL_MACHINE
deepspeed_config:
  deepspeed_config_file: ds_config.json
  zero3_init_flag: true
distributed_type: DEEPSPEED
downcast_bf16: 'no'
dynamo_backend: 'NO'
fsdp_config: {}
machine_rank: 0
main_training_function: main
megatron_lm_config: {}
num_machines: 1
num_processes: 3
rdzv_backend: static
same_network: true
use_cpu: false

Here is the ds_config.json file:

{
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "steps_per_print": 30,
    "wall_clock_breakdown": false,
    "dump_state": true,

    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },

    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto"
        }
    },

    "fp16": {
        "enabled": "auto"
    },

    "zero_optimization": {
        "stage": 3,
        "overlap_comm": true,

        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "offload_param": {
            "device": "cpu",
            "pin_memory": true
        },

        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_gather_16bit_weights_on_model_save": true
    },

"tensorboard": {
    "enabled": true,
    "output_path": "logs/",
    "job_name": "train_daeya_llm"
    },

"csv_monitor": {
    "enabled": true,
    "output_path": "logs/",
    "job_name": "train_daeya_llm"
    }

}

Here is my code:

# %%
import pandas as pd
import os
import pickle
from sklearn.model_selection import train_test_split
from datasets import Dataset, DatasetDict
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, AutoTokenizer, TrainingArguments, AutoConfig, DataCollatorForLanguageModeling
import transformers
import bitsandbytes as bnb
from trl import SFTTrainer
from peft import LoraConfig, get_peft_model, PeftModel, prepare_model_for_kbit_training
from pprint import pprint
import argparse
from accelerate import Accelerator

# %%
def parse_arge():
    """Parse the arguments."""
    parser = argparse.ArgumentParser()
    # model args
    parser.add_argument("--model_id", type=str, default="meta-llama/Llama-2-70b-hf",
                        help="Model id to use for training.")
    parser.add_argument("--load_in_8bit", type=bool,
                        default=False, help="Load model in 8 bits.")
    parser.add_argument("--load_in_4bit", type=bool,
                        default=True, help="Load model in 4 bits.")
    parser.add_argument("--bnb_4bit_use_double_quant", type=bool,
                        default=True, help="Use 4 bits double quantization.")
    parser.add_argument("--bnb_4bit_quant_type", type=str,
                        default="nf4", help="4 bits quantization type.")
    parser.add_argument("--gradient_checkpointing_enable", type=bool,
                        default=False, help="Enable gradient checkpointing.")

    # lora args
    parser.add_argument("--lora_r", type=int, default=8, help="Lora rank.")
    parser.add_argument("--lora_alpha", type=int,
                        default=16, help="Lora alpha.")
    parser.add_argument("--lora_dropout", type=float,
                        default=0.05, help="Lora dropout.")

    # dataset args
    parser.add_argument("--dataset_path", type=str, default="Chats_Dataset_Cleaned_(without_system)",
                        help="Path to the already processed dataset.")

    # training args
    parser.add_argument("--epochs", type=int, default=3,
                        help="Number of epochs to train for.")
    parser.add_argument("--gradient_accumulation_steps",
                        type=int, default=4, help="Gradient accumulation steps.")
    parser.add_argument("--auto_find_batch_size", type=bool,
                        default=True, help="Batch size to use for testing.")
    parser.add_argument("--lr", type=float, default=5e-5,
                        help="Learning rate to use for training.")
    parser.add_argument("--optimizer", type=str,
                        default='paged_adamw_32bit', help="optimizer.")
    parser.add_argument("--fp16", type=bool, default=True, help="Use fp16.")
    parser.add_argument("--bf16", type=bool, default=False, help="Use bf16.")
    parser.add_argument("--deepspeed", type=str, default='ds_config.json',
                        help="Path to deepspeed config file.")

    args = parser.parse_known_args()
    return args

# %%
def load_and_prepare_data(dataset_path):

    print('############################################## LOADING AND PREPARING DATA ##############################################')

    df = pd.read_csv(dataset_path)

    df = df.head(300)

    train_df, test_df = train_test_split(df, test_size=0.05, random_state=77)

    ds = DatasetDict({
        "train": Dataset.from_pandas(train_df),
        "test": Dataset.from_pandas(test_df)})
    return ds

# %%
def load_and_prepare_model_and_tokenizer(epochs, gradient_accumulation_steps, auto_find_batch_size, lr, optimizer, fp16, bf16, deepspeed, model_id, load_in_8bit, load_in_4bit, bnb_4bit_use_double_quant, bnb_4bit_quant_type, gradient_checkpointing_enable):

    print('############################################## LOADING AND PREPARING MODEL & TOKENIZER ##############################################')

    if load_in_8bit:
        bnb_config = BitsAndBytesConfig(load_in_8bit=load_in_8bit)
    else:
        bnb_config = BitsAndBytesConfig(
            load_in_4bit=load_in_4bit,
            bnb_4bit_use_double_quant=bnb_4bit_use_double_quant,
            bnb_4bit_quant_type=bnb_4bit_quant_type,
            bnb_4bit_compute_dtype=torch.float16)

    device_index = Accelerator().process_index
    device_map = {"": device_index}

    training_arguments = TrainingArguments(
        gradient_accumulation_steps=gradient_accumulation_steps,
        warmup_steps=2,
        # max_steps=500,
        # per_device_train_batch_size=8,
        # per_device_eval_batch_size=8,
        auto_find_batch_size=auto_find_batch_size,
        num_train_epochs=epochs,
        learning_rate=lr,
        lr_scheduler_type="cosine",
        do_eval=True,
        logging_dir='logs',
        logging_strategy='steps',
        logging_steps=20,
        save_strategy='steps',
        save_steps=40,
        optim=optimizer,
        max_grad_norm=0.3,
        warmup_ratio=0.03,
        weight_decay=0.001,
        remove_unused_columns=True,
        evaluation_strategy="steps",
        load_best_model_at_end=True,
        fp16=fp16,
        bf16=bf16,  # set True with A100
        group_by_length=True,
        save_total_limit=10,
        seed=77,
        output_dir="checkpoints",
        overwrite_output_dir=True,
        report_to='tensorboard',
        deepspeed=deepspeed
    )

    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=bnb_config,
        trust_remote_code=True,
        torch_dtype=torch.float16,
        # device_map=device_map,
        token='XXX')

    model.config.use_cache = False

    tokenizer = AutoTokenizer.from_pretrained(
        model_id, trust_remote_code=True, token='XXX')
    tokenizer.add_special_tokens({'pad_token': '[PAD]'})
    model.resize_token_embeddings(len(tokenizer))

    if gradient_checkpointing_enable:
        model.gradient_checkpointing_enable()

    return model, tokenizer, training_arguments

# %%
def prepare_trainer(training_arguments, lora_r, lora_alpha, lora_dropout, model, model_id, load_in_4bit, tokenizer, dataset):

    print('############################################## PREPARING TRAINER ##############################################')

    peft_config = LoraConfig(
        lora_alpha=lora_alpha,
        lora_dropout=lora_dropout,
        r=lora_r,
        bias="none",
        task_type="CAUSAL_LM",
        target_modules=['down_proj', 'up_proj', 'v_proj', 'k_proj', 'gate_proj', 'o_proj', 'q_proj'])

    def print_trainable_parameters(model, use_4bit=False):
        """
        Prints the number of trainable parameters in the model.
        """
        trainable_params = 0
        all_param = 0
        for _, param in model.named_parameters():
            num_params = param.numel()
            # if using DS Zero 3 and the weights are initialized empty
            if num_params == 0 and hasattr(param, "ds_numel"):
                num_params = param.ds_numel

            all_param += num_params
            if param.requires_grad:
                trainable_params += num_params
        if use_4bit:
            trainable_params = int(trainable_params/2)
        print(
            f"All Params: {all_param:,d} || Trainable Params: {trainable_params:,d} || Trainable (%): {100 * trainable_params / all_param}")

    print_trainable_parameters(model, use_4bit=load_in_4bit)

    config_for_max_length = AutoConfig.from_pretrained(model_id)

    max_length = config_for_max_length.max_position_embeddings

    trainer = SFTTrainer(
        model=model,
        train_dataset=dataset['train'],
        eval_dataset=dataset['test'],
        peft_config=peft_config,
        dataset_text_field="chat",
        packing=False,
        max_seq_length=max_length,
        tokenizer=tokenizer,
        args=training_arguments,
        data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False, pad_to_multiple_of=8))

    for name, module in trainer.model.named_modules():
        if "norm" in name:
            module = module.to(torch.float32)

    return trainer

# %%
def train_model():

    args, _ = parse_arge()

    # model args
    model_id = args.model_id
    load_in_8bit = args.load_in_8bit
    load_in_4bit = args.load_in_4bit
    bnb_4bit_use_double_quant = args.bnb_4bit_use_double_quant
    bnb_4bit_quant_type = args.bnb_4bit_quant_type
    gradient_checkpointing_enable = args.gradient_checkpointing_enable
    # lora args
    lora_r = args.lora_r
    lora_alpha = args.lora_alpha
    lora_dropout = args.lora_dropout
    # dataset args
    dataset_path = args.dataset_path
    # training args
    epochs = args.epochs
    gradient_accumulation_steps = args.gradient_accumulation_steps
    auto_find_batch_size = args.auto_find_batch_size
    lr = args.lr
    optimizer = args.optimizer
    fp16 = args.fp16
    bf16 = False
    deepspeed = args.deepspeed

    dataset = load_and_prepare_data(dataset_path)

    model, tokenizer, training_arguments = load_and_prepare_model_and_tokenizer(epochs, gradient_accumulation_steps, auto_find_batch_size, lr, optimizer, fp16, bf16, deepspeed, model_id, load_in_8bit, load_in_4bit, bnb_4bit_use_double_quant, bnb_4bit_quant_type, gradient_checkpointing_enable)

    trainer = prepare_trainer(training_arguments, lora_r, lora_alpha, lora_dropout, model, model_id, load_in_4bit, tokenizer, dataset)

    print('############################################## TRAINING MODEL ##############################################')

    trainer.train()

    trainer.evaluate()

    if not os.path.exists("Model_Results"):
        os.makedirs("Model_Results")
    trainer.save_model("Model_Results")

# %%
def main():
    args, _ = parse_arge()
    train_model()

# %%
if __name__ == "__main__":
    main()

Here is my launcher:

accelerate launch --config_file deepspeed_config.yaml LLAMA_2_70B_Fine_Tuning_Deepspeed.py \
--model_id meta-llama/Llama-2-7b-hf \
--load_in_8bit True \
--load_in_4bit False \
--bnb_4bit_use_double_quant False \
--bnb_4bit_quant_type nf4 \
--gradient_checkpointing_enable True \
--lora_r 8 \
--lora_alpha 16 \
--lora_dropout 0.05 \
--dataset_path Chats_Dataset_Cleaned_without_system.csv \
--epochs 2 \
--gradient_accumulation_steps 4 \
--auto_find_batch_size True \
--lr 5e-5 \
--optimizer paged_adamw_8bit \
--fp16 True \
--bf16 False \
--deepspeed ds_config.json

OR

deepspeed LLAMA_2_70B_Fine_Tuning_Deepspeed.py \
--model_id meta-llama/Llama-2-7b-hf \
--load_in_8bit True \
--load_in_4bit False \
--bnb_4bit_use_double_quant False \
--bnb_4bit_quant_type nf4 \
--gradient_checkpointing_enable True \
--lora_r 8 \
--lora_alpha 16 \
--lora_dropout 0.05 \
--dataset_path Chats_Dataset_Cleaned_without_system.csv \
--epochs 2 \
--gradient_accumulation_steps 4 \
--auto_find_batch_size True \
--lr 5e-5 \
--optimizer paged_adamw_8bit \
--fp16 True \
--bf16 False \
--deepspeed ds_config.json

Expected behavior: Model starts to train

ds_report output:

[2023-10-03 14:45:31,556] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
async_io ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.0
 [WARNING]  using untested triton version (2.0.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/usr/local/lib/python3.10/dist-packages/torch']
torch version .................... 2.0.1+cu117
deepspeed install path ........... ['/usr/local/lib/python3.10/dist-packages/deepspeed']
deepspeed info ................... 0.10.3, unknown, unknown
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 11.8
deepspeed wheel compiled w. ...... torch 1.13, cuda 11.7
shared memory (/dev/shm) size .... 94.00 GB

System info:

OS: Ubuntu 22.04.2 LTS
GPU count and types: 3X RTX A6000
Python version: 3.10.6
Running this setup on runpod.io

Launcher: Both deepspeed and accelerate accelauncher

jomayeri commented 1 year ago

@Aillian Based on the screenshot it looks to be stuck downloading the Llama model. Did the download every complete?

Aillian commented 1 year ago

Yes, based on the progress bar output from 'AutoModelForCausalLM.from_pretrained' it shows that the model download is completed...

Update: when i train on a single gpu i get: "AttributeError: 'NoneType' object has no attribute 'backward' "

Maybe that would help in debugging the issue

Deema2 commented 1 year ago

I am facing the same issue! Help please

FaresDev8 commented 1 year ago

I have encountered the same issue with no luck fixing it, any help would be appreciated.

jomayeri commented 1 year ago

It gets past the stuck point on a single GPU?

jzm930203 commented 1 year ago

same problem and trys NCCL_P2P_DISABLE=1 as well ，but nothing changed， debug find ,it hangs at any help would be appreciated. PS : i use qwen project for 2-nodes and 4-gpu training. torch 1.13.1 tqdm 4.66.1 transformers 4.32.0 deepspeed 0.11.1 accelerate 0.24.0

mhillebrand commented 1 year ago

I just started encountering this same issue. It only happens when I try to finetune a Llama2 derivative but not when I finetune Mistral or Zephyr. Hmmm.

Traceback (most recent call last):                                                                                   | 0/595 [00:00<?, ?it/s]
  File "/home/matt/topics/finetune.py", line 23, in <module>
    trainer.train()
  File "/home/matt/lora/lora_trainer.py", line 115, in train
    self.trainer.train()
  File "/home/matt/miniconda3/envs/lora/lib/python3.11/site-packages/trl/trainer/sft_trainer.py", line 290, in train
    output = super().train(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/matt/miniconda3/envs/lora/lib/python3.11/site-packages/transformers/trainer.py", line 1556, in train
    return inner_training_loop(
           ^^^^^^^^^^^^^^^^^^^^
  File "/home/matt/miniconda3/envs/lora/lib/python3.11/site-packages/accelerate/utils/memory.py", line 136, in decorator
    return function(batch_size, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/matt/miniconda3/envs/lora/lib/python3.11/site-packages/transformers/trainer.py", line 1872, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/matt/miniconda3/envs/lora/lib/python3.11/site-packages/transformers/trainer.py", line 2748, in training_step
    self.accelerator.backward(loss)
  File "/home/matt/miniconda3/envs/lora/lib/python3.11/site-packages/accelerate/accelerator.py", line 1980, in backward
    self.deepspeed_engine_wrapped.backward(loss, **kwargs)
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'backward'

Python 3.11.4 CUDA 12.1.1 torch 2.1.1 transformers 4.35.2 peft 0.6.2 trl 0.7.4 accelerate 0.24.1 deepspeed 0.12.3

mhillebrand commented 1 year ago

A workaround appears to be disabling auto_find_batch_size. 🙂

patrick-tssn commented 7 months ago

Could anyone provide an update or a solution to the issue at hand?

eslambakr commented 7 months ago

Any updates regarding this issue?

Aillian commented 7 months ago

Very poor support for this repo, very disappointed...

jwh97nn commented 6 months ago

Based on my experience, I suggest deleting the folder .cache/torch_extensions/py310_cu116 and then reinstalling deepspeed.

7-Z-7 commented 5 months ago

Same issue, only can work when set Zero0, but Zero2 and 3 failed

Yiqiu-Zhang commented 4 months ago

First, I deleted ~/.cache/torch_extensions/py310_cu116. it worked for a moment and hangs at anather place. Then I reinstall deepspeed, and now it works.

SystemErrorWang commented 4 months ago

same here, my training got stuck after 4 steps, and showing this info: [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 536870912, reducing to 268435456

microsoft / DeepSpeed

[BUG] Training gets stuck when model starts training #4443

2176 suggests to `console NCCL_P2P_DISABLE=1`

3416 suggests to `rm -rf /home/ga2530/.cache/torch_extensions/py310_cu116`

4285 suggests to change `TORCH_EXTENSIONS_DIR` environment variable

microsoft / DeepSpeed

[BUG] Training gets stuck when model starts training #4443

2176 suggests to console NCCL_P2P_DISABLE=1

3416 suggests to rm -rf /home/ga2530/.cache/torch_extensions/py310_cu116

4285 suggests to change TORCH_EXTENSIONS_DIR environment variable

2176 suggests to `console NCCL_P2P_DISABLE=1`

3416 suggests to `rm -rf /home/ga2530/.cache/torch_extensions/py310_cu116`

4285 suggests to change `TORCH_EXTENSIONS_DIR` environment variable