[BUG] Training time regression with ZeRO-3 after upgrade to torch 2.3.1 and CUDA 12.1

Describe the bug For ZeRO-3, i'm noticing an increase in training times on g5.48xlarge nodes with torch >= 2.3.1 and CUDA 12.1. I can reproduce this with small and large models, and in some cases this is a 1.5x slowdown (noticed with Llama-2-13b with a 8192 context length run).

I've had some difficulty reproducing this with smaller number of devices (for example, 2 GPUs vs 8). I also can't reproduce this on 4xA100 nodes, so I'm guessing some hardware dependence.

To Reproduce I have a basic script with accelerate + deepspeed. I've basically patched up some profiing code to one of the example scripts from HuggingFace. The script below is for a 500M Llama model, but you can repro for 7B or 13B as well.

# coding=utf-8
# Copyright 2021 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import time
import os
import argparse
from tqdm.auto import tqdm
from contextlib import nullcontext, contextmanager

from datasets.metric import FileLock
import evaluate
import torch
from datasets import load_dataset
from torch.optim import AdamW
from torch.utils.data import DataLoader
from transformers import AutoModelForSequenceClassification, AutoTokenizer, get_linear_schedule_with_warmup, set_seed

from accelerate import Accelerator, DistributedType
from accelerate.utils import KwargsHandler
from huggingface_hub import constants
from peft import get_peft_model, LoraConfig, TaskType
from pathlib import Path

from dataclasses import dataclass, field
from datetime import timedelta
from typing import Any, Callable, Dict, Iterable, List, Literal, Optional, Tuple, get_args

########################################################################
# This is a fully working simple example to use Accelerate
#
# This example trains a Bert base model on GLUE MRPC
# in any of the following settings (with the same script):
#   - single CPU or single GPU
#   - multi GPUS (using PyTorch distributed mode)
#   - (multi) TPUs
#   - fp16 (mixed-precision) or fp32 (normal precision)
#
# To run it in each of these various modes, follow the instructions
# in the readme for examples:
# https://github.com/huggingface/accelerate/tree/main/examples
#
########################################################################

# MODEL_ID = "meta-llama/Llama-2-7b-chat-hf"
MODEL_ID = "h2oai/h2o-danube3-500m-base"
MAX_GPU_BATCH_SIZE = 16
EVAL_BATCH_SIZE = 16
HF_HOME = constants.HF_HOME
MAX_LENGTH = 512
USE_LORA = True
MAX_STEPS_FOR_PROFILING = 10 

ProfilerActivity = Literal["cpu", "xpu", "mtia", "cuda"]
PROFILE_PATTERN_NAME = "profile_{suffix}.json"

@dataclass
class ProfileKwargs(KwargsHandler):
    """ProfileKwargs for Accelerate. This is an almost exact copy of the profiler in accelerate 0.33.0, with modifications to work with both torch 2.3.1 and 2.0.1
    """

    activities: Optional[List[ProfilerActivity]] = None
    schedule_option: Optional[Dict[str, int]] = None
    on_trace_ready: Optional[Callable] = None
    record_shapes: bool = False
    profile_memory: bool = False
    with_stack: bool = False
    with_flops: bool = False
    with_modules: bool = False
    output_trace_dir: Optional[str] = None

    def _get_profiler_activity(self, activity: ProfilerActivity) -> torch.profiler.ProfilerActivity:
        """Get the profiler activity from the string.

        Args:
            activity (str): The profiler activity name.

        Returns:
            torch.profiler.ProfilerActivity: The profiler activity.
        """

        profiler_activity_map: dict[str, torch.profiler.ProfilerActivity] = {
            "cpu": torch.profiler.ProfilerActivity.CPU,
            "cuda": torch.profiler.ProfilerActivity.CUDA,
        }

        if activity not in profiler_activity_map:
            raise ValueError(f"Invalid profiler activity: {activity}. Must be one of {list(profiler_activity_map)}.")
        return profiler_activity_map[activity]

    def build(self) -> torch.profiler.profile:
        """
        Build a profiler object with the current configuration.

        Returns:
            torch.profiler.profile: The profiler object.
        """
        activities: Optional[List[ProfilerActivity]] = None
        if self.activities is not None:
            activities = [self._get_profiler_activity(activity) for activity in self.activities]
        schedule: Optional[torch.profiler.schedule] = None
        if self.schedule_option is not None:
            schedule = torch.profiler.schedule(**self.schedule_option)

        return torch.profiler.profile(
            activities=activities,
            schedule=schedule,
            on_trace_ready=self.on_trace_ready,
            record_shapes=self.record_shapes,
            profile_memory=self.profile_memory,
            with_stack=self.with_stack,
            with_flops=self.with_flops,
            with_modules=self.with_modules,
        )

class AcceleratorWithProfiler(Accelerator):
    @contextmanager
    def profile(self, profile_handler: ProfileKwargs | None = None):
        """
        ProfileKwargs for Accelerate. This is an almost exact copy of the accelerator.profile() in accelerate 0.33.0, with modifications to work with both torch 2.3.1 and 2.0.1
        """
        profile_handler = profile_handler or self.profile_handler or ProfileKwargs()

        with profile_handler.build() as profiler:
            yield profiler

        if profile_handler.output_trace_dir is None:
            return

        os.makedirs(profile_handler.output_trace_dir, exist_ok=True)
        profiler.export_chrome_trace(
            os.path.join(profile_handler.output_trace_dir, PROFILE_PATTERN_NAME.format(suffix=self.process_index))
        )
        self.wait_for_everyone()

def trace_handler(p):
    output = p.key_averages().table(sort_by="self_cuda_time_total", row_limit=10)
    print(output)

def get_dataloaders(accelerator: Accelerator, batch_size: int = 16):
    """
    Creates a set of `DataLoader`s for the `glue` dataset,
    using "bert-base-cased" as the tokenizer.

    Args:
        accelerator (`Accelerator`):
            An `Accelerator` object
        batch_size (`int`, *optional*):
            The batch size for the train and validation DataLoaders.
    """
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    tokenizer.pad_token = tokenizer.eos_token 
    datasets = load_dataset("glue", "mrpc")

    def tokenize_function(examples):
        outputs = tokenizer(examples["sentence1"], examples["sentence2"], truncation=True, max_length=MAX_LENGTH)
        return outputs

    # Apply the method we just defined to all the examples in all the splits of the dataset
    # starting with the main process first:
    with accelerator.main_process_first():
        tokenized_datasets = datasets.map(
            tokenize_function,
            batched=True,
            remove_columns=["idx", "sentence1", "sentence2"],
        )

    # We also rename the 'label' column to 'labels' which is the expected name for labels by the models of the
    # transformers library
    tokenized_datasets = tokenized_datasets.rename_column("label", "labels")

    def collate_fn(examples):
        # When using mixed precision we want round multiples of 8/16
        if accelerator.mixed_precision == "fp8":
            pad_to_multiple_of = 16
        elif accelerator.mixed_precision != "no":
            pad_to_multiple_of = 8
        else:
            pad_to_multiple_of = None

        return tokenizer.pad(
            examples,
            padding="longest",
            max_length=MAX_LENGTH,
            pad_to_multiple_of=pad_to_multiple_of,
            return_tensors="pt",
        )

    # Instantiate dataloaders.
    train_dataloader = DataLoader(
        tokenized_datasets["train"], shuffle=True, collate_fn=collate_fn, batch_size=batch_size, drop_last=True
    )
    eval_dataloader = DataLoader(
        tokenized_datasets["validation"],
        shuffle=False,
        collate_fn=collate_fn,
        batch_size=EVAL_BATCH_SIZE,
        drop_last=(accelerator.mixed_precision == "fp8"),
    )

    return train_dataloader, eval_dataloader

def training_function(config, args):
    # initialize profiler if needed
    if args.is_profile:
        # Initialize accelerator
        accelerator = AcceleratorWithProfiler(
            cpu=args.cpu, 
            mixed_precision=args.mixed_precision,
        )
    else:
        kwargs_handlers = None
        accelerator = Accelerator(cpu=args.cpu, mixed_precision=args.mixed_precision)
    # Sample hyper-parameters for learning rate, batch size, seed and a few other HPs
    lr = config["lr"]
    num_epochs = int(config["num_epochs"])
    seed = int(config["seed"])
    batch_size = int(config["batch_size"])

    metric = evaluate.load("glue", "mrpc")

    # If the batch size is too big we use gradient accumulation
    gradient_accumulation_steps = 1
    if batch_size > MAX_GPU_BATCH_SIZE and accelerator.distributed_type != DistributedType.TPU:
        gradient_accumulation_steps = batch_size // MAX_GPU_BATCH_SIZE
        batch_size = MAX_GPU_BATCH_SIZE

    set_seed(seed)
    train_dataloader, eval_dataloader = get_dataloaders(accelerator, batch_size)
    # New breaking change for multi-process download: https://github.com/huggingface/transformers/issues/31019#issuecomment-2267696213
    lock_path = Path(f"{HF_HOME}/hub/{MODEL_ID.replace('/', '--')}.lock").expanduser()
    lock_path.parent.mkdir(parents=True, exist_ok=True)
    with FileLock(lock_path):
        from huggingface_hub import snapshot_download
        model_path = snapshot_download(repo_id=MODEL_ID)
    # # Instantiate the model (we build the model here so that the seed also control new weights initialization)
    model = AutoModelForSequenceClassification.from_pretrained(model_path, return_dict=True, attn_implementation="flash_attention_2")
    if USE_LORA:
        lora_config = LoraConfig(task_type=TaskType.SEQ_CLS)
        model = get_peft_model(model, lora_config)
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model.config.pad_token_id = tokenizer.eos_token_id
    # We could avoid this line since the accelerator is set with `device_placement=True` (default value).
    # Note that if you are placing tensors on devices manually, this line absolutely needs to be before the optimizer
    # creation otherwise training will not work on TPU (`accelerate` will kindly throw an error to make us aware of that).
    model = model.to(accelerator.device)
    # Instantiate optimizer
    optimizer = AdamW(params=model.parameters(), lr=lr)

    # Instantiate scheduler
    lr_scheduler = get_linear_schedule_with_warmup(
        optimizer=optimizer,
        num_warmup_steps=100,
        num_training_steps=(len(train_dataloader) * num_epochs) // gradient_accumulation_steps,
    )

    # Prepare everything
    # There is no specific order to remember, we just need to unpack the objects in the same order we gave them to the
    # prepare method.

    model, optimizer, train_dataloader, eval_dataloader, lr_scheduler = accelerator.prepare(
        model, optimizer, train_dataloader, eval_dataloader, lr_scheduler
    )
    # If profiling, just do one epoch for 20 steps
    num_epochs = 1 if args.is_profile else num_epochs
    if args.is_profile:
        profile_kwargs = ProfileKwargs(
            activities=["cpu", "cuda"],
            schedule_option={"wait": 1, "warmup": 1, "active": 3, "repeat": 2, "skip_first": 5}, 
            on_trace_ready=trace_handler,
            output_trace_dir="trace",
        )
        cm = accelerator.profile(profile_kwargs)
    else:
        cm = nullcontext()
    # Now we train the model
    for epoch in range(num_epochs):
        model.train()
        fwd_times = []
        bwd_times = []
        progress_bar = tqdm(range(len(train_dataloader)), disable=not accelerator.is_local_main_process)
        # Wrap in context manager
        with cm as prof:
            for step, batch in enumerate(train_dataloader):
                # We could avoid this line since we set the accelerator with `device_placement=True`.
                batch.to(accelerator.device)
                s = time.time()
                outputs = model(**batch)
                e = time.time()
                fwd_times.append(e - s)
                loss = outputs.loss
                loss = loss / gradient_accumulation_steps
                s = time.time()
                accelerator.backward(loss)
                e = time.time()
                bwd_times.append(e - s)
                if step % gradient_accumulation_steps == 0:
                    optimizer.step()
                    lr_scheduler.step()
                    optimizer.zero_grad()

                if args.is_profile:
                    prof.step()

                progress_bar.update()
                if args.is_profile and step == MAX_STEPS_FOR_PROFILING-1:
                    break
        progress_bar.close()
        accelerator.print(f"Average Forward Time: {sum(fwd_times)/ len(fwd_times)}")
        accelerator.print(f"Average Backward Time: {sum(bwd_times)/ len(bwd_times)}")
        model.eval()
        for step, batch in enumerate(eval_dataloader):
            # We could avoid this line since we set the accelerator with `device_placement=True`.
            batch.to(accelerator.device)
            with torch.no_grad():
                outputs = model(**batch)
            predictions = outputs.logits.argmax(dim=-1)
            predictions, references = accelerator.gather_for_metrics((predictions, batch["labels"]))
            metric.add_batch(
                predictions=predictions,
                references=references,
            )

        eval_metric = metric.compute()
        # Use accelerator.print to print only on the main process.
        accelerator.print(f"epoch {epoch}:", eval_metric)

def main():
    parser = argparse.ArgumentParser(description="Simple example of training script.")
    parser.add_argument(
        "--mixed_precision",
        type=str,
        default=None,
        choices=["no", "fp16", "bf16", "fp8"],
        help="Whether to use mixed precision. Choose"
        "between fp16 and bf16 (bfloat16). Bf16 requires PyTorch >= 1.10."
        "and an Nvidia Ampere GPU.",
    )
    parser.add_argument(
        "--is_profile",
        action="store_true",
        help="Whether to enable CUDA profiling",
    )
    parser.add_argument("--cpu", action="store_true", help="If passed, will train on the CPU.")
    args = parser.parse_args()
    config = {"lr": 2e-5, "num_epochs": 3, "seed": 42, "batch_size": MAX_GPU_BATCH_SIZE}
    training_function(config, args)

if __name__ == "__main__":
    main()

My accelerate config is:

compute_environment: LOCAL_MACHINE
deepspeed_config:
 deepspeed_config_file: llm-forge/configs/deepspeed/zero_3.json
 zero3_init_flag: true
distributed_type: DEEPSPEED
fsdp_config: {}
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
num_machines: 1
num_processes: 8
use_cpu: false

My Zero-3 config is:

{   
    "fp16": {
        "enabled": "auto"
    },
    "bf16": {
        "enabled": true
    },
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": 5e8,
        "stage3_prefetch_bucket_size": 5e8,
        "stage3_param_persistence_threshold": 1e6,
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_16bit_weights_on_model_save": true,
        "round_robin_gradients": true
    },
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "steps_per_print": 10,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false
}

Command to run

Without profiiling

accelerate launch --config_file accelerate_config.yaml min_repro.py --mixed_precision bf16

With profiling

accelerate launch --config_file accelerate_config.yaml min_repro.py --mixed_precision bf16 --is_profile

Expected behavior Training time is expected to stay the same or get better with upgrade.

ds_report output The two environments I've tested with :

torch 2.3.1 + CUDA 12.1 :

[2024-08-05 15:15:16,337] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
fused_adam ............. [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_lion ............... [NO] ....... [OKAY]
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
evoformer_attn ......... [NO] ....... [NO]
fused_lamb ............. [NO] ....... [OKAY]
fused_lion ............. [NO] ....... [OKAY]
inference_core_ops ..... [NO] ....... [OKAY]
cutlass_ops ............ [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
ragged_device_ops ...... [NO] ....... [OKAY]
ragged_ops ............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
 [WARNING]  using untested triton version (2.3.1), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/ray/anaconda3/lib/python3.11/site-packages/torch']
torch version .................... 2.3.1+cu121
deepspeed install path ........... ['/home/ray/anaconda3/lib/python3.11/site-packages/deepspeed']
deepspeed info ................... 0.14.0, unknown, unknown
torch cuda version ............... 12.1
torch hip version ................ None
nvcc version ..................... 12.1
deepspeed wheel compiled w. ...... torch 0.0, cuda 0.0
shared memory (/dev/shm) size .... 373.95 GB

The older environment: torch 2.0.1 + CUDA 11.8:

--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
fused_adam ............. [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_lion ............... [NO] ....... [OKAY]
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
evoformer_attn ......... [NO] ....... [NO]
fused_lamb ............. [NO] ....... [OKAY]
fused_lion ............. [NO] ....... [OKAY]
inference_core_ops ..... [NO] ....... [OKAY]
cutlass_ops ............ [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
ragged_device_ops ...... [NO] ....... [OKAY]
ragged_ops ............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.0
 [WARNING]  using untested triton version (2.0.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/ray/anaconda3/lib/python3.11/site-packages/torch']
torch version .................... 2.0.1+cu117
deepspeed install path ........... ['/home/ray/anaconda3/lib/python3.11/site-packages/deepspeed']
deepspeed info ................... 0.14.0, unknown, unknown
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 11.8
deepspeed wheel compiled w. ...... torch 0.0, cuda 0.0
shared memory (/dev/shm) size .... 373.95 GB

Screenshots

I can add more screenshots if needed, but the training times look as follows for the 500M model used in the script:

After:

With profiling:

Before:

With profiling:

Note that these training times are small because of the small model and context length used. You can increase context length, increase model size or switch to full param and you should still be able to reproduce this. (and then it becomes less tolerable) . Please let me know if you can't.

System info (please complete the following information):

OS: 20.04.6 LTS (Focal Fossa)
GPU count and types: 8xA10s (A g5.48xlarge node on AWS)
Interconnects (if applicable) [e.g., two machines connected with 100 Gbps IB]
Python version: 3.11.9
Any other relevant info about your setup: I'm using flash attention, and Accelerate. I've kept flash attention versions the same (2.5.9.post1)

Launcher context HuggingFace Accelerate

Additional context I can repro with the latest deepspeed version- 14.4 as well. I've given a min repro code with profiler so that it's easy to dive into the specifics of what's happening.

microsoft / DeepSpeed

[BUG] Training time regression with ZeRO-3 after upgrade to torch 2.3.1 and CUDA 12.1 #5844