microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

[BUG] zero3 hang during inference, need to detach part of computational graph, .detach()/torch.no_grad do not work. #6438

Open orrzohar opened 2 months ago

orrzohar commented 2 months ago

Describe the bug I am training a video-LLM model, where I encode long videos with a varying number of forward passes to avoid OOM issues. I would like to use ZeRO-3, but applying a part of the model a different number of times causes the computational graph to differ across nodes/GPUs, and ZeRO-3 hangs.

I don't need to compute gradients for the video encoder and would like to remove it from the computation graph entirely. I have tried: (1) freezing the encoder, (2) applying the encoder under @torch.no_grad(), and (3) calling .detach() on the output tensors, but to no avail.
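
For reference, attempt (1) looked roughly like this (a minimal sketch; the attribute names follow the repro code below):

# Attempt (1): freeze the encoder so it never receives gradients
# ("model" and "clip_model" are the names used in the repro below).
for p in model.clip_model.parameters():
    p.requires_grad_(False)
model.clip_model.eval()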

How can I effectively 'remove' a part of the model from the computational graph, if at all possible? I can't pre-encode entire videos, as this is too memory-heavy for my setup.

To Reproduce Steps to reproduce the behavior: take any model and apply some part of it a varying number of times, e.g.:

import torch
import torch.nn as nn
from transformers import Trainer, TrainingArguments
import clip

class VideoToEmbeddingModel(nn.Module):
    def __init__(self, clip_model_name="ViT-B/32", mlp_input_dim=512, mlp_hidden_dim=128, mlp_output_dim=8):
        super().__init__()
        self.clip_model, _ = clip.load(clip_model_name)
        self.mlp = nn.Sequential(
            nn.Linear(mlp_input_dim, mlp_hidden_dim),
            nn.ReLU(),
            nn.Linear(mlp_hidden_dim, mlp_output_dim)
        )

    def forward(self, video_frames):
        # Initialize list to store encoded frames
        encoded_frames = []

        # Loop through video frames
        for frame in video_frames:
            # Encode frame without gradients
            with torch.no_grad():
                encoded_frame = self.clip_model.encode_image(frame)
            # Append encoded frame to list
            encoded_frames.append(encoded_frame)

        # Average encoded frames
        averaged_embedding = torch.stack(encoded_frames).mean(dim=0)

        # Detach output tensor from computational graph
        detached_embedding = averaged_embedding.detach()

        # Pass the detached embedding through the MLP
        output = self.mlp(detached_embedding)
        return output

# Example usage:
model = VideoToEmbeddingModel()

# Define training arguments
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    evaluation_strategy="epoch",
    learning_rate=5e-5,
    save_total_limit=2,
    save_steps=500,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    greater_is_better=True,
    save_on_each_node=True,
    fp16=True,
    deepspeed="ds_config.json",  # Your DeepSpeed config file
)

# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=your_train_dataset,  # Your training dataset
    eval_dataset=your_eval_dataset,  # Your evaluation dataset
    compute_metrics=lambda pred: {"accuracy": torch.sum(pred.label_ids == pred.predictions.argmax(-1))},
)

# Train the model
trainer.train()

Use this ZeRO-3 JSON config:

{
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "bf16": {
        "enabled": "auto"
    },
    "train_micro_batch_size_per_gpu": "auto",
    "train_batch_size": "auto",
    "gradient_accumulation_steps": "auto",
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_16bit_weights_on_model_save": true
    }
}

Expected behavior I would like to find a way to still encode long videos while using ZeRO-3, since ZeRO-3 becomes very important when I train larger LLMs.

ds_report output NCCL/hanging; the ds_report is provided in a comment below.

Launcher context Using the deepspeed launcher with a hostfile.

loadams commented 2 months ago

Hi @orrzohar - can you please share the DeepSpeed version you are using, as well as the ds_report output? That will tell us more about your DeepSpeed install.

tohtana commented 1 month ago

@orrzohar For clarification, does video_frames have a different number of frames on different ranks in your repro? Also, what are the sizes of clip_model and the remaining part? If clip_model is small enough and only mlp needs ZeRO-3, we can separate them into two models.
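
Roughly something like this (a minimal sketch with assumed names based on your repro, not a tested implementation):

import torch
import torch.nn as nn
import clip

class TrainablePart(nn.Module):
    # Only this module would be handed to the Trainer / DeepSpeed engine,
    # so ZeRO-3 partitions just the MLP and never hooks the CLIP encoder.
    def __init__(self, in_dim=512, hidden=128, out_dim=8):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, frame_embeddings):
        return self.mlp(frame_embeddings)

# The frozen encoder stays a plain PyTorch module outside the engine: it is
# replicated on every rank and run under no_grad, so a varying number of
# forward passes no longer issues mismatched ZeRO-3 all-gathers.
encoder, _ = clip.load("ViT-B/32")
encoder.eval().requires_grad_(False)

def encode_video(video_frames):
    with torch.no_grad():
        feats = [encoder.encode_image(f) for f in video_frames]
    return torch.stack(feats).mean(dim=0)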

orrzohar commented 1 month ago

Hi @loadams , @tohtana,

deepspeed==0.13.5 installed via pyproject.toml:

dependencies = [
    "tokenizers==0.19.1", "sentencepiece==0.1.99", "shortuuid",
    "accelerate==0.33.0", "peft", "bitsandbytes",
    "pydantic", "markdown2[all]", "numpy", "scikit-learn==1.2.2",
    "gradio", "gradio_client==0.8.1", "easydict",
    "requests", "httpx==0.24.0", "uvicorn", "fastapi",
    "einops==0.6.1", "einops-exts==0.0.4", "timm==0.6.13",
    "fairscale", "decord", "opencv-python", "chardet",
    "datasets==2.16.1", "openai==1.8.0", "webdataset==0.2.86",
    "transformers==4.44.0", "ezcolorlog", "pytorchvideo",
    "s2wrapper@git+https://github.com/bfshi/scaling_on_scales"
]

[project.optional-dependencies]
train = ["deepspeed==0.13.5", "ninja", "wandb", "ipdb"]

ds_report:

[2024-09-10 08:07:30,994] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
fused_adam ............. [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_lion ............... [NO] ....... [OKAY]
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
evoformer_attn ......... [NO] ....... [NO]
fused_lamb ............. [NO] ....... [OKAY]
fused_lion ............. [NO] ....... [OKAY]
inference_core_ops ..... [NO] ....... [OKAY]
cutlass_ops ............ [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
ragged_device_ops ...... [NO] ....... [OKAY]
ragged_ops ............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.1
 [WARNING]  using untested triton version (2.1.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
--------------------------------------------------
No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'
DeepSpeed general environment info:
torch install path ............... ['miniconda3/envs/<env_name>/lib/python3.10/site-packages/torch']
torch version .................... 2.1.0+cu121
deepspeed install path ........... ['miniconda3/envs/<env_name>/lib/python3.10/site-packages/deepspeed']
deepspeed info ................... 0.13.5, unknown, unknown
torch cuda version ............... 12.1
torch hip version ................ None
nvcc version ..................... 12.1
deepspeed wheel compiled w. ...... torch 2.3, cuda 12.1
shared memory (/dev/shm) size .... 373.87 GB

The CLIP model is typically ~300M parameters, the MLP is ~100M-300M, and the LLM is ~1.5-7B parameters. Ideally, I would like to somehow detach the vision encoder from the computational graph so I can use ZeRO-3 on the LLM/connector.
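
As a sanity check (a sketch; ZeRO-3 tags the parameters it partitions with attributes such as ds_id), the following shows whether a submodule is still ZeRO-3-managed. This is also why no_grad()/.detach() alone do not help: they only affect activations, while the hang comes from the parameter all-gathers issued on every forward.

def is_zero3_managed(module):
    # ZeRO-3 converts parameters in place and attaches ds_id/ds_status,
    # so any such attribute means each forward of this module still
    # triggers collective parameter gathering.
    return any(hasattr(p, "ds_id") for p in module.parameters())

print("clip_model ZeRO-3 managed:", is_zero3_managed(model.clip_model))
print("mlp ZeRO-3 managed:", is_zero3_managed(model.mlp))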

I will also note that this is based on the LLaVA codebase and therefore uses a Hugging Face Trainer wrapper.