Describe the bug
When running a simple model that contains torch.nn.LayerNorm with DeepSpeed ZeRO-3, torch.compile, and compiled_autograd, the following error occurs:

```
site-packages/torch/_subclasses/fake_tensor.py:2017] RuntimeError: Attempting to broadcast a dimension of length 0 at -1! Mismatching argument at index 1 had torch.Size([0]); but expected shape should be broadcastable to [100, 120]
```
We first hit this error in a BERT model trained with DeepSpeed ZeRO-3, torch.compile, and compiled_autograd. Observations:

- DeepSpeed ZeRO-1/2 with torch.compile and compiled_autograd works fine.
- DeepSpeed ZeRO-3 with torch.compile but without compiled_autograd works fine.
- DeepSpeed ZeRO-3 with torch.compile produces many graph breaks and recompiles.

To simplify the issue, I made a small reproducer that isolates the failing op (torch.nn.LayerNorm).
Expected behavior
The model runs with DeepSpeed ZeRO-3 without error.
Investigation
The error is: "RuntimeError: Attempting to broadcast a dimension of length 0 at -1! Mismatching argument at index 1 had torch.Size([0]); but expected shape should be broadcastable to [128, 128, 1600]"
It occurs when compiled autograd traces the backward graph, inside the LayerNorm backward decomposition (native_layer_norm_backward), which tries to broadcast weight_cast (shape torch.Size([0])) to grad_out_cast's shape ([128, 128, 1600]) and fails:
```python
if weight_cast is not None:
    grad_x_hat = grad_out_cast * weight_cast
```
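For intuition, here is a minimal eager-mode sketch of the same shape mismatch (an illustration only; the real failure is raised by FakeTensor while compiled autograd traces the backward, so the exact message differs). Under ZeRO-3 a partitioned parameter's local storage is an empty placeholder, which is presumably why the traced decomposition sees a weight of shape torch.Size([0]) instead of [120]:

```python
import torch

# Shapes from the small reproducer: grad_out is [100, 120], while the
# ZeRO-3-partitioned LayerNorm weight is recorded as an empty tensor.
grad_out_cast = torch.randn(100, 120)
weight_cast = torch.empty(0)  # stands in for the partitioned (not-yet-gathered) weight

# Same multiplication as in the decomposition above; broadcasting a length-0
# dimension against 120 fails, mirroring the reported error.
grad_x_hat = grad_out_cast * weight_cast
```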
If the LayerNorm weight is bypassed by using nn.LayerNorm(120, eps=1e-12, elementwise_affine=False) instead of elementwise_affine=True in deepspeed_reproducer_cpu.py, the run completes without error.
System info:
- OS: Ubuntu 22.04
- GPU: none (the issue is device-independent, so we reproduce on CPU)
- Python: 3.10.12
- PyTorch: 2.5.1
- DeepSpeed: 0.15.3
To Reproduce
Steps to reproduce the behavior:
1. Set an environment variable for more verbose logs: `TORCH_LOGS="+dynamo,graph,graph_code,graph_breaks,recompiles,aot_graphs,aot_joint_graph,compiled_autograd_verbose"`
2. Run `deepspeed --num_nodes 1 --num_gpus 1 deepspeed_reproducer_cpu.py`. You can use `--num_gpus 2/4/8` for multi-card runs.
Below is `deepspeed_reproducer_cpu.py`:

```python
import torch
import torchvision
import torchvision.transforms as transforms
import torch.distributed as dist
import deepspeed
from deepspeed.accelerator import get_accelerator
from tqdm import tqdm
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim


class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(32 * 32 * 3, 120)
        self.fc2 = nn.Linear(120, 10)
        self.LayerNorm1 = nn.LayerNorm(120, eps=1e-12, elementwise_affine=True)

    def forward(self, x):
        x = torch.flatten(x, 1)  # flatten all dimensions except batch
        x = F.relu(self.fc1(x))
        x = self.LayerNorm1(x)
        x = self.fc2(x)
        return x


compile_kwargs = {"dynamic": False}
device = torch.device("cpu")

model = Net()
model.to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
model_engine, optimizer, *_ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    optimizer=optimizer,
    config="./deepspeed_config.json",
)

# torch_compile
model_engine.compile(compile_kwargs=compile_kwargs)

# dataset
transform = transforms.Compose(
    [transforms.ToTensor(), transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))]
)
batch_size = 100
trainset = torchvision.datasets.CIFAR10(
    root="./DATA/CIFAR10", train=True, download=True, transform=transform
)

# process dataset
trainloader = DataLoader(
    trainset,
    batch_size=batch_size,
    sampler=DistributedSampler(trainset, shuffle=True),
    num_workers=16,
    pin_memory=True,
)
progress_bar = tqdm(
    total=len(trainloader),
    desc="Training 1/1 epoch",
    position=0,
    leave=True,
    disable=dist.is_initialized() and dist.get_rank() != 0,
)

for epoch in range(100):
    with torch._dynamo.compiled_autograd.enable(
        torch.compile(backend=get_accelerator().get_compile_backend(), **compile_kwargs)
    ):
        running_loss = 0.0
        for i, data in enumerate(trainloader, 0):
            inputs, labels = data
            inputs, labels = inputs.to(device), labels.to(device)
            optimizer.zero_grad()
            # forward/backward/step (standard DeepSpeed training step, assumed
            # here; the error fires when the compiled backward runs)
            outputs = model_engine(inputs)
            loss = criterion(outputs, labels)
            model_engine.backward(loss)
            model_engine.step()
            running_loss += loss.item()
            progress_bar.update(1)

print("Finished Training")
```
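The script reads `./deepspeed_config.json`, which is not included above. A minimal ZeRO-3 config along the following lines should exercise the same path (the exact values used in the original run are not shown, so treat these as assumptions; `train_micro_batch_size_per_gpu` must match the DataLoader `batch_size` of 100, and `train_batch_size` scales with the number of ranks):

```json
{
  "train_batch_size": 100,
  "train_micro_batch_size_per_gpu": 100,
  "gradient_accumulation_steps": 1,
  "zero_optimization": {
    "stage": 3
  }
}
```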
Hi @tohtana, I have tried setting `stage3_param_persistence_threshold` to zero, but it does not seem to help; the error still occurs.
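For reference, a sketch of how that experiment maps onto the config (assuming the JSON config file shown above; as noted, the setting did not change the outcome):

```json
"zero_optimization": {
  "stage": 3,
  "stage3_param_persistence_threshold": 0
}
```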
I also opened an issue in PyTorch.