mosaicml / composer

Supercharge Your Model Training
http://docs.mosaicml.com
Apache License 2.0

Error when using `Trainer.compile_config={}` in DDP mode #3227

Open Ghelfi opened 6 months ago

Ghelfi commented 6 months ago

Training a toy example in DDP mode with the Composer runtime, using both torch.compile (through Trainer.compile_config={}) and the BlurPool algorithm, raises a dynamo error.

To reproduce: from develop, on a 2-GPU environment.

Code:

from composer import Trainer
from composer.algorithms import ChannelsLast, CutMix, LabelSmoothing, BlurPool
from composer.core import DataSpec
from composer.models import ComposerClassifier
from composer.utils import dist
import torch
import torch.nn as nn
import torchvision 
from torchvision import datasets, transforms

# Define Model
num_classes: int = 10
resnet = torchvision.models.resnet18()
resnet.fc = nn.Linear(512, num_classes)
model = ComposerClassifier(module=resnet, num_classes=num_classes)

# Normalization constants
mean = (0.507, 0.487, 0.441)
std = (0.267, 0.256, 0.276)
batch_size = 1024
cifar10_transforms = transforms.Compose([transforms.ToTensor(), transforms.Normalize(mean, std)])

# Download Data
data_directory = "./data"
train_dataset = datasets.CIFAR10(data_directory, train=True, download=True, transform=cifar10_transforms)

# Build DataSpec
train_dataloader = torch.utils.data.DataLoader(
    train_dataset, batch_size=batch_size, sampler=dist.get_sampler(train_dataset, drop_last=True, shuffle=True)
)
# Note: `spec` is built here but the Trainer below receives `train_dataloader` directly.
spec = DataSpec(train_dataloader, device_transforms=None, get_num_samples_in_batch=lambda batch: len(batch[0]))

trainer = Trainer(
    model=model,
    train_dataloader=train_dataloader,
    max_duration="2ep",
    algorithms=[BlurPool(), LabelSmoothing(smoothing=0.1), CutMix(alpha=1.0), ChannelsLast()],
    compile_config={},
)
trainer.fit()

Steps to reproduce the behavior:

  1. Install Composer from dev.
  2. Run `composer -n 2 example.py` (see code above).

Dynamo Error:

[rank0]: torch._dynamo.exc.BackendCompilerFailed: backend='compile_fn' raised:
[rank0]: AttributeError: 'Conv2d' object has no attribute 'requires_grad'

[rank0]: Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information

It works if I remove either DDP, BlurPool, or torch.compile.
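
The error message suggests setting TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 to get more detail. A minimal sketch of one way to do that from inside example.py follows. The helper name `enable_dynamo_debug_logging` is hypothetical (it is not part of Composer or PyTorch), and the sketch assumes it is called before `import torch`, so that the variables are already set when PyTorch initializes its logging.

```python
import os


def enable_dynamo_debug_logging(env=os.environ):
    """Set the two variables named in the dynamo error message.

    Hypothetical helper for illustration only. It is meant to be called
    at the very top of the script, before `import torch`.
    """
    env["TORCH_LOGS"] = "+dynamo"      # verbose dynamo logging
    env["TORCHDYNAMO_VERBOSE"] = "1"   # fuller dynamo error traces
    return env


enable_dynamo_debug_logging()
```

Setting the same two variables in the shell that runs `composer -n 2 example.py` should have the same effect, since each rank executes the script from the top.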

mvpatel2000 commented 6 months ago

@Skylion007 do you think this is a torch error or something we can do differently?

mvpatel2000 commented 6 months ago

@Ghelfi do you know if this works for you elsewhere, e.g. if you compile outside Composer? That will help us narrow down whether it's a Composer issue or a PyTorch issue, as the trace looks more like a PyTorch issue to me.

Ghelfi commented 6 months ago

This is not clear to me. The example provided above works if you remove the BlurPool algorithm, which exists only on the Composer side.

I'll try to redefine some model layer before feeding it to the trainer to mimic the behaviour outside of any composer scope.

Ghelfi commented 5 months ago

On torch 2.3, adding torch._dynamo.config.optimize_ddp = False at the start of the file seems to fix it.
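
A minimal sketch of the workaround described above, assuming a PyTorch 2.x install where `torch._dynamo.config.optimize_ddp` exists. The helper name `disable_ddp_optimizer` is hypothetical and is not part of Composer or PyTorch; it only sets the flag named in this comment and reports whether it could do so.

```python
def disable_ddp_optimizer() -> bool:
    """Turn off dynamo's DDP-specific graph handling, if dynamo is available.

    Hypothetical helper for illustration only. Returns True when the flag
    was set, False when torch._dynamo could not be imported.
    """
    try:
        import torch._dynamo
    except ImportError:
        # PyTorch is missing or too old to ship dynamo; nothing to change.
        return False
    # The flag reported to avoid the error on torch 2.3.
    torch._dynamo.config.optimize_ddp = False
    return True


applied = disable_ddp_optimizer()
```

Calling this at the top of example.py, before the Trainer is constructed, corresponds to "at the start of the file". As far as I understand, this flag controls how dynamo splits compiled graphs around DDP gradient buckets, so disabling it may reduce communication and compute overlap; treat it as a workaround rather than a fix.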

I am also having issues with DDP and torch.compile in other contexts. I'll keep investigating.