pytorch / torchdynamo

A Python-level JIT compiler designed to make unmodified PyTorch programs faster.
BSD 3-Clause "New" or "Revised" License

composer trainer dynamo errors #887

Closed msaroufim closed 2 years ago

msaroufim commented 2 years ago

This one was strange: even though I see TypeError and NotImplemented errors in the logs, the training did not stop. Should they be warnings instead?

Composer is an interesting training library focused on performance, and I believe they have some of the fastest implementations of PyTorch algorithms (https://www.mosaicml.com/blog/mlperf-2022 ). If we solve this, I think we could see dynamo mentioned in MLPerf.

Repro

pip install mosaicml

from torch.utils.data import DataLoader
from torchvision import datasets, transforms

from composer import Trainer
from composer.algorithms import ChannelsLast, CutMix, LabelSmoothing
from composer.models import mnist_model
import torchdynamo

transform = transforms.Compose([transforms.ToTensor()])
train_dataset = datasets.MNIST("data", download=True, train=True, transform=transform)
eval_dataset = datasets.MNIST("data", download=True, train=False, transform=transform)
train_dataloader = DataLoader(train_dataset, batch_size=128)
eval_dataloader = DataLoader(eval_dataset, batch_size=128)

trainer = Trainer(
    model=mnist_model(),
    train_dataloader=train_dataloader,
    eval_dataloader=eval_dataloader,
    max_duration="2ep",
    algorithms=[
        ChannelsLast(),
        CutMix(alpha=1.0),
        LabelSmoothing(smoothing=0.1),
    ]
)

with torchdynamo.optimize("eager"):
    trainer.fit()

Logs

https://gist.github.com/msaroufim/c74daa1f11d1edf8e592c1229bfc1cdc

ezyang commented 2 years ago

An error followed by training continuing sounds like dynamo successfully fell back to eager.
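To illustrate the fallback behavior described here, a minimal sketch follows. It is not torchdynamo's actual implementation: `with_eager_fallback`, `broken_compiler`, and `step` are hypothetical names, and the "compiler" is a stand-in that always fails.

```python
def with_eager_fallback(fn, compile_fn):
    """Return compile_fn(fn), or fn itself if compilation raises.

    Hypothetical helper for illustration only; a real compiler would
    typically fall back per frame or per graph, not per function.
    """
    try:
        return compile_fn(fn)
    except Exception:
        # Compilation failed: keep running the original (eager) function.
        return fn


def broken_compiler(fn):
    # Stand-in for a backend that hits an unsupported construct.
    raise NotImplementedError("unsupported op")


def step(x):
    return x * 2


safe_step = with_eager_fallback(step, broken_compiler)
result = safe_step(21)  # runs the original function, so training continues
```

The point is that the compile-time error and the run-time result are decoupled: the error is real, but the program still produces the eager answer.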

msaroufim commented 2 years ago

Hmm, should we catch those exceptions and log them as warnings then? My first instinct when I saw an error with a stack trace was to stop the training job. It is especially bad UX when the error lands in between the training progress bars, as in https://gist.github.com/msaroufim/c74daa1f11d1edf8e592c1229bfc1cdc#file-gistfile1-txt-L7868-L7897

anijain2305 commented 2 years ago

I liked @wconstab's idea that we should emit single-line warnings in the customer mode. Maybe we can have a separate logging level for that. @mlazos might already be thinking about this.

mlazos commented 2 years ago

Yeah, I liked Will's idea for single-line warnings. I can hide the current pages of errors behind a verbose option or filter.
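A rough sketch of what single-line warnings with a verbose option could look like, using only Python's standard `logging` module. The logger name `compile_sketch` and the helper `report_fallback` are assumptions for illustration, not torchdynamo's logging API.

```python
import logging

# Hypothetical logger name; not the logger torchdynamo actually uses.
log = logging.getLogger("compile_sketch")


def report_fallback(exc, where):
    """Report a compile failure as one warning line; traceback only at DEBUG."""
    # Default view: a single line naming the exception type and message.
    log.warning("falling back to eager in %s: %s: %s",
                where, type(exc).__name__, exc)
    # Verbose view: the full stack trace, emitted only if DEBUG is enabled.
    log.debug("full traceback for fallback in %s", where, exc_info=exc)


logging.basicConfig(level=logging.WARNING)  # switch to DEBUG for verbose output

try:
    raise TypeError("bad arg")
except TypeError as err:
    report_fallback(err, "forward")
```

With the level left at WARNING, the progress bars would be interrupted by one line per failure. Setting the level to DEBUG restores the full stack traces for debugging.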

mlazos commented 2 years ago

@msaroufim I don't see any errors when running it with torchdynamo main anymore. Can you confirm it passes for you?