Cifar10 example: very slow training speed on P100 GPU

Vandermode commented 2 years ago

Hello, as the title suggested, I run the cifar example code on a tesla-p100 gpu, but got a very slow training speed (784.29492) compared to the official stated one (36 s). Anything wrong in my setting? I don't think the p100 gpu is 20x slower than a100.

Below is the command line output (directly copy & paste and don't format it, sorry for that)

(ffcv) ➜ ffcv python train_cifar10.py --config-file default_config.yaml ┌ Arguments defined────────┬───────────────────────────┐ │ Parameter │ Value │ ├──────────────────────────┼───────────────────────────┤ │ training.lr │ 0.5 │ │ training.epochs │ 24 │ │ training.lr_peak_epoch │ 5 │ │ training.batch_size │ 512 │ │ training.momentum │ 0.9 │ │ training.weight_decay │ 0.0005 │ │ training.label_smoothing │ 0.1 │ │ training.num_workers │ 4 │ │ training.lr_tta │ True │ │ data.train_dataset │ ../data/cifar_train.beton │ │ data.val_dataset │ ../data/cifar_test.beton │ └──────────────────────────┴───────────────────────────┘ 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 97/97 [01:04<00:00, 1.51it/s] 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 97/97 [00:30<00:00, 3.16it/s] 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 97/97 [00:30<00:00, 3.16it/s] 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 97/97 [00:30<00:00, 3.16it/s] 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 97/97 [00:30<00:00, 3.16it/s] 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 97/97 [00:30<00:00, 3.16it/s] 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 97/97 [00:30<00:00, 3.16it/s] 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 97/97 [00:30<00:00, 3.16it/s] 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 97/97 [00:30<00:00, 3.16it/s] 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 97/97 [00:30<00:00, 3.16it/s] 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 97/97 [00:30<00:00, 3.16it/s] 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 97/97 [00:30<00:00, 3.16it/s] 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 97/97 [00:30<00:00, 3.16it/s] 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 97/97 [00:30<00:00, 3.16it/s] 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 97/97 [00:30<00:00, 3.16it/s] 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 97/97 [00:30<00:00, 3.16it/s] 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 97/97 [00:30<00:00, 3.16it/s] 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 97/97 [00:30<00:00, 3.16it/s] 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 97/97 [00:30<00:00, 3.16it/s] 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 97/97 [00:30<00:00, 3.16it/s] 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 97/97 [00:30<00:00, 3.16it/s] 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 97/97 [00:30<00:00, 3.16it/s] 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 97/97 [00:30<00:00, 3.16it/s] 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 97/97 [00:30<00:00, 3.16it/s] Total time: 784.29492 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 97/97 [00:11<00:00, 8.59it/s] train accuracy: 98.0% 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 20/20 [00:04<00:00, 4.07it/s] test accuracy: 92.1%

GuillaumeLeclerc commented 2 years ago

Hello,

Thanks for reporting! This is definitely not normal. The first thing I would try to troubleshoot is change the code and remove the body of the training loop to see if comes from the model training or data loading.

Keep us updated :)

Vandermode commented 2 years ago

I comment the model training part and keep the data loading. The resulting time is 54.01541. It looks like both model training and data loading are very slow.

███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 97/97 [00:00<00:00, 182.94it/s] Total time: 54.01541 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 97/97 [00:13<00:00, 7.05it/s] train accuracy: 10.9% 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 20/20 [00:03<00:00, 5.67it/s] test accuracy: 11.6%

GuillaumeLeclerc commented 2 years ago

Let's focus on the data loading part first. Keep the body of the training loop empty and remove the two ToDevice() operations in the loading pipeline, does it help ?

Vandermode commented 2 years ago

Following your suggestion, ridiculously, the speed is even slower. FYI, my computing device is based upon a HPC-GPU cluster at Princeton.

100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 97/97 [00:02<00:00, 40.68it/s] Total time: 104.98336

Vandermode commented 2 years ago

The code I used

from argparse import ArgumentParser
from typing import List
import time
import numpy as np
from tqdm import tqdm

import torch as ch
from torch.cuda.amp import GradScaler, autocast
from torch.nn import CrossEntropyLoss, Conv2d, BatchNorm2d
from torch.optim import SGD, lr_scheduler
import torchvision

from fastargs import get_current_config, Param, Section
from fastargs.decorators import param
from fastargs.validation import And, OneOf

from ffcv.fields import IntField, RGBImageField
from ffcv.fields.decoders import IntDecoder, SimpleRGBImageDecoder
from ffcv.loader import Loader, OrderOption
from ffcv.pipeline.operation import Operation
from ffcv.transforms import RandomHorizontalFlip, Cutout, \
    RandomTranslate, Convert, ToDevice, ToTensor, ToTorchImage
from ffcv.transforms.common import Squeeze
from ffcv.writer import DatasetWriter

Section('training', 'Hyperparameters').params(
    lr=Param(float, 'The learning rate to use', required=True),
    epochs=Param(int, 'Number of epochs to run for', required=True),
    lr_peak_epoch=Param(int, 'Peak epoch for cyclic lr', required=True),
    batch_size=Param(int, 'Batch size', default=512),
    momentum=Param(float, 'Momentum for SGD', default=0.9),
    weight_decay=Param(float, 'l2 weight decay', default=5e-4),
    label_smoothing=Param(float, 'Value of label smoothing', default=0.1),
    num_workers=Param(int, 'The number of workers', default=8),
    lr_tta=Param(bool, 'Test time augmentation by averaging with horizontally flipped version', default=True)
)

Section('data', 'data related stuff').params(
    train_dataset=Param(str, '.dat file to use for training', required=True),
    val_dataset=Param(str, '.dat file to use for validation', required=True),
)

@param('data.train_dataset')
@param('data.val_dataset')
@param('training.batch_size')
@param('training.num_workers')
def make_dataloaders(train_dataset=None, val_dataset=None, batch_size=None, num_workers=None):
    paths = {
        'train': train_dataset,
        'test': val_dataset

    }

    start_time = time.time()
    CIFAR_MEAN = [125.307, 122.961, 113.8575]
    CIFAR_STD = [51.5865, 50.847, 51.255]
    loaders = {}

    for name in ['train', 'test']:
        label_pipeline: List[Operation] = [IntDecoder(), ToTensor(), Squeeze()]
        # label_pipeline: List[Operation] = [IntDecoder(), ToTensor(), ToDevice('cuda:0'), Squeeze()]
        image_pipeline: List[Operation] = [SimpleRGBImageDecoder()]
        if name == 'train':
            image_pipeline.extend([
                RandomHorizontalFlip(),
                RandomTranslate(padding=2, fill=tuple(map(int, CIFAR_MEAN))),
                Cutout(4, tuple(map(int, CIFAR_MEAN))),
            ])
        image_pipeline.extend([
            ToTensor(),
            # ToDevice('cuda:0', non_blocking=True),
            ToTorchImage(),
            Convert(ch.float16),
            torchvision.transforms.Normalize(CIFAR_MEAN, CIFAR_STD),
        ])

        ordering = OrderOption.RANDOM if name == 'train' else OrderOption.SEQUENTIAL

        loaders[name] = Loader(paths[name], batch_size=batch_size, num_workers=num_workers,
                               order=ordering, drop_last=(name == 'train'),
                               pipelines={'image': image_pipeline, 'label': label_pipeline})

    return loaders, start_time

# Model (from KakaoBrain: https://github.com/wbaek/torchskeleton)
class Mul(ch.nn.Module):
    def __init__(self, weight):
       super(Mul, self).__init__()
       self.weight = weight
    def forward(self, x): return x * self.weight

class Flatten(ch.nn.Module):
    def forward(self, x): return x.view(x.size(0), -1)

class Residual(ch.nn.Module):
    def __init__(self, module):
        super(Residual, self).__init__()
        self.module = module
    def forward(self, x): return x + self.module(x)

def conv_bn(channels_in, channels_out, kernel_size=3, stride=1, padding=1, groups=1):
    return ch.nn.Sequential(
            ch.nn.Conv2d(channels_in, channels_out, kernel_size=kernel_size,
                         stride=stride, padding=padding, groups=groups, bias=False),
            ch.nn.BatchNorm2d(channels_out),
            ch.nn.ReLU(inplace=True)
    )

def construct_model():
    num_class = 10
    model = ch.nn.Sequential(
        conv_bn(3, 64, kernel_size=3, stride=1, padding=1),
        conv_bn(64, 128, kernel_size=5, stride=2, padding=2),
        Residual(ch.nn.Sequential(conv_bn(128, 128), conv_bn(128, 128))),
        conv_bn(128, 256, kernel_size=3, stride=1, padding=1),
        ch.nn.MaxPool2d(2),
        Residual(ch.nn.Sequential(conv_bn(256, 256), conv_bn(256, 256))),
        conv_bn(256, 128, kernel_size=3, stride=1, padding=0),
        ch.nn.AdaptiveMaxPool2d((1, 1)),
        Flatten(),
        ch.nn.Linear(128, num_class, bias=False),
        Mul(0.2)
    )
    model = model.to(memory_format=ch.channels_last).cuda()
    return model

@param('training.lr')
@param('training.epochs')
@param('training.momentum')
@param('training.weight_decay')
@param('training.label_smoothing')
@param('training.lr_peak_epoch')
def train(model, loaders, lr=None, epochs=None, label_smoothing=None,
          momentum=None, weight_decay=None, lr_peak_epoch=None):
    opt = SGD(model.parameters(), lr=lr, momentum=momentum, weight_decay=weight_decay)
    iters_per_epoch = len(loaders['train'])
    # Cyclic LR with single triangle
    lr_schedule = np.interp(np.arange((epochs+1) * iters_per_epoch),
                            [0, lr_peak_epoch * iters_per_epoch, epochs * iters_per_epoch],
                            [0, 1, 0])
    scheduler = lr_scheduler.LambdaLR(opt, lr_schedule.__getitem__)
    scaler = GradScaler()
    loss_fn = CrossEntropyLoss(label_smoothing=label_smoothing)

    for _ in range(epochs):
        for ims, labs in tqdm(loaders['train']):
            pass
            # opt.zero_grad(set_to_none=True)
            # with autocast():
            #     out = model(ims)
            #     loss = loss_fn(out, labs)

            # scaler.scale(loss).backward()
            # scaler.step(opt)
            # scaler.update()
            # scheduler.step()

@param('training.lr_tta')
def evaluate(model, loaders, lr_tta=False):
    model.eval()
    with ch.no_grad():
        for name in ['train', 'test']:
            total_correct, total_num = 0., 0.
            for ims, labs in tqdm(loaders[name]):
                with autocast():
                    out = model(ims)
                    if lr_tta:
                        out += model(ch.fliplr(ims))
                    total_correct += out.argmax(1).eq(labs).sum().cpu().item()
                    total_num += ims.shape[0]
            print(f'{name} accuracy: {total_correct / total_num * 100:.1f}%')

if __name__ == "__main__":
    config = get_current_config()
    parser = ArgumentParser(description='Fast CIFAR-10 training')
    config.augment_argparse(parser)
    # Also loads from args.config_path if provided
    config.collect_argparse_args(parser)
    config.validate(mode='stderr')
    config.summary()

    loaders, start_time = make_dataloaders()
    model = construct_model()
    train(model, loaders)
    print(f'Total time: {time.time() - start_time:.5f}')
    # evaluate(model, loaders)

GuillaumeLeclerc commented 2 years ago

Thank you that's exactly the modifications I had in mind. This is truly crazy. Are all the epochs this slow or just the first one ?

Vandermode commented 2 years ago

All the epochs as I can tell. 182.94 iteration / second -> 40.68it/s

GuillaumeLeclerc commented 2 years ago

182sec per epoch ?

I think the next step would be to try to identify:

Which pipeline is slow. You can disable one or the other by passing None instead of an array of operation
Which operation is slow, Removing operations in the pipeline until it's not slow anymore.

This way we will be able to narrow down the problem.

Vandermode commented 2 years ago

Sorry for the confusing clarification. The time info is: 1) unmodified full setting: 784s; 2) + keep the body of the training loop empty: 54s ; 3) from 2) + remove the two ToDevice() operations: 104s

Vandermode commented 2 years ago

I have no time to further dive into the code segments. So I leave the question open, and hope it could be addressed soon. Anyway, thanks for your help:)

GuillaumeLeclerc commented 2 years ago

Feel free to come with more information and we will do our best

GuillaumeLeclerc commented 2 years ago

Hi. I'm just going to close it temporarily just to avoid our roadmap to be cluttered but reopen it as soon you have more information and it will show up on our radar again

davidrzs commented 2 years ago

Had a similar problem with CIFAR10, where migrating from standard torchvision to FFCV increased the epoch training time from 3 minutes to 30 minutes. Calling resnet(img_batch.contiguous()) instead of resnet(img_batch) solved the problem. Now, FFCV is at performance parity with the torchvision pipeline. Feel free to dismiss my post as the issue is closed, but as I was having a similar problem, I thought I'd add my two cents.

andrewilyas commented 2 years ago

This could be due to poor support for (or improper use of) channels_last---what happens in your example if you remove your .contiguous() call, pass channels_last=False to ToTorchImage and then also get rid of the model.to(memory_format=channels_last) call?

GuillaumeLeclerc commented 2 years ago

Is it happening with the latest version of Cuda, cudnn and pytorch ?

davidrzs commented 2 years ago

Sorry for my late reply; had some deadlines. I have reverted to the standard PyTorch pipeline as it was decent enough. PyTorch was 1.10, CUDA 11.2 and cudnn 8200. But it is very likely my fault, as I was using a normal PyTorch transform in the pipeline too.

libffcv / ffcv

Cifar10 example: very slow training speed on P100 GPU #64