libffcv / ffcv

FFCV: Fast Forward Computer Vision (and other ML workloads!)
https://ffcv.io
Apache License 2.0

Fix random translate operation using uninitialized memory #254

Closed. KellerJordan closed this 1 year ago

KellerJordan commented 1 year ago

The default random translate operation appears to use uninitialized memory. This makes results nondeterministic, e.g. in CIFAR-10 training runs when the GPU is also in use by another process. This PR fixes the issue.
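To illustrate the failure mode, here is a minimal NumPy sketch of a pad-and-crop translate. It is not FFCV's implementation: `random_translate`, its parameters, and the `rng` argument are hypothetical names chosen for this example. The point is only that a scratch buffer obtained from `np.empty` holds arbitrary bytes, so the border has to be written with the fill value before the image is copied in.

```python
import numpy as np


def random_translate(images, padding, fill=0, rng=None):
    """Shift each NHWC image by a random offset of up to `padding` pixels.

    Hypothetical illustration only; not the FFCV RandomTranslate code.
    """
    rng = np.random.default_rng() if rng is None else rng
    n, h, w, c = images.shape

    # np.empty does not zero its memory, so the contents are arbitrary.
    padded = np.empty((n, h + 2 * padding, w + 2 * padding, c), dtype=images.dtype)

    # Write the fill value everywhere first. Skipping this line leaves the
    # border uninitialized, which is the kind of bug described above.
    padded[:] = fill

    # Place the original images in the centre of the padded buffer.
    padded[:, padding:padding + h, padding:padding + w] = images

    out = np.empty_like(images)
    for i in range(n):
        # Offsets in [0, 2 * padding] correspond to shifts in [-padding, padding].
        dy, dx = rng.integers(0, 2 * padding + 1, size=2)
        out[i] = padded[i, dy:dy + h, dx:dx + w]
    return out
```

With the fill step in place, every output pixel is either an original pixel or the fill value, and two calls with identically seeded generators return identical arrays.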

[Screenshot attached: Screen Shot 2022-09-10 at 2:11:54 AM]

The behavior can be reproduced with the following script:

from tqdm import tqdm
import numpy as np
import matplotlib.pyplot as plt
import torch
import torchvision
import torchvision.transforms as T
from ffcv.fields.decoders import IntDecoder, SimpleRGBImageDecoder
from ffcv.loader import Loader, OrderOption
from ffcv.transforms import RandomTranslate, Convert, ToDevice, ToTensor, ToTorchImage
from ffcv.transforms.common import Squeeze

CIFAR_MEAN = [125.307, 122.961, 113.8575]
CIFAR_STD = [51.5865, 50.847, 51.255]
denormalize = T.Normalize(-np.array(CIFAR_MEAN)/np.array(CIFAR_STD), 1/np.array(CIFAR_STD))

label_pipeline = [IntDecoder(), ToTensor(), ToDevice('cuda:0'), Squeeze()]
image_pipeline = [
    SimpleRGBImageDecoder(),
    RandomTranslate(padding=4),
    ToTensor(),
    ToDevice('cuda:0', non_blocking=True),
    ToTorchImage(),
    Convert(torch.float16),
    T.Normalize(CIFAR_MEAN, CIFAR_STD),
]

loader = Loader('/tmp/cifar_train.beton',
                batch_size=512,
                num_workers=8,
                order=OrderOption.RANDOM,
                drop_last=True,
                pipelines={'image': image_pipeline,
                           'label': label_pipeline})

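# Run two epochs and plot the first 8 augmented images from each.
for _ in range(2):
    imgs_t = []
    for inputs, _ in tqdm(loader):
        # Keep an independent float32 copy of every batch.
        img_t = inputs.float()
        imgs_t.append(img_t.clone())

    # Undo the normalization, then rescale to [0, 1] for imshow.
    img_t = denormalize(imgs_t[0][:8].cpu())
    img_t1 = torchvision.utils.make_grid(img_t, nrow=4) / 255
    plt.figure(figsize=(20, 20))
    plt.imshow(img_t1.permute(1, 2, 0).cpu().numpy())
    plt.show()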

andrewilyas commented 1 year ago

I think this is a duplicate of #184 which will hopefully get merged in soon!

KellerJordan commented 1 year ago

Oh, ok thanks!