libffcv / ffcv

FFCV: Fast Forward Computer Vision (and other ML workloads!)
https://ffcv.io
Apache License 2.0

Top-1 accuracy on ImageNet drops between runs -- only difference is FFCV #352

Closed nelaturuharsha closed 11 months ago

nelaturuharsha commented 11 months ago

Hello,

I'm training a ResNet-50 on ImageNet for a project and noticed the following issue:

For some context

import numpy as np
import torch

from ffcv.fields.decoders import (CenterCropRGBImageDecoder, IntDecoder,
                                  RandomResizedCropRGBImageDecoder)
from ffcv.loader import Loader, OrderOption
from ffcv.transforms import (NormalizeImage, RandomHorizontalFlip, Squeeze,
                             ToDevice, ToTensor, ToTorchImage)


class FFCVImageNet:
    def __init__(self, args):
        super().__init__()

        data_root = '../imagenet-data/'

        IMAGENET_MEAN = np.array([0.485, 0.456, 0.406]) * 255
        IMAGENET_STD = np.array([0.229, 0.224, 0.225]) * 255
        DEFAULT_CROP_RATIO = 224 / 256

        train_image_pipeline = [RandomResizedCropRGBImageDecoder((224, 224)),
                                RandomHorizontalFlip(),
                                ToTensor(),
                                ToDevice(torch.device('cuda:0'), non_blocking=True),
                                ToTorchImage(),
                                NormalizeImage(IMAGENET_MEAN, IMAGENET_STD, np.float32)]

        val_image_pipeline = [CenterCropRGBImageDecoder((256, 256), ratio=DEFAULT_CROP_RATIO),
                              ToTensor(),
                              ToDevice(torch.device('cuda:0'), non_blocking=True),
                              ToTorchImage(),
                              NormalizeImage(IMAGENET_MEAN, IMAGENET_STD, np.float32)]

        label_pipeline = [IntDecoder(),
                          ToTensor(),
                          Squeeze(),
                          ToDevice(torch.device('cuda:0'), non_blocking=True)]

        self.train_loader = Loader(data_root + 'train_500_0.50_90.ffcv',
                                   batch_size=args.batch_size,
                                   num_workers=args.workers,
                                   order=OrderOption.QUASI_RANDOM,
                                   os_cache=True,
                                   drop_last=True,
                                   pipelines={'image': train_image_pipeline,
                                              'label': label_pipeline})

        self.val_loader = Loader(data_root + 'val_500_0.50_90.ffcv',
                                 batch_size=args.batch_size,
                                 num_workers=args.workers,
                                 order=OrderOption.SEQUENTIAL,
                                 drop_last=False,
                                 pipelines={'image': val_image_pipeline,
                                           'label': label_pipeline})

[Plot: training curves showing FFCV top-1 accuracy well below the PyTorch DataLoader baseline]

As you can see above, the performance is far worse when FFCV is used.

Would appreciate any insight into why this is happening and what could be done to improve.

Thanks!

andrewilyas commented 11 months ago

Hi @SreeHarshaNelaturu ! What training code are you using?

nelaturuharsha commented 11 months ago

Hi @andrewilyas, thanks for the prompt response. This is a custom harness we wrote ourselves for training and pruning networks. Could you let me know which specific parts I should send across?

By the way, I found one change that seems to have almost fixed the problem: passing shuffle_indices=True when regenerating the beton.
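For anyone else hitting this, roughly what the regeneration step looks like. The paths and field settings below are illustrative (matching the naming convention of my beton files), and `train_dataset` is a placeholder for your indexed dataset, so treat this as a sketch rather than my exact script:

```python
from ffcv.fields import IntField, RGBImageField
from ffcv.writer import DatasetWriter

# Illustrative settings: 500px max side, 50% raw/JPEG mix, quality 90,
# mirroring the train_500_0.50_90.ffcv naming convention.
writer = DatasetWriter('../imagenet-data/train_500_0.50_90.ffcv', {
    'image': RGBImageField(max_resolution=500,
                           compress_probability=0.50,
                           jpeg_quality=90),
    'label': IntField(),
}, num_workers=16)

# shuffle_indices=True writes samples to the beton in a random order, so that
# QUASI_RANDOM's page-level shuffling at load time no longer produces
# class-correlated batches from a class-sorted source dataset.
writer.from_indexed_dataset(train_dataset, shuffle_indices=True)
```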

The loss/accuracy curves now look like this: [screenshot of loss/accuracy curves]

Blue: FFCV (without shuffle_indices=True when creating the beton)
Orange: PyTorch DataLoader
Maroon: FFCV (with shuffle_indices=True when creating the beton)

Hope this helps others out; I think it is related to the use of OrderOption.QUASI_RANDOM, as indicated in issue #304.
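To sanity-check my understanding of why this happens, here is a small FFCV-free simulation. The page size and batch size are made-up numbers, not FFCV internals; the point is only that page-chunked quasi-random reading of a class-sorted file yields batches dominated by a handful of classes, while a shuffled file does not:

```python
import numpy as np

rng = np.random.default_rng(0)

# A "beton" written from a class-sorted dataset: 1000 classes x 100 images,
# stored contiguously class by class (what you get without shuffle_indices=True).
labels_sorted = np.repeat(np.arange(1000), 100)

PAGE = 200    # illustrative chunk size for quasi-random reading
BATCH = 256

def quasi_random_order(n, page, rng):
    """Shuffle the order of pages, keep samples contiguous within each page."""
    pages = np.arange(n).reshape(-1, page)
    return pages[rng.permutation(len(pages))].ravel()

def mean_unique_labels(order, labels, batch):
    """Average number of distinct labels per batch under a given sample order."""
    batches = order[: len(order) // batch * batch].reshape(-1, batch)
    return float(np.mean([len(np.unique(labels[b])) for b in batches]))

# Case 1: class-sorted beton + quasi-random loading -> very few labels per batch
qr = mean_unique_labels(quasi_random_order(len(labels_sorted), PAGE, rng),
                        labels_sorted, BATCH)

# Case 2: beton written with shuffled indices -> labels spread across all pages
labels_shuffled = labels_sorted[rng.permutation(len(labels_sorted))]
sh = mean_unique_labels(quasi_random_order(len(labels_shuffled), PAGE, rng),
                        labels_shuffled, BATCH)

print(f"sorted beton:   ~{qr:.1f} distinct labels per {BATCH}-sample batch")
print(f"shuffled beton: ~{sh:.1f} distinct labels per {BATCH}-sample batch")
```

With these numbers, the sorted case collapses to only a few classes per batch, which plausibly explains the accuracy gap in the blue curve.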

Regarding speed-up

I also tested this on the CelebA dataset (I can provide code to reproduce): there was little to no speedup from using FFCV, and on ImageNet the throughput gains were also small (38 min vs. 41 min) [Device: NVIDIA A6000].

Is this speedup only to be expected on mixed-precision training?
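For context, my runs were plain fp32. As far as I can tell, the FFCV ImageNet example trains with channels-last memory format and AMP autocast; a minimal sketch of that training-step pattern (the tiny model and random data here are stand-ins for a real ResNet-50 and loader):

```python
import torch
import torch.nn as nn

# Stand-in model; a real run would use a ResNet-50 and an FFCV Loader.
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                      nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                      nn.Linear(8, 10)).to(device)
model = model.to(memory_format=torch.channels_last)  # NHWC layout for tensor cores
opt = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler(enabled=(device == 'cuda'))

images = torch.randn(4, 3, 32, 32, device=device).to(memory_format=torch.channels_last)
labels = torch.randint(0, 10, (4,), device=device)

opt.zero_grad(set_to_none=True)
# autocast runs the forward pass in reduced precision where it is safe to do so
with torch.autocast(device_type=device, enabled=(device == 'cuda')):
    loss = nn.functional.cross_entropy(model(images), labels)
scaler.scale(loss).backward()   # loss scaling only kicks in on CUDA
scaler.step(opt)
scaler.update()
print(loss.item())
```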

(Please let me know if it's better to open a separate issue for the speedup-related component.)

Cheers!