libffcv / ffcv

FFCV: Fast Forward Computer Vision (and other ML workloads!)
https://ffcv.io
Apache License 2.0

TypeError in NormalizeImage #80

Closed PiaCuk closed 2 years ago

PiaCuk commented 2 years ago

I'm trying to train a model on ImageNet with FFCV. I created a conda environment as described in install.sh and wrote ImageNet to a .ffcv file with ./write_imagenet.sh 500 0.50 90 from ffcv-imagenet. This is the error I get:

Exception in thread Thread-12:
Traceback (most recent call last):
  File "miniconda/envs/ffcv/lib/python3.9/threading.py", line 973, in _bootstrap_inner
    self.run()
  File "miniconda/envs/ffcv/lib/python3.9/site-packages/ffcv/loader/epoch_iterator.py", line 80, in run
    result = self.run_pipeline(b_ix, ixes, slot, events[slot])
  File "miniconda/envs/ffcv/lib/python3.9/site-packages/ffcv/loader/epoch_iterator.py", line 134, in run_pipeline
    result = code(*args)
  File "", line 2, in stage_1
  File "miniconda/envs/ffcv/lib/python3.9/site-packages/ffcv/transforms/normalize.py", line 85, in normalize_convert
    return final_result.view(final_type)
TypeError: view(): argument 'size' (position 1) must be tuple of ints, not torch.dtype

I replaced the DataLoader of a working PyTorch training pipeline with this:

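import numpy as np
from ffcv.loader import Loader, OrderOption
from ffcv.fields.basics import IntDecoder
from ffcv.fields.rgb_image import (CenterCropRGBImageDecoder,
                                   RandomResizedCropRGBImageDecoder)
from ffcv.transforms import (NormalizeImage, RandomHorizontalFlip, Squeeze,
                             ToDevice, ToTensor, ToTorchImage)

# Per-channel ImageNet statistics, scaled to the 0-255 pixel range
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406]) * 255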
IMAGENET_STD = np.array([0.229, 0.224, 0.225]) * 255
RES_TUPLE = (224, 224)
DEFAULT_CROP_RATIO = 224/256

def FFCV_ImageNet_loader(data_path, batch_size, device, train, workers=4, in_memory=False):
    """
    (param) in_memory (bool): Does the dataset fit in memory?
    """
    if train:
        decoder = RandomResizedCropRGBImageDecoder(RES_TUPLE)
        image_pipeline = [
            decoder,
            RandomHorizontalFlip(),
            ToTensor(),
            ToDevice(device, non_blocking=True),
            ToTorchImage(),
            NormalizeImage(IMAGENET_MEAN, IMAGENET_STD, np.float16)
        ]
    else:
        decoder = CenterCropRGBImageDecoder(RES_TUPLE, ratio=DEFAULT_CROP_RATIO)
        image_pipeline = [
            decoder,
            ToTensor(),
            ToDevice(device, non_blocking=True),
            ToTorchImage(),
            NormalizeImage(IMAGENET_MEAN, IMAGENET_STD, np.float16)
        ]

    label_pipeline = [
        IntDecoder(),
        ToTensor(),
        Squeeze(),
        ToDevice(device, non_blocking=True)
    ]

    order = OrderOption.QUASI_RANDOM if train else OrderOption.SEQUENTIAL
    loader = Loader(data_path,
                    batch_size=batch_size,
                    num_workers=workers,
                    order=order,
                    os_cache=in_memory,
                    drop_last=train,
                    pipelines={
                        'image': image_pipeline,
                        'label': label_pipeline
                    },
                    distributed=False)

    return loader

Any ideas on what is causing the problem?

andrewilyas commented 2 years ago

Hi @PiaCuk ! Can you post the versions of Python, PyTorch, and NumPy you are using?
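One way to collect the requested versions is sketched below. It uses only the standard library (`importlib.metadata`, Python 3.8+); the helper name `report_versions` and the default package list are illustrative assumptions, not part of FFCV.

```python
import sys
from importlib import metadata


def report_versions(packages=("torch", "numpy", "torchvision", "ffcv")):
    """Return a dict mapping 'python' and each package name to a version string."""
    versions = {"python": sys.version.split()[0]}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Package is not installed in the active environment
            versions[name] = "not installed"
    return versions


for name, version in report_versions().items():
    print(f"{name}: {version}")
```

Running this inside the activated conda environment shows the versions the training script actually sees, which can differ from those installed system-wide.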

PiaCuk commented 2 years ago

Hey, thanks for the quick reply! I was just about to post an update. I'm using Python 3.9.9 and NumPy 1.21.5, and I just updated PyTorch from 1.7.1 to 1.10.1. My models are ResNets from torchvision 0.11.2. Here's the new error message:

Traceback (most recent call last):
  File "main.py", line 49, in <module>
    ImageNet_experiment(**params)
  File "imagenet.py", line 95, in ImageNet_experiment
    acc = distiller.train_student(**params, smooth_teacher=False)
  File "Tf_KD/virtual_teacher.py", line 123, in train_student
    student_out = self.student_model(data)
  File "miniconda/envs/ffcv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "models/resnet.py", line 35, in forward
    return self.resnet_model(X)
  File "miniconda/envs/ffcv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "miniconda/envs/ffcv/lib/python3.9/site-packages/torchvision/models/resnet.py", line 249, in forward
    return self._forward_impl(x)
  File "miniconda/envs/ffcv/lib/python3.9/site-packages/torchvision/models/resnet.py", line 232, in _forward_impl
    x = self.conv1(x)
  File "miniconda/envs/ffcv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "miniconda/envs/ffcv/lib/python3.9/site-packages/torch/nn/modules/conv.py", line 446, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "miniconda/envs/ffcv/lib/python3.9/site-packages/torch/nn/modules/conv.py", line 442, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: Input type (torch.cuda.HalfTensor) and weight type (torch.cuda.FloatTensor) should be the same

andrewilyas commented 2 years ago

Oh ok! I think this error now has nothing to do with NormalizeImage. The problem is just that your model is in full precision while FFCV is loading the data in half precision. You can either convert the training code to work with half precision (which I would recommend, as you will see significant speedups even outside of data loading), or load the data in full precision by replacing np.float16 with np.float32 in the pipeline.

For information about using half-precision training, see e.g., https://pytorch.org/tutorials/recipes/recipes/amp_recipe.html
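The two options above might look roughly like the following minimal sketch. It assumes PyTorch 1.10 or newer (for `torch.autocast`). The small `nn.Conv2d` layer and the random float16 batch are stand-ins for the ResNet and the FFCV loader output, not the code from this issue.

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"

# Stand-in for the ResNet: a single conv layer with float32 weights
model = nn.Conv2d(3, 8, kernel_size=3).to(device)

# Stand-in for a batch produced by NormalizeImage(..., np.float16)
batch = torch.randn(4, 3, 32, 32, device=device).half()

# Option 1: feed full-precision data. With FFCV this corresponds to passing
# np.float32 to NormalizeImage; here the batch is simply cast to float32.
out_fp32 = model(batch.float())
print(out_fp32.dtype)

# Option 2: keep float16 data and run the forward pass under autocast, which
# casts the inputs and weights of eligible ops (such as conv2d) to float16.
# Only exercised here when a GPU is available.
if device == "cuda":
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        out_amp = model(batch)
    print(out_amp.dtype)
```

For actual training under option 2, the linked recipe also wraps the backward pass with `torch.cuda.amp.GradScaler` to avoid gradient underflow in float16.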

PiaCuk commented 2 years ago

Thank you, this makes a lot of sense now! I will look into it.