pytorch / vision

Datasets, Transforms and Models specific to Computer Vision
https://pytorch.org/vision
BSD 3-Clause "New" or "Revised" License

Image Augmentations on GPU Tests #483

Open felipecode opened 6 years ago

felipecode commented 6 years ago

Hello Pytorch vision people !

I am currently working on a project that requires lots of image augmentation
to perform well, and I believe this is not only my case. When reading
about topics such as domain randomization, we see that large variations on the images
lead to much better generalization.

I saw that PyTorch does not seem to provide a way to perform
any image augmentation on the GPU, as commented in #45. In some posts I saw people
discouraging it (https://discuss.pytorch.org/t/preprocess-images-on-gpu/5096), but I really disagree, especially for cases where several augmentations are applied.
To show this point, I provide a gist with an example illustrating the possible speed-up gains on a multiplication operation (a brightness augmentation?):

https://gist.github.com/felipecode/f3531e2d04e846da99053aff16b06028

In the gist, I show a GPU augmentation interface working as follows:

no_aug_trans = transforms.Compose([transforms.RandomResizedCrop(224), transforms.ToTensor()])
dataset = datasets.ImageFolder(data_path, transform=no_aug_trans)
# DataLoader with multiple workers for the CPU-side loading and decoding
data_loader = torch.utils.data.DataLoader(dataset, batch_size=batch_size, num_workers=num_workers)
multiply_gpu = transforms.Compose([ToGPU()] + [Multiply(1.01)] * number_of_multiplications)
for data in data_loader:
    image, labels = data
    result = multiply_gpu(image)
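
ToGPU and Multiply here are not torchvision transforms; they come from the gist. A minimal sketch of what they could look like, assuming they simply wrap element-wise tensor operations:

import torch

class ToGPU:
    """Move a tensor (or batch of image tensors) onto the GPU."""
    def __call__(self, tensor):
        return tensor.cuda(non_blocking=True)

class Multiply:
    """Multiply pixel values by a constant factor (a simple brightness change)."""
    def __init__(self, factor):
        self.factor = factor

    def __call__(self, tensor):
        return tensor * self.factor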

Unfortunately, the GPU augmentation could not be smoothly interfaced with the data loader without sacrificing multi-threaded data reading. However, the speed-ups obtained seem promising. The following plot is produced when running the gist code with a TITAN Xp GPU and an Intel(R) Xeon(R) CPU E5-1620 v3 @ 3.50GHz; note that I removed the loading time when plotting.

[Plot: computation time vs. number of multiplications]

The plot shows the computation time as a function of the number of multiplications. For this test, at each data point about 500 RGB images of 224x224 are multiplied by a constant.

Of course, there is no clear reason why one would do 60 multiplications. However, I implemented a small library, using the imgaug library as a reference, with more functions running on the GPU. For the following augmentation set used in my project, I obtained about a 3-4x speed-up:

transforms.Compose([ToGPU(), Add((-5, 5)), Multiply((0.9, 1.1)), Dropout(0.2), AdditiveGaussianNoise(0.10 * 255), GaussianBlur(sigma=(0.0, 3.0)), ContrastNormalization((0.5, 1.5))])
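
These transform names follow imgaug rather than torchvision. Purely as an illustration (not the author's library), a couple of them can be written as batched tensor operations that run on the GPU once the batch has been moved there:

import torch

def additive_gaussian_noise(batch, scale=0.10 * 255):
    # batch: float tensor of shape (N, C, H, W) in the 0-255 range, already on the GPU
    return batch + torch.randn_like(batch) * scale

def contrast_normalization(batch, alpha_range=(0.5, 1.5)):
    # Per-image random alpha, centred at 128 as in imgaug's ContrastNormalization
    n = batch.size(0)
    alpha = torch.empty(n, 1, 1, 1, device=batch.device).uniform_(*alpha_range)
    return (batch - 128.0) * alpha + 128.0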

This speed up is even higher if more augmentations are added.

So, how can I improve this API? How could something like this fit into a pull request? How can this be merged more smoothly into the data loader while keeping multi-threaded data reading? I still have to test the training time for the full system, but I don't believe there will be any overhead, since images have to be copied to the GPU anyway.

fmassa commented 6 years ago

I think that if we want to use GPU preprocessing in the data loader we would be restraining our users to use Python 2, which might be a bit too much.

Also, I'm not 100% convinced that in the setup that you showed it would be better to perform operations on the GPU.

The reason why I'm not convinced is that if we perform all the data augmentation on the CPU, then the GPU is free to run (asynchronously!) the network, while the different threads of the data loader will be loading and preprocessing data in the background. If we have transforms in the GPU, then the data augmentation and the network will be competing for resources, making either the network run slower or the batches with the data not being ready when the network has finished the batch.
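
For reference, this overlap is what the usual training loop already exploits; a generic sketch (dataset, model, criterion and optimizer are placeholders, not code from this thread):

import torch
from torch.utils.data import DataLoader

loader = DataLoader(dataset, batch_size=32, num_workers=4, pin_memory=True)

for images, labels in loader:
    # CPU workers are already decoding and augmenting the next batches here
    images = images.cuda(non_blocking=True)   # async copy from pinned memory
    labels = labels.cuda(non_blocking=True)
    outputs = model(images)                   # GPU kernels are queued asynchronously
    loss = criterion(outputs, labels)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()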

Did you have the chance to see if performing the operations on the GPU was actually useful in a training pipeline?

dmenig commented 5 years ago

I have the same ideas. In my experience, I have a multithreaded pipeline to train my model, and the training thread (or process) is always waiting for the preprocessing (which includes augmentations) to finish. This is especially true for models that have a relatively low number of computations per image, as in video deep learning.

I'm looking into augmentations on the GPU. I just found out that OpenCV's Python bindings don't allow that.

fmassa commented 5 years ago

@hyperfraise what kinds of augmentations are you looking for? Now that we have grid_sample, it's possible to perform rotation/warping/scaling/etc. on both the CPU and the GPU in a very efficient manner, and it could cover a number of use cases.
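
To illustrate, a batched random rotation built from affine_grid and grid_sample runs unchanged on CPU or GPU (a sketch, assuming a float image batch of shape (N, C, H, W)):

import math
import torch
import torch.nn.functional as F

def random_rotate(batch, max_deg=10.0):
    # batch: float tensor of shape (N, C, H, W); the same code runs on CPU or GPU
    n = batch.size(0)
    angles = (torch.rand(n, device=batch.device) * 2 - 1) * math.radians(max_deg)
    cos, sin = torch.cos(angles), torch.sin(angles)
    theta = torch.zeros(n, 2, 3, device=batch.device)
    theta[:, 0, 0], theta[:, 0, 1] = cos, -sin
    theta[:, 1, 0], theta[:, 1, 1] = sin, cos
    grid = F.affine_grid(theta, batch.size())   # sampling grid of shape (N, H, W, 2)
    return F.grid_sample(batch, grid)           # bilinear resampling in one kernel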

dmenig commented 5 years ago

Well, of course I use a bit more than that, but even that would be very appreciable. I wonder, though, whether it would be that much more efficient for, say, 720p images, or even 224x224x3 like I use (since there is the transfer to take into account, and maybe Python would be slower, even on the GPU, than OpenCV's optimized underlying C code).

Anyways, I'm using:

- flipping
- cropping
- padding
- pixel value reassignment (with LUT matrices in cv2.LUT)
- blurring and sharpening
- resizing (not an augmentation though)
- pixel-wise differences of multiple images (not as an augmentation either)
- channel permutation
- noise
- inverting noise (= doing x -> 1 - x for a random subsample of the pixels; see the sketch after this list)
- salt and pepper noise
- shape drawing (not really doable easily in PyTorch, I guess)
- text writing on the images
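
Several of these are simple element-wise operations once the images are tensors; for example, the inverting noise and salt-and-pepper noise could look like this on a GPU batch (an illustrative sketch, not this pipeline, assuming images scaled to [0, 1]):

import torch

def invert_noise(batch, p=0.05):
    # Apply x -> 1 - x to a random subsample of the pixels
    mask = torch.rand_like(batch) < p
    return torch.where(mask, 1.0 - batch, batch)

def salt_and_pepper(batch, p=0.02):
    r = torch.rand_like(batch)
    out = torch.where(r < p / 2, torch.zeros_like(batch), batch)    # pepper
    return torch.where(r > 1 - p / 2, torch.ones_like(batch), out)  # salt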

I'm actually working with video, so doing all of this on cpu is pretty costly.

fmassa commented 5 years ago

BTW, have you looked at https://github.com/NVIDIA/nvvl ?

Note that you can combine flipping / cropping / padding / resizing into a single kernel launch using grid_sample, so I'd recommend having a look at that.
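
As a sketch of that idea (illustrative only), a horizontal flip, a centre crop and a resize can all be folded into one affine matrix and executed with a single grid_sample call:

import torch
import torch.nn.functional as F

def flip_crop_resize(batch, out_h, out_w, crop_scale=0.8, hflip=True):
    # One affine matrix per image encodes flip + centre crop + resize;
    # a scale > 1 would instead produce zero padding around the image.
    n, c = batch.size(0), batch.size(1)
    theta = torch.zeros(n, 2, 3, device=batch.device)
    theta[:, 0, 0] = crop_scale * (-1.0 if hflip else 1.0)  # negative x-scale = horizontal flip
    theta[:, 1, 1] = crop_scale                              # scale < 1 samples only the centre (crop)
    grid = F.affine_grid(theta, (n, c, out_h, out_w))        # output size sets the resize
    return F.grid_sample(batch, grid)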

ksnzh commented 5 years ago

Like https://github.com/NVIDIA/DALI

JC-S commented 5 years ago

> I think that if we want to use GPU preprocessing in the data loader we would be restraining our users to use Python 2, which might be a bit too much.
>
> Also, I'm not 100% convinced that in the setup that you showed it would be better to perform operations on the GPU.
>
> The reason why I'm not convinced is that if we perform all the data augmentation on the CPU, then the GPU is free to run (asynchronously!) the network, while the different threads of the data loader will be loading and preprocessing data in the background. If we have transforms in the GPU, then the data augmentation and the network will be competing for resources, making either the network run slower or the batches with the data not being ready when the network has finished the batch.
>
> Did you have the chance to see if performing the operations on the GPU was actually useful in a training pipeline?

I'm very curious about how Python 2 could help with preprocessing on the GPU. Could you elaborate on that, please?

stmax82 commented 4 years ago

@fmassa

> The reason why I'm not convinced is that if we perform all the data augmentation on the CPU, then the GPU is free to run (asynchronously!) the network, while the different threads of the data loader will be loading and preprocessing data in the background.

Here is me trying to make my computer tell cats from dogs using a simple CNN and image augmentation with PIL-based transforms:

[Screenshot: CPU and GPU utilization while training]

If I use image augmentation, CPU usage is almost at 100% while the GPU sits idle. If I remove the image augmentation, GPU usage goes up to almost 100%.

Also, it makes almost no difference whether I train the net on the GPU or the CPU, because the augmentation takes way longer anyway...

dmenig commented 4 years ago

I was told that OpenCV 4 makes it possible to do augmentations on the GPU in Python. I haven't tested it.

rohit-gupta commented 4 years ago

@stmax82 Maybe you could look into NVIDIA's DALI framework.