Closed: arp95 closed this issue 3 years ago
Hi, there are two issues you need to think about.

1. Image generation in batches. I personally think padding should be avoided, as it can change both image quality and the convolution results. I grouped images by resolution, so that the images served in each batch have the same resolution.

2. TRIQ can basically handle arbitrary resolutions. However, when I developed the first version of TRIQ (i.e., this repo), I simply used the largest resolution in the image set. Therefore, you can define maximum_position_encoding (line 127 in transformer_iqa.py) according to your images. This value should be set to H*W/(32*32) + 1, where H and W are the largest height and width in your image set. I have since improved TRIQ with a spatial pooling method and obtained comparable performance; I will release it later. For now, if you want to test TRIQ, you can just set maximum_position_encoding.
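As a minimal sketch of that calculation (the helper name is mine, not part of the repo; the only real knob is maximum_position_encoding at line 127 of transformer_iqa.py, and the 32x32 factor is the one stated above):

```python
# Hypothetical helper: derive maximum_position_encoding from the
# largest resolution (H, W) present in your image set.
def max_position_encoding(max_h: int, max_w: int) -> int:
    return (max_h * max_w) // (32 * 32) + 1

# e.g. if the largest images are 2000x2000:
# (2000 * 2000) // 1024 + 1 = 3907
print(max_position_encoding(2000, 2000))  # 3907
```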
The grouping solution sounded the best to me. But the problem with our dataset is that we don't have an equal distribution of classes across the images of every possible resolution. For example, for images of size 2000x2000 we might have only three of the four classes, which is why I couldn't go ahead with the grouping approach. The best solution I could see was to use padding only during the training phase, which would give a fixed-size feature map on top of which the transformer could be used (a sketch of the idea follows below). What do you think about this?
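Here is a minimal sketch of the zero-padding I mean (pad_to_max is a hypothetical helper, not TRIQ code; it assumes HxWxC NumPy images and pads at the bottom/right up to the dataset's largest resolution):

```python
import numpy as np

def pad_to_max(image: np.ndarray, max_h: int, max_w: int) -> np.ndarray:
    """Zero-pad an HxWxC image at the bottom/right to (max_h, max_w)."""
    h, w, c = image.shape
    padded = np.zeros((max_h, max_w, c), dtype=image.dtype)
    padded[:h, :w, :] = image
    return padded

# e.g. a 1080x1080 RGB image padded up to 2000x2000
img = np.random.randint(0, 256, (1080, 1080, 3), dtype=np.uint8)
print(pad_to_max(img, 2000, 2000).shape)  # (2000, 2000, 3)
```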
Hi, the way I handle this situation is to carefully split the train/val/test sets and then use augmentation (in my case only horizontal flips) to make sure the images in each batch have the same resolution. I first group the images by resolution. Then in each batch (probably not with a large batch size), I serve only images of the same resolution. If the number of images with a given resolution is smaller than the batch size, I fill the batch with duplicates of those images and their horizontally flipped versions. I personally don't think padding is a good solution, as it potentially changes image quality and definitely changes the convolutional results.
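For concreteness, a minimal sketch of that batching scheme (the generator name and the plain NumPy structure are my own illustration, not code from this repo):

```python
import random
from collections import defaultdict
import numpy as np

def resolution_grouped_batches(images, labels, batch_size):
    """Yield batches in which every image has the same resolution.

    Short resolution groups are filled with duplicates and their
    horizontally flipped copies, as described above.
    """
    groups = defaultdict(list)
    for img, lab in zip(images, labels):
        groups[img.shape[:2]].append((img, lab))  # group by (H, W)

    for samples in groups.values():
        random.shuffle(samples)
        for i in range(0, len(samples), batch_size):
            batch = samples[i:i + batch_size]
            # Fill a short batch with horizontally flipped duplicates.
            j = 0
            while len(batch) < batch_size:
                img, lab = batch[j % len(batch)]
                batch.append((np.fliplr(img), lab))
                j += 1
            xs, ys = zip(*batch)
            yield np.stack(xs), np.array(ys)
```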
Hi,
Could you please tell me how you handled different image sizes as input during the training phase? Let's say we have three images of sizes 1080x1080, 1608x1608, and 2000x2000. If we give these images as input to the network during training, how is this taken care of? Were the images padded with zeros up to the maximum resolution? Thanks.