Closed — mehdi-elion closed this pull request 9 months ago
Hello @staghado ! I tried to implement bicubic interpolation for image resizing. If I'm not mistaken, it still appears in your TODO list, so I figured it could help :) I tried to be as thorough as possible in the description, but don't hesitate to reach out to me for discussion ;)
Thanks again for this great repo (to which I'd be more than happy to contribute) 👌
Hello @mehdi-elion,
First of all thank you for your great work. It is indeed a necessary thing to add, so the training transformation matches the inference one.
I will delve into the details of the implementation, carry out some tests, then get back to you soon!
I have conducted some tests on my end:
It seems to be working fine for the images I tried. I did some comparisons with the bicubic interpolation used in torchvision (PIL under the hood) and found that they are not exactly the same. Some reasons I could think of for why this is the case:
Note: the intensity is just the minmax-normalized absolute difference between the two images.
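For reference, that intensity map can be computed with a few lines of NumPy. This is a sketch (the function name is mine, not the exact code used for the figures):

```python
import numpy as np

def minmax_normalized_diff(image_a, image_b, eps=1e-12):
    """Min-max normalized absolute difference between two same-shape images."""
    d = np.abs(image_a.astype(np.float64) - image_b.astype(np.float64))
    # eps guards against division by zero when the two images are identical
    return (d - d.min()) / (d.max() - d.min() + eps)
```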
But to be fair, it seems there are differences between OpenCV and PIL too:
Hi @staghado,
Thank you very much for your feedback and the tests you've carried out. The results you shared are very interesting, and I think it's worth having a look at :) I'll try to investigate a bit based on your feedback & results and I'll come back to you then 👌
Hello @staghado ! Sorry for the late message. I eventually investigated a bit to complement your last comment : )
First of all, I ran the same tests as you did and found the same results:
I added a few plots and metrics to visualize and quantify those differences:
- `difference` refers to the max (along the channel axis) absolute difference between images (images are scaled from 0 to 1 after dividing by 255)
- `avg difference` refers to the sum of pixel-wise differences divided by the number of pixels, i.e. np.sum(np.abs(image_a - image_b)) / (w * h)
- `max difference` refers to the maximum of the pixel-wise differences, i.e. np.max(np.abs(image_a - image_b))
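These metrics are straightforward to compute with NumPy; here is a minimal sketch (the function name is mine, not from the PR):

```python
import numpy as np

def resize_diff_metrics(image_a, image_b):
    """Compare two resized images given as uint8 HxWxC arrays."""
    a = image_a.astype(np.float64) / 255.0  # scale to [0, 1]
    b = image_b.astype(np.float64) / 255.0
    diff = np.abs(a - b)
    h, w = diff.shape[:2]
    return {
        "difference_map": diff.max(axis=-1),     # per-pixel max over channels
        "avg_difference": diff.sum() / (w * h),  # np.sum(|a-b|) / (w*h)
        "max_difference": diff.max(),            # np.max(|a-b|)
    }
```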
Here are some examples of such plots & metrics (but you can find all of them here)
It seems there is a slight extra difference between torch and vit.cpp compared to the torch vs. cv2 difference, even though both are rather negligible with respect to the overall pixel values (hopefully ^^’).
After examining the plots (differences), it seems that most of the difference lies at the edges. The implementation we have seems to render sharp shapes better than torch does. See the armadillo example below (the stripes on its scales are better rendered than with torch).
(from left to right: original, vit.cpp, torch)
A quick look at PIL’s GitHub repo led me to those snippets (I’m not 100% sure they are the source code of the bicubic implementation we were looking for, but that’s the closest I’ve found so far):
The last one suggests that PIL is using the bicubic convolution algorithm, which can be seen (if I understood correctly) as an approximation of the original bicubic interpolation algorithm.
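For context, the cubic convolution algorithm builds each output pixel from a weighted sum of the 4 nearest samples per axis, with weights given by a piecewise-cubic kernel (Keys' kernel, with a = -0.5 being the classic choice). A minimal sketch of that kernel, as an illustration rather than a copy of any library's code:

```python
def cubic_kernel(x: float, a: float = -0.5) -> float:
    """Keys cubic convolution kernel; a = -0.5 is the classic choice."""
    x = abs(x)
    if x <= 1.0:
        return (a + 2.0) * x**3 - (a + 3.0) * x**2 + 1.0
    if x < 2.0:
        return a * x**3 - 5.0 * a * x**2 + 8.0 * a * x - 4.0 * a
    return 0.0
```

The kernel is interpolating (weight 1 at 0, weight 0 at the other integers) and its integer shifts sum to 1, so constant regions pass through resizing unchanged.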
All in all, if I sum things up by going through your hypotheses one by one:
- the normalization performed after resizing shouldn’t affect the images we observe, as we don’t normalize them (they are separate images I created just for debugging, and they don’t undergo normalization) ❌
- I believe “PNG (lossless) vs JPEG (lossy)” shouldn’t be an issue either, as it is the same image that goes through the different bicubic implementations ❌
- I looked into the “data types and casting” hypothesis but couldn’t find anything conclusive ❓
- I believe the “different bicubic interpolation formulae” hypothesis is the right one, for the reasons mentioned above ✅ 👍 (side note: I read that some implementations may add an offset from -0.5 to -2 to the new pixel coordinates to prevent the interpolated image from shifting, but I doubt this is what we're looking for ^^')
If that’s correct, it means that each library is likely to have its own variant of bicubic interpolation, with their respective results being fairly similar (even though not strictly identical).
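As an illustration of that last point, here is a small 1-D sketch (not any library's actual code) interpolating the same samples with two common variants of the cubic convolution kernel family, parameterized by a: a = -0.5 (commonly attributed to PIL) and a = -0.75 (commonly attributed to OpenCV). The two variants give close but not identical values:

```python
import math

def cubic_kernel(x: float, a: float) -> float:
    """Piecewise-cubic convolution kernel; `a` is the free parameter."""
    x = abs(x)
    if x <= 1.0:
        return (a + 2.0) * x**3 - (a + 3.0) * x**2 + 1.0
    if x < 2.0:
        return a * x**3 - 5.0 * a * x**2 + 8.0 * a * x - 4.0 * a
    return 0.0

def cubic_interp_1d(samples, x, a):
    """Interpolate 1-D `samples` at fractional position x (4-tap window)."""
    i = math.floor(x)
    return sum(samples[i + n] * cubic_kernel(x - (i + n), a) for n in range(-1, 3))

samples = [0.0, 1.0, 4.0, 9.0, 16.0]
v_half = cubic_interp_1d(samples, 1.5, a=-0.5)     # -> 2.25
v_three_q = cubic_interp_1d(samples, 1.5, a=-0.75)  # -> 2.125
```

Both results are plausible interpolants, which matches the observation that the per-pixel differences between libraries are small but nonzero.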
Fantastic stuff here, I enjoyed reading your analysis, it is really thorough and provides clear insights. I feel that we have a better understanding of what is done in the different libraries with some using convolution variants while others use equations based on neighboring pixels.
I am okay with merging the current implementation of bicubic interpolation. We can verify whether it affects performance too much later.
I have been working on a benchmarking script to assess performance on the ImageNet1k dataset. I can try with bicubic and bilinear and see what results we get.
Thank you very much for your feedback on the analysis, I'm glad it helped 👍 I'd be happy to have it merged and run your benchmark on it (and the bilinear interpolation) as you suggested.
If you want I can submit an extra commit to delete the part of the code which outputs the resized image: it was only for debugging and removing it will save some memory usage.
Let me know if that's ok for you : )
Yes you can make the last adjustments before merging.
I just pushed the commit removing the debug-related part (which outputs the resized image). Let me know if you need any extra adjustments before you merge it ; )
Done!
Goal of this PR
This PR aims to enable bicubic interpolation for image resizing. This should help the model run inference in conditions similar to the ones seen during training. Hopefully, this should also help in terms of performance.
Content of this PR
This PR proposes:
- a `vit_image_preprocess` function enabling both "bilinear" and "bicubic" options
- an `assets/` folder to test it (especially images bigger and smaller than the actual ViT input size)
- a `vit_image_preprocess` signature with two extra arguments

Comments
I added the possibility to save the resized image in order to visually control the output of the interpolation (for both bicubic and bilinear modes). I tried it on several images of various sizes and ratios (they are part of the PR too).
It turns out that `scale` is computed with the max of width and height instead of having two separate scales (one for height and one for width). If that's correct, that might explain such results with images having distinct width and height.

Example
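To illustrate the earlier point about `scale`: here is a small sketch (hypothetical helper names, not the actual vit.cpp code) showing that a single scale derived from the larger dimension preserves the aspect ratio but leaves the shorter axis short of the target, while per-axis scales fill the target exactly:

```python
def single_scale_size(w, h, target):
    """One scale computed from the larger dimension (keeps aspect ratio)."""
    scale = max(w, h) / target
    return round(w / scale), round(h / scale)

# A 640x480 image resized toward a 224x224 ViT input:
size_single = single_scale_size(640, 480, 224)  # -> (224, 168), not 224x224

# With two separate scales (one per axis) the target is hit exactly:
scale_w, scale_h = 640 / 224, 480 / 224
size_per_axis = (round(640 / scale_w), round(480 / scale_h))  # -> (224, 224)
```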
The examples below were obtained with the following command (as explained in the well-designed README.md):
assets/polars.jpeg
resized_bilinear.png
resized_bicubic.png