staghado / vit.cpp

Inference Vision Transformer (ViT) in plain C/C++ with ggml
MIT License

Bicubic Interpolation #7

Closed · mehdi-elion closed this 9 months ago

mehdi-elion commented 10 months ago

Goal of this PR

This PR aims to enable bicubic interpolation for image resizing. This should let the model run inference under conditions similar to the ones it saw during training and, hopefully, also help performance.
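To make the idea concrete, here is a minimal, self-contained sketch of bicubic sampling with the standard cubic convolution kernel (a = -0.5). This only illustrates the technique, not the code of this PR; `cubic_weight` and `sample_bicubic` are hypothetical names.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Cubic convolution (Keys) weight for a sample at signed distance x, with a = -0.5.
static float cubic_weight(float x) {
    const float a = -0.5f;
    x = std::fabs(x);
    if (x < 1.0f) return ((a + 2.0f) * x - (a + 3.0f)) * x * x + 1.0f;
    if (x < 2.0f) return ((a * x - 5.0f * a) * x + 8.0f * a) * x - 4.0f * a;
    return 0.0f;
}

// Sample a single-channel float image of size w x h at the fractional
// position (fx, fy), using the 4x4 neighborhood around it.
static float sample_bicubic(const std::vector<float> & img, int w, int h, float fx, float fy) {
    const int x0 = (int) std::floor(fx);
    const int y0 = (int) std::floor(fy);
    float acc = 0.0f, wsum = 0.0f;
    for (int dy = -1; dy <= 2; ++dy) {
        for (int dx = -1; dx <= 2; ++dx) {
            const int xi = std::clamp(x0 + dx, 0, w - 1); // clamp at the borders
            const int yi = std::clamp(y0 + dy, 0, h - 1);
            const float wxy = cubic_weight(fx - (float)(x0 + dx)) * cubic_weight(fy - (float)(y0 + dy));
            acc  += wxy * img[yi * w + xi];
            wsum += wxy;
        }
    }
    // the 16 weights sum to 1 for this kernel, so the division is only a numeric safety net
    return acc / wsum;
}
```

Resizing then boils down to evaluating `sample_bicubic`, per channel, at the back-projected coordinates of every output pixel.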

Content of this PR

This PR proposes:

Comments

I added the possibility to save the resized image so the output of the interpolation can be visually inspected (for both the bicubic and bilinear modes). I tried it on several images of various sizes and aspect ratios (they are part of the PR too).
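For what it's worth, such a debug dump can be as small as the helper below. This is a hedged sketch that assumes the single-header stb_image_write.h is available in the project; the helper name is hypothetical and this is not necessarily how the PR does it.

```cpp
// in exactly one .cpp file:
#define STB_IMAGE_WRITE_IMPLEMENTATION
#include "stb_image_write.h"

#include <cstdint>
#include <vector>

// pixels: interleaved RGB, 3 bytes per pixel, row-major, of size 3 * w * h.
static bool save_resized_debug(const char * path, const std::vector<uint8_t> & pixels, int w, int h) {
    // last argument is the stride in bytes between consecutive rows
    return stbi_write_png(path, w, h, /*comp=*/3, pixels.data(), 3 * w) != 0;
}
```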

It turns out:

Example

The examples below were obtained with the following command (as explained in the well-designed README.md):

./bin/vit -t 4 -m ../ggml-model-f16.gguf -i ../assets/polars.jpeg 

assets/polars.jpeg (original input)

resized_bilinear.png (bilinear output)

resized_bicubic.png (bicubic output)

mehdi-elion commented 10 months ago

Hello @staghado! I tried to implement bicubic interpolation for image resizing. If I'm not mistaken, it still appears on your TODO list, so I figured it could help :) I tried to be as thorough as possible in the description, but don't hesitate to reach out to me for discussion ;)

Thanks again for this great repo (to which I'd be more than happy to contribute) 👌

staghado commented 10 months ago

Hello @mehdi-elion,

First of all, thank you for your great work. It is indeed a necessary thing to add, so that the transformation used at training time matches the one used at inference time.

I will delve into the details of the implementation, carry out some tests, and then get back to you soon!

staghado commented 10 months ago

I have conducted some tests on my end:

But to be fair, it seems there are differences between OpenCV and PIL too:


mehdi-elion commented 10 months ago

Hi @staghado,

Thank you very much for your feedback and the tests you've carried out. The results you shared are very interesting, and I think it's worth having a look at :) I'll try to investigate a bit based on your feedback & results and I'll come back to you then 👌

mehdi-elion commented 9 months ago

Hello @staghado! Sorry for the late reply. I eventually investigated a bit to complement your last comment :)

First of all, I ran the same tests as you did and found the same results:

I added a few plots and metrics to visualize and quantify those differences:

Here are some examples of such plots & metrics (but you can find all of them here)

[polars_differences comparison plot]

It seems like the difference between torch and vit.cpp is slightly larger than the difference between torch and cv2, even though both are rather negligible with respect to the overall pixel values (hopefully ^^’).
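For reference, the kind of metric used here (mean and max absolute per-channel difference between two same-sized images) can be computed along these lines; the names are illustrative and this is not the exact script used for the plots.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <vector>

struct diff_stats {
    double mean_abs; // mean absolute difference over all values
    int    max_abs;  // worst-case absolute difference
};

// a and b are two images with identical size and layout (e.g. interleaved RGB).
static diff_stats pixel_diff(const std::vector<uint8_t> & a, const std::vector<uint8_t> & b) {
    diff_stats s{0.0, 0};
    for (size_t i = 0; i < a.size() && i < b.size(); ++i) {
        const int d = std::abs((int) a[i] - (int) b[i]);
        s.mean_abs += d;
        s.max_abs   = std::max(s.max_abs, d);
    }
    if (!a.empty()) s.mean_abs /= (double) a.size();
    return s;
}
```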

After examining the difference plots, it seems like most of the difference lies along edges. Our implementation seems to render sharp shapes better than torch does. See the armadillo example below (the stripes on its scales are rendered more sharply than with torch).

(from left to right: original, vit.cpp, torch)

A quick look at PIL's GitHub repo led me to these snippets (I'm not 100% sure they are the source code of the bicubic implementation we were looking for, but they're the closest I've found so far):

The last one suggests that PIL is using the bicubic convolution algorithm, which can be seen (if I understood correctly) as an approximation of the original bicubic interpolation algorithm.
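In other words, the bicubic convolution algorithm is a one-parameter kernel family, so two libraries can both be "bicubic" and still differ simply by choosing a different value of the parameter. Below is a minimal sketch of the same weight function as in the PR description, with a left as a parameter; the specific values mentioned in the comment are my reading of the respective sources and should be double-checked rather than taken as fact.

```cpp
#include <cmath>

// Cubic convolution kernel W_a(x). From a quick read of the sources, PIL's
// bicubic filter appears to use a = -0.5, while torch and OpenCV appear to
// use a = -0.75 (assumption to verify); that difference alone would produce
// small per-pixel discrepancies between otherwise identical "bicubic" resizes.
static float cubic_conv_weight(float x, float a) {
    x = std::fabs(x);
    if (x < 1.0f) return ((a + 2.0f) * x - (a + 3.0f)) * x * x + 1.0f;
    if (x < 2.0f) return ((a * x - 5.0f * a) * x + 8.0f * a) * x - 4.0f * a;
    return 0.0f;
}
```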

All in all, if I sum things up by going through your hypotheses one by one:

If that's correct, it means that each library is likely to have its own variant of bicubic interpolation, with their respective results being fairly similar (even though not strictly identical).

staghado commented 9 months ago

Fantastic stuff here, I enjoyed reading your analysis; it is really thorough and provides clear insights. I feel we now have a better understanding of what is done in the different libraries, with some using convolution variants while others use equations based on neighboring pixels.

I am okay with merging the current implementation of bicubic interpolation. We can verify later whether it affects performance too much.

I have been working on a benchmarking script to assess performance on the ImageNet-1k dataset. I can try both bicubic and bilinear and see what results we get.

mehdi-elion commented 9 months ago

Thank you very much for your feedback on the analysis, I'm glad it helped 👍 I'd be happy to have it merged and to run your benchmark on it (and on the bilinear interpolation) as you suggested.

If you want, I can submit an extra commit to delete the part of the code that outputs the resized image: it was only for debugging, and removing it will save some memory.

Let me know if that's ok for you : )

staghado commented 9 months ago

Yes, you can make the last adjustments before merging.

mehdi-elion commented 9 months ago

I just pushed the commit removing the debug-related part (which outputs the resized image). Let me know if you need any extra adjustments before you merge it ;)

staghado commented 9 months ago

Done!