staghado / vit.cpp

Inference Vision Transformer (ViT) in plain C/C++ with ggml
MIT License

Bicubic Interpolation #7

Closed · mehdi-elion closed this 9 months ago

mehdi-elion commented 10 months ago

Goal of this PR

This PR aims to enable bicubic interpolation for image resizing. This should let the model run inference under conditions similar to the ones it saw during training and, hopefully, also help performance.
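To make the idea concrete, here is a minimal, self-contained sketch of bicubic sampling with the standard cubic convolution kernel (a = -0.5). This only illustrates the technique, not the code of this PR; `cubic_weight` and `sample_bicubic` are hypothetical names.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Cubic convolution (Keys) weight for a sample at signed distance x, with a = -0.5.
static float cubic_weight(float x) {
    const float a = -0.5f;
    x = std::fabs(x);
    if (x < 1.0f) return ((a + 2.0f) * x - (a + 3.0f)) * x * x + 1.0f;
    if (x < 2.0f) return ((a * x - 5.0f * a) * x + 8.0f * a) * x - 4.0f * a;
    return 0.0f;
}

// Sample a single-channel float image of size w x h at the fractional
// position (fx, fy), using the 4x4 neighborhood around it.
static float sample_bicubic(const std::vector<float> & img, int w, int h, float fx, float fy) {
    const int x0 = (int) std::floor(fx);
    const int y0 = (int) std::floor(fy);
    float acc = 0.0f, wsum = 0.0f;
    for (int dy = -1; dy <= 2; ++dy) {
        for (int dx = -1; dx <= 2; ++dx) {
            const int xi = std::clamp(x0 + dx, 0, w - 1); // clamp at the borders
            const int yi = std::clamp(y0 + dy, 0, h - 1);
            const float wxy = cubic_weight(fx - (float)(x0 + dx)) * cubic_weight(fy - (float)(y0 + dy));
            acc  += wxy * img[yi * w + xi];
            wsum += wxy;
        }
    }
    // the 16 weights sum to 1 for this kernel, so the division is only a numeric safety net
    return acc / wsum;
}
```

Resizing then boils down to evaluating `sample_bicubic`, per channel, at the back-projected coordinates of every output pixel.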

Content of this PR

This PR proposes:

Comments

I added the possibility to save the resized image so the output of the interpolation can be visually inspected (for both the bicubic and bilinear modes). I tried it on several images of various sizes and aspect ratios (they are part of the PR too).
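For what it's worth, such a debug dump can be as small as the helper below. This is a hedged sketch that assumes the single-header stb_image_write.h is available in the project; the helper name is hypothetical and this is not necessarily how the PR does it.

```cpp
// in exactly one .cpp file:
#define STB_IMAGE_WRITE_IMPLEMENTATION
#include "stb_image_write.h"

#include <cstdint>
#include <vector>

// pixels: interleaved RGB, 3 bytes per pixel, row-major, of size 3 * w * h.
static bool save_resized_debug(const char * path, const std::vector<uint8_t> & pixels, int w, int h) {
    // last argument is the stride in bytes between consecutive rows
    return stbi_write_png(path, w, h, /*comp=*/3, pixels.data(), 3 * w) != 0;
}
```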

It turns out:

Example

The examples below were obtained with the following command (as explained in the well-designed README.md):

./bin/vit -t 4 -m ../ggml-model-f16.gguf -i ../assets/polars.jpeg 

assets/polars.jpeg (original input)

resized_bilinear.png (bilinear output)

resized_bicubic.png (bicubic output)

mehdi-elion commented 10 months ago

Hello @staghado! I tried to implement bicubic interpolation for image resizing. If I'm not mistaken, it still appears on your TODO list, so I figured it could help :) I tried to be as thorough as possible in the description, but don't hesitate to reach out to me for discussion ;)

Thanks again for this great repo (to which I'd be more than happy to contribute) 👌

staghado commented 10 months ago

Hello @mehdi-elion,

First of all, thank you for your great work. It is indeed a necessary thing to add, so that the transformation used at training time matches the one used at inference time.

I will delve into the details of the implementation, carry out some tests, and then get back to you soon!

staghado commented 10 months ago

I have conducted some tests on my end:

But to be fair, it seems there are differences between OpenCV and PIL too:


mehdi-elion commented 10 months ago

Hi @staghado,

Thank you very much for your feedback and the tests you've carried out. The results you shared are very interesting, and I think it's worth having a look at :) I'll try to investigate a bit based on your feedback & results and I'll come back to you then 👌

mehdi-elion commented 9 months ago

Hello @staghado! Sorry for the late reply. I eventually investigated a bit to complement your last comment :)

First of all, I ran the same tests as you did and found the same results:

I added a few plots and metrics to visualize and quantify those differences:

Here are some examples of such plots & metrics (but you can find all of them here)

[polars_differences comparison plot]

It seems like the difference between torch and vit.cpp is slightly larger than the difference between torch and cv2, even though both are rather negligible with respect to the overall pixel values (hopefully ^^’).
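For reference, the kind of metric used here (mean and max absolute per-channel difference between two same-sized images) can be computed along these lines; the names are illustrative and this is not the exact script used for the plots.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <vector>

struct diff_stats {
    double mean_abs; // mean absolute difference over all values
    int    max_abs;  // worst-case absolute difference
};

// a and b are two images with identical size and layout (e.g. interleaved RGB).
static diff_stats pixel_diff(const std::vector<uint8_t> & a, const std::vector<uint8_t> & b) {
    diff_stats s{0.0, 0};
    for (size_t i = 0; i < a.size() && i < b.size(); ++i) {
        const int d = std::abs((int) a[i] - (int) b[i]);
        s.mean_abs += d;
        s.max_abs   = std::max(s.max_abs, d);
    }
    if (!a.empty()) s.mean_abs /= (double) a.size();
    return s;
}
```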

After examining the difference plots, it seems like most of the difference lies along edges. Our implementation seems to render sharp shapes better than torch does. See the armadillo example below (the stripes on its scales are rendered more sharply than with torch).

(from left to right: original, vit.cpp, torch)

A quick look at PIL's GitHub repo led me to these snippets (I'm not 100% sure they are the source code of the bicubic implementation we were looking for, but they're the closest I've found so far):

The last one suggests that PIL is using the bicubic convolution algorithm, which can be seen (if I understood correctly) as an approximation of the original bicubic interpolation algorithm.
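In other words, the bicubic convolution algorithm is a one-parameter kernel family, so two libraries can both be "bicubic" and still differ simply by choosing a different value of the parameter. Below is a minimal sketch of the same weight function as in the PR description, with a left as a parameter; the specific values mentioned in the comment are my reading of the respective sources and should be double-checked rather than taken as fact.

```cpp
#include <cmath>

// Cubic convolution kernel W_a(x). From a quick read of the sources, PIL's
// bicubic filter appears to use a = -0.5, while torch and OpenCV appear to
// use a = -0.75 (assumption to verify); that difference alone would produce
// small per-pixel discrepancies between otherwise identical "bicubic" resizes.
static float cubic_conv_weight(float x, float a) {
    x = std::fabs(x);
    if (x < 1.0f) return ((a + 2.0f) * x - (a + 3.0f)) * x * x + 1.0f;
    if (x < 2.0f) return ((a * x - 5.0f * a) * x + 8.0f * a) * x - 4.0f * a;
    return 0.0f;
}
```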

All in all, if I sum things up by going through your hypotheses one by one:

If that's correct, it means that each library is likely to have its own variant of bicubic interpolation, with their respective results being fairly similar (even though not strictly identical).

staghado commented 9 months ago

Fantastic stuff here, I enjoyed reading your analysis; it is really thorough and provides clear insights. I feel we now have a better understanding of what is done in the different libraries, with some using convolution variants while others use equations based on neighboring pixels.

I am okay with merging the current implementation of bicubic interpolation. We can verify later whether it affects performance too much.

I have been working on a benchmarking script to assess performance on the ImageNet-1k dataset. I can try both bicubic and bilinear and see what results we get.

mehdi-elion commented 9 months ago

Thank you very much for your feedback on the analysis, I'm glad it helped 👍 I'd be happy to have it merged and to run your benchmark on it (and on the bilinear interpolation) as you suggested.

If you want, I can submit an extra commit to delete the part of the code that outputs the resized image: it was only for debugging, and removing it will save some memory.

Let me know if that's ok for you : )

staghado commented 9 months ago

Yes, you can make the last adjustments before merging.

mehdi-elion commented 9 months ago

I just pushed the commit removing the debug-related part (which outputs the resized image). Let me know if you need any extra adjustments before you merge it ;)

staghado commented 9 months ago

Done!