tding1 / CDFI

[CVPR 2021] CDFI: Compression-Driven Network Design for Frame Interpolation
https://arxiv.org/abs/2103.10559

quantization in evaluation #1

Closed: sniklaus closed this issue 3 years ago

sniklaus commented 3 years ago

Thanks for sharing your code! I just looked into it a little bit and it seems there is no quantization in the evaluation?

https://github.com/tding1/CDFI/blob/d7f79e502674187b7a7b645a7812fd9fa30a6608/test.py#L36-L47

However, it is common practice to quantize the interpolation estimate before computing any metrics, as shown in the examples below. If you submit results to a benchmark, like the one from Middlebury, you have to quantize the interpolation estimates in order to save them as images, so it has been the norm to quantize all results throughout the evaluation.

https://github.com/sniklaus/sepconv-slomo/blob/46041adec601a4051b86741664bb2cdc80fe4919/benchmark.py#L28
https://github.com/hzwer/arXiv2020-RIFE/blob/15cb7f2389ccd93e8b8946546d4665c9b41541a3/benchmark/Vimeo90K.py#L36
https://github.com/baowenbo/DAIN/blob/9d9c0d7b3718dfcda9061c85efec472478a3aa86/demo_MiddleBury.py#L162-L166
https://github.com/laomao0/BIN/blob/b3ec2a27d62df966cc70880bb3d13dcf147f7c39/test.py#L406-L410
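
For reference, this is roughly what such a quantization step looks like (a minimal sketch of my own in PyTorch, not taken from any of the linked scripts; the helper names and the [0, 1] value range are assumptions):

```python
import torch

def quantize(frame: torch.Tensor) -> torch.Tensor:
    # Snap a float frame in [0, 1] to the nearest 8-bit value and map it back,
    # mimicking what happens when the estimate is saved as an image and reloaded.
    return frame.clamp(0.0, 1.0).mul(255.0).round().div(255.0)

def psnr(pred: torch.Tensor, gt: torch.Tensor) -> float:
    # PSNR in dB for frames in [0, 1]; the estimate is quantized before the metric.
    mse = torch.mean((quantize(pred) - gt) ** 2)
    return (10.0 * torch.log10(1.0 / mse)).item()
```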

The reason why this is important is that the quantization step has a negative impact on the metrics. So if one does not quantize the results of their method before computing the metrics while the results from other methods had the quantization step in place, then the evaluation is slightly biased. Would you hence be able to share the evaluation metrics for CDFI with the quantization? This would greatly benefit future work that compares to CDFI to avoid this bias. And thanks again for sharing your code!

tding1 commented 3 years ago

Thanks for pointing out this interesting question. My response is as follows:

  1. I realized that the practice of "quantization" is not quite consistent in many of the SOTA implementations. For example, the AdaCoF evaluation script: https://github.com/HyeongminLEE/AdaCoF-pytorch/blob/f121ee0e8cb403216c7bd5183154dbd1cf6966f4/TestModule.py#L51-L55 and the CAIN evaluation script: https://github.com/myungsub/CAIN/blob/fff8fc321c5a76904ed2a12c9500e055d4c77256/main.py#L161-L175 directly compare the model output and the ground truth without an extra quantization step. In my opinion, as long as the methods compared together are within the same context/framework, it is a fair game (the quantization step only results in a very slight difference, as you mentioned above, which has negligible influence in practice).

  2. All the quantitative results (except for those marked with a dagger) listed in Table 3 of our paper were generated by me manually with the same test script, namely without "quantization", so I believe at least the comparisons among those methods are fair.

  3. For your information, I did test CDFI again with the quantization step on the three benchmark datasets; see the comparison below (each entry reports PSNR / SSIM / LPIPS):
     | Dataset | CDFI (w/o quantization) | CDFI (w/ quantization) |
     | --- | --- | --- |
     | Vimeo-90K | 35.19, 0.978, 0.010 | 35.17, 0.978, 0.010 |
     | Middlebury | 37.17, 0.983, 0.008 | 37.14, 0.983, 0.007 |
     | UCF101-DVF | 35.24, 0.967, 0.015 | 35.21, 0.967, 0.015 |

     In these tests, the extra quantization seems to lead to slightly worse PSNR (by no more than 0.03 dB), while it has no effect on SSIM and even results in a better LPIPS for the evaluation on Middlebury.
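
(As a rough sanity check that is my own addition rather than part of the original discussion: if the 8-bit quantization error is modeled as uniform noise with variance Δ²/12, Δ = 1/255, added on top of the existing reconstruction error, the expected PSNR drop at these score levels indeed comes out around 0.02 to 0.03 dB.)

```python
import math

def expected_psnr_drop(psnr_db: float, delta: float = 1.0 / 255.0) -> float:
    # Baseline MSE implied by the given PSNR, assuming signal values in [0, 1].
    mse = 10.0 ** (-psnr_db / 10.0)
    # Variance of uniform quantization noise with step size delta.
    q_var = delta ** 2 / 12.0
    # PSNR loss after adding the (assumed independent) quantization noise.
    return 10.0 * math.log10(1.0 + q_var / mse)

print(expected_psnr_drop(35.19))  # ~0.018 dB, close to the observed Vimeo-90K drop
print(expected_psnr_drop(37.17))  # ~0.029 dB, close to the observed Middlebury drop
```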

To sum up, I really appreciate your comments on the "quantization" issue. Although it is not so consistent in many of the SOTA implementations and only makes a very slight difference, we will keep this in mind in future research.

sniklaus commented 3 years ago

Thanks for providing the evaluation results with quantization!

> I realized that the practice of "quantization" is not quite consistent in many of the SOTA implementations.

How come you didn't re-run the affected evaluations then if you were aware of this issue?

> In my opinion, as long as the methods compared together are within the same context/framework, it is a fair game.

True, but Table 3 compares methods for which the quantization isn't consistent, and this issue is not obvious to the reader (in fact, the paper never mentions the difference in quantization), so no fair game there.

> All the quantitative results (except for those marked with a dagger) listed in Table 3 of our paper were generated by me manually with the same test script, namely without "quantization", so I believe at least the comparisons among those methods are fair.

True, but half of the methods in Table 3 are marked with a dagger, so half of the methods shown there are put at a disadvantage (and this is not obvious to the reader).

> it is not so consistent in many of the SOTA implementations

That doesn't justify not being consistent yourself, just because others haven't been.

tding1 commented 3 years ago

> > I realized that the practice of "quantization" is not quite consistent in many of the SOTA implementations.

> How come you didn't re-run the affected evaluations then if you were aware of this issue?

What I meant here is that I adopted the practice from AdaCoF and CAIN, which happen to compare results without such "quantization"; in other words, the "quantization" practice is not adopted everywhere.

> > In my opinion, as long as the methods compared together are within the same context/framework, it is a fair game.

> True, but Table 3 compares methods for which the quantization isn't consistent, and this issue is not obvious to the reader (in fact, the paper never mentions the difference in quantization), so no fair game there.

> > All the quantitative results (except for those marked with a dagger) listed in Table 3 of our paper were generated by me manually with the same test script, namely without "quantization", so I believe at least the comparisons among those methods are fair.

> True, but half of the methods in Table 3 are marked with a dagger, so half of the methods shown there are put at a disadvantage (and this is not obvious to the reader).

> > it is not so consistent in many of the SOTA implementations

> That doesn't justify not being consistent yourself, just because others haven't been.

To be honest, before you came to me with this issue, I, as a reader, never realized such a subtlety from the presentation of the existing papers, whether or not they do the "quantization". I conjecture that this is partially because the difference is really slight and has no actual effect in practice. In any case, I will make it clear in the future.

sniklaus commented 3 years ago

> I adopted the practice from AdaCoF and CAIN

I am under the impression that CAIN does use quantization (the first thing calc_metrics does is call quantize on the gt and the prediction): https://github.com/myungsub/CAIN/blob/09859b22741365a48510c3f531feb50f35761de8/utils.py#L208-L217
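
For illustration, this is roughly the pattern the linked helper follows; the sketch below is paraphrased with assumed names and value ranges, not the actual CAIN code:

```python
import torch

def quantize(img: torch.Tensor) -> torch.Tensor:
    # Snap a float image in [0, 1] to the 8-bit grid.
    return img.clamp(0.0, 1.0).mul(255.0).round().div(255.0)

def calc_metrics_sketch(pred: torch.Tensor, gt: torch.Tensor) -> float:
    # Both the prediction and the ground truth are quantized before comparison,
    # so the metric reflects what an 8-bit saved image would score.
    pred, gt = quantize(pred), quantize(gt)
    mse = torch.mean((pred - gt) ** 2)
    return (10.0 * torch.log10(1.0 / mse)).item()  # PSNR in dB
```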

tding1 commented 3 years ago

> > I adopted the practice from AdaCoF and CAIN

> I am under the impression that CAIN does use quantization (the first thing calc_metrics does is call quantize on the gt and the prediction): https://github.com/myungsub/CAIN/blob/09859b22741365a48510c3f531feb50f35761de8/utils.py#L208-L217

You are right! Thanks!

hzwer commented 3 years ago

Hi, we are working on another VFI method, RIFE, and we recently wrote evaluation scripts for several VFI methods: https://github.com/hzwer/arXiv2020-RIFE/issues/124. We reproduced EDSC, CAIN, DAIN, BMBC, and some other methods, and we tried our best to verify the experimental data. You are welcome to have a look.