Closed: sniklaus closed this issue 3 years ago
Thanks for raising this interesting question. My response is as follows:
I realized that the practice of "quantization" is not quite consistent across many of the SOTA implementations. For example, the AdaCoF evaluation script (https://github.com/HyeongminLEE/AdaCoF-pytorch/blob/f121ee0e8cb403216c7bd5183154dbd1cf6966f4/TestModule.py#L51-L55) and the CAIN evaluation script (https://github.com/myungsub/CAIN/blob/fff8fc321c5a76904ed2a12c9500e055d4c77256/main.py#L161-L175) compare the model output and the ground truth directly, without an extra quantization step. In my opinion, as long as the methods compared together are evaluated within the same context/framework, the comparison is fair (as you mentioned above, the quantization step only makes a very slight difference, which has negligible influence in practice).
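To make the terminology concrete: the only difference between the two evaluation styles is whether the network output is rounded to 8-bit values before the metrics are computed. A minimal sketch (the `psnr` and `quantize` helpers below are illustrative and not taken from any of the repositories mentioned here):

```python
import torch

def psnr(a, b):
    # PSNR for tensors with values in [0, 1]
    mse = torch.mean((a - b) ** 2)
    return 10.0 * torch.log10(1.0 / mse)

def quantize(img):
    # round-trip through 8-bit levels, as happens implicitly when saving a PNG
    return img.clamp(0.0, 1.0).mul(255.0).round().div(255.0)

pred = torch.rand(1, 3, 256, 448)  # stand-in for the network output, in [0, 1]
gt = torch.rand(1, 3, 256, 448)    # stand-in for the ground-truth frame

psnr_direct = psnr(pred, gt)               # direct comparison, no quantization
psnr_quantized = psnr(quantize(pred), gt)  # with the extra quantization step
print(psnr_direct.item(), psnr_quantized.item())
```

On real predictions the quantized variant is typically lower by a small fraction of a dB, which is consistent with the differences reported in the table below.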
All the quantitative results listed in Table 3 of our paper (except those marked with a dagger) were generated by me manually with the same test script, i.e. without "quantization", so I believe at least the comparisons among those methods are fair.
For your information, I did test CDFI again with the quantization step on the three benchmark datasets; see the comparison below (each entry lists PSNR, SSIM, LPIPS):
| Dataset | CDFI (w/o quantization) | CDFI (w/ quantization) |
| --- | --- | --- |
| Vimeo-90K | 35.19, 0.978, 0.010 | 35.17, 0.978, 0.010 |
| Middlebury | 37.17, 0.983, 0.008 | 37.14, 0.983, 0.007 |
| UCF101-DVF | 35.24, 0.967, 0.015 | 35.21, 0.967, 0.015 |
In these tests, the extra quantization leads to a slightly worse PSNR (by no more than 0.03 dB), has no effect on SSIM, and even yields a slightly better LPIPS on Middlebury.
To sum up, I really appreciate your comments on the "quantization" issue. Although the practice is not consistent across many of the SOTA implementations and only makes a very slight difference, we will keep this in mind in future research.
Thanks for providing the evaluation results with quantization!
> I realized that the practice of "quantization" is not quite consistent across many of the SOTA implementations.

How come you didn't re-run the affected evaluations if you were aware of this issue?

> In my opinion, as long as the methods compared together are evaluated within the same context/framework, the comparison is fair.

True, but Table 3 compares methods for which the quantization isn't consistent, and this is not obvious to the reader (in fact, the paper never mentions the difference in quantization), so the comparison is not fair there.

> All the quantitative results listed in Table 3 of our paper (except those marked with a dagger) were generated by me manually with the same test script, i.e. without "quantization", so I believe at least the comparisons among those methods are fair.

True, but half of the methods in Table 3 are marked with a dagger, so half of the methods shown there are put at a disadvantage (and this is not obvious to the reader).

> the practice is not consistent across many of the SOTA implementations

That doesn't justify not being consistent yourself just because others haven't been.
> > I realized that the practice of "quantization" is not quite consistent across many of the SOTA implementations.
>
> How come you didn't re-run the affected evaluations if you were aware of this issue?

What I meant is that I adopted the practice from AdaCoF and CAIN, which happen to compare results without such "quantization"; in other words, the "quantization" practice is not adopted everywhere.

> > In my opinion, as long as the methods compared together are evaluated within the same context/framework, the comparison is fair.
>
> True, but Table 3 compares methods for which the quantization isn't consistent, and this is not obvious to the reader (in fact, the paper never mentions the difference in quantization), so the comparison is not fair there.
>
> > All the quantitative results listed in Table 3 of our paper (except those marked with a dagger) were generated by me manually with the same test script, i.e. without "quantization", so I believe at least the comparisons among those methods are fair.
>
> True, but half of the methods in Table 3 are marked with a dagger, so half of the methods shown there are put at a disadvantage (and this is not obvious to the reader).
>
> > the practice is not consistent across many of the SOTA implementations
>
> That doesn't justify not being consistent yourself just because others haven't been.

To be honest, before you raised this issue, as a reader I never noticed this subtlety in the presentation of existing papers, whether or not they apply the "quantization". I suspect this is partly because the difference is really slight and has little practical effect. In any case, I will make it clear in the future.
> I adopted the practice from AdaCoF and CAIN

I am under the impression that CAIN does use quantization (the first thing that `calc_metrics` does is call `quantize` on the gt and the prediction): https://github.com/myungsub/CAIN/blob/09859b22741365a48510c3f531feb50f35761de8/utils.py#L208-L217
> > I adopted the practice from AdaCoF and CAIN
>
> I am under the impression that CAIN does use quantization (the first thing that `calc_metrics` does is call `quantize` on the gt and the prediction): https://github.com/myungsub/CAIN/blob/09859b22741365a48510c3f531feb50f35761de8/utils.py#L208-L217
You are right! Thanks!
Hi, we are working on another VFI method, RIFE. We recently wrote evaluation scripts for several VFI methods (https://github.com/hzwer/arXiv2020-RIFE/issues/124) and reproduced EDSC, CAIN, DAIN, BMBC, and some other methods, trying our best to verify the experimental data. Welcome to take a look.
Thanks for sharing your code! I just looked into it a little bit and it seems there is no quantization in the evaluation?
https://github.com/tding1/CDFI/blob/d7f79e502674187b7a7b645a7812fd9fa30a6608/test.py#L36-L47
However, it is common practice to quantize your interpolation estimate before computing any metrics, as shown in the examples below. If you submit results to a benchmark, like the one from Middlebury, you have to quantize the interpolation estimates in order to save them as images, so it has been the norm to quantize all results throughout the evaluation.
https://github.com/sniklaus/sepconv-slomo/blob/46041adec601a4051b86741664bb2cdc80fe4919/benchmark.py#L28 https://github.com/hzwer/arXiv2020-RIFE/blob/15cb7f2389ccd93e8b8946546d4665c9b41541a3/benchmark/Vimeo90K.py#L36 https://github.com/baowenbo/DAIN/blob/9d9c0d7b3718dfcda9061c85efec472478a3aa86/demo_MiddleBury.py#L162-L166 https://github.com/laomao0/BIN/blob/b3ec2a27d62df966cc70880bb3d13dcf147f7c39/test.py#L406-L410
The reason why this is important is that the quantization step has a negative impact on the metrics. So if one does not quantize the results of their method before computing the metrics, while the results of the other methods had the quantization step in place, then the evaluation is slightly biased. Would you hence be able to share the evaluation metrics for CDFI with the quantization? This would greatly benefit future work that compares to CDFI, helping it avoid this bias. And thanks again for sharing your code!
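As a side note on the Middlebury point above: writing a float-valued prediction to an 8-bit image file and reading it back performs exactly the rounding that an explicit quantization step would, which is why benchmark submissions are implicitly quantized. A small sketch of this equivalence (the file name and array shape are made up, and `imageio` is just one possible choice for the image I/O):

```python
import numpy as np
import imageio

pred = np.random.rand(256, 448, 3).astype(np.float32)  # stand-in for a float prediction in [0, 1]

# saving for a benchmark submission forces the prediction onto 8-bit levels
imageio.imwrite('pred.png', (pred * 255.0).round().astype(np.uint8))
pred_from_disk = imageio.imread('pred.png').astype(np.float32) / 255.0

# the file round-trip is equivalent to quantizing in memory
pred_quantized = np.round(pred * 255.0).astype(np.float32) / 255.0
assert np.allclose(pred_from_disk, pred_quantized)
```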