photosynthesis-team / piq

Measures and metrics for image2image tasks. PyTorch.

Can't understand the benchmark table #354

Closed jasony93 closed 1 year ago

jasony93 commented 1 year ago

First of all, I appreciate this great work.

I was looking at the benchmark table where you use SRCC values obtained with PIQ and reported in surveys.

There are two values in each cell, and from my understanding the first value is the SRCC calculated between the PIQ metric and the surveys.

If that is the case, what does the second value represent?

Is it SRCC calculated from survey AND survey...?

Or have I not understood correctly?

Thanks in advance.

denproc commented 1 year ago

Hi @jasony93, thank you for raising the question. As you pointed out, each cell contains two SRCC values. The first value is the SRCC estimated with the PIQ implementation of the metric on one of the available benchmark datasets, using the command in the benchmark section. The second value is a reference SRCC value reported in the original paper describing the method or in evaluation surveys. With these two values, we show that the PIQ implementation matches, or is close to, the performance reported in the paper/survey or by the original implementations in MATLAB.
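For illustration, here is a minimal sketch (not the actual benchmark script) of how the first value can be reproduced: score each image with a PIQ metric and compute Spearman's rank correlation against the human scores. The loader `load_tid2013` is a hypothetical placeholder for whatever dataset loading you use.

```python
# Sketch: SRCC between a PIQ metric and human MOS over a benchmark dataset.
import torch
import piq
from scipy import stats

def srcc_for_metric(images: torch.Tensor, mos: torch.Tensor) -> float:
    """images: (N, 3, H, W) tensors in [0, 1]; mos: (N,) human opinion scores."""
    scores = []
    with torch.no_grad():
        for img in images:
            # No-reference metric here; full-reference metrics would also take the clean image.
            scores.append(piq.brisque(img.unsqueeze(0), data_range=1.0).item())
    rho, _ = stats.spearmanr(scores, mos.numpy())
    return rho

# Hypothetical usage:
# images, mos = load_tid2013("~/datasets/tid2013")  # placeholder loader
# print(srcc_for_metric(images, mos))
```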

Hope this helps you navigate the benchmark results. Otherwise, just let us know about any other concerns you may have.

jasony93 commented 1 year ago

It helped, but honestly I am still confused.

When I read about what SRCC is, I understood it as a rank correlation between two variables,

and in this case, I think the two variables are: one is people's evaluation and the other is whatever metric we want to use.

So, the first value is an SRCC estimation using the PIQ implementation AND people's evaluation...?

and the second value is an SRCC estimation using the method proposed in the original paper AND people's evaluation...?

Another question on my mind: in the case of BRISQUE, there seems to be a large gap between the two values, which suggests the two methods tend to disagree. Is this a good sign, because the PIQ implementation improved on the traditional method, or a bad sign, because the PIQ implementation was not able to reproduce the behavior of the traditional method?

denproc commented 1 year ago

Yes, it is a score showing the rank correlation between quality assessments by a metric and human-based evaluation.

Image quality assessment datasets usually contain images (clean and corrupted with different distortions) which are evaluated by people. The human evaluation is aggregated into Mean Opinion Scores (MOS) or Differential Mean Opinion Scores (DMOS) to rank the images.
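As a toy illustration (not from the repository): SRCC only compares rankings, so a metric that orders images the same way as the MOS gets SRCC = 1 even if its scale is completely different.

```python
# Toy example: Spearman correlation depends only on the ranking, not the scale.
from scipy import stats

mos           = [80.0, 65.0, 40.0, 20.0]   # hypothetical human scores (higher = better)
metric_scores = [0.95, 0.90, 0.60, 0.10]   # hypothetical metric on a different scale

rho, _ = stats.spearmanr(metric_scores, mos)
print(rho)  # 1.0: identical ranking, perfect rank correlation
```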

Therefore, the first value in our table shows the rank correlation (SRCC) between assessments by the PIQ implementation of a metric and the human-based MOS (DMOS). The second value shows the correlation between the originally proposed metric and the same human-based MOS. Having these two SRCC estimations, we can show that our implementation performs similarly to the original metric.

Moving to the BRISQUE case, the difference in SRCC values might suggest dissimilarities between the implementations. There is a reason why the BRISQUE case is expected to perform differently. The original BRISQUE is implemented in MATLAB, which we reimplemented in Python and PyTorch. Unfortunately, MATLAB uses a different resize algorithm compared to the ones used in Python (OpenCV, PyTorch), introducing additional dissimilarity in performance. More details are available here. As a result, some of the available Python implementations of BRISQUE cannot deliver the performance of the original MATLAB version, which is often not communicated in the description of those implementations.
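For example, here is a hedged sketch (assuming a PyTorch version recent enough to expose the `antialias` flag in `F.interpolate`) of how the choice of resize alone can shift BRISQUE scores and, in turn, the SRCC:

```python
# Sketch: different resize settings produce slightly different pixels,
# which shifts BRISQUE's internal statistics and hence its scores/ranking.
import torch
import torch.nn.functional as F
import piq

x = torch.rand(1, 3, 256, 256)  # random stand-in image in [0, 1]

# MATLAB's imresize applies antialiasing by default; many Python resizes do not.
down_no_aa = F.interpolate(x, scale_factor=0.5, mode="bilinear", antialias=False)
down_aa    = F.interpolate(x, scale_factor=0.5, mode="bilinear", antialias=True)

print(piq.brisque(down_no_aa, data_range=1.0).item())
print(piq.brisque(down_aa, data_range=1.0).item())
# The two scores differ, illustrating how the resize choice alone can move the metric.
```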

Preparing our release, we tried to make the PIQ implementation as close as possible to the original by implementing the resize function from scratch to match the MATLAB implementation. However, we are still working on improving the implementation to match the original one. To make sure that users know about potential dissimilarities, we report the SRCC for the current PIQ implementations vs. the original implementations of the metrics.

jasony93 commented 1 year ago

That helped me a lot. Thank you very much!