grip-unina / TruFor


F1 score inconsistency #3

Closed cxs-ux closed 1 year ago

cxs-ux commented 1 year ago

You have done an excellent job, and I have a question. For the four datasets CASIAv1+, Coverage, Columbia, and NIST16, the F1 scores reported in the MVSS-Net paper are inconsistent with the corresponding values in your paper.

fabrizioguillaro commented 1 year ago

Thank you for your interest in our work! Your question brings up an interesting point. There are several reasons why our numbers do not always match the ones reported in the corresponding papers. First of all, we did not copy the numbers from their papers: to make the comparison fair, we used the exact same strategy to compute the metrics for every method. That is also why only methods with publicly available code are included in those tables.

Blindly taking the values from the publications would not be fair and would certainly lead to a wrong comparison, since different methods use different strategies to compute the metrics (e.g. the F1 score).

You may understand that it makes no sense to compare numbers computed with different strategies (many papers simply copy all these F1 values from one another, and this generates some confusion).
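To make this concrete, here is a minimal sketch (synthetic data, not code from this repository) of two common ways of turning a localization heatmap into an F1 score: thresholding at a fixed 0.5 versus picking the best threshold per image. All names and numbers below are purely illustrative.

```python
import numpy as np


def f1(pred: np.ndarray, gt: np.ndarray) -> float:
    """Pixel-level F1 between two boolean masks."""
    tp = np.sum(pred & gt)
    fp = np.sum(pred & ~gt)
    fn = np.sum(~pred & gt)
    return 2 * tp / (2 * tp + fp + fn + 1e-9)


rng = np.random.default_rng(0)
gt = np.zeros((256, 256), dtype=bool)
gt[64:128, 64:128] = True                          # synthetic forged region

# Synthetic anomaly heatmap: moderate response on the forgery, noise elsewhere.
heatmap = 0.3 * rng.random((256, 256))
heatmap[gt] += 0.3

f1_fixed = f1(heatmap > 0.5, gt)                   # strategy 1: fixed threshold 0.5
f1_best = max(f1(heatmap > t, gt)                  # strategy 2: best threshold per image
              for t in np.linspace(0.0, 1.0, 101))

print(f"F1 at fixed threshold 0.5: {f1_fixed:.3f}")
print(f"F1 at best threshold     : {f1_best:.3f}")
```

On this toy heatmap the two strategies already differ by roughly a factor of two, even though the prediction map and the ground truth are identical in both cases.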

In the specific case of MVSS-Net, the main differences concern how the F1 metric is computed and which variants of the datasets are used.

As you can imagine, changing the way the F1 metric is computed, or using different variants of the datasets, drastically changes the final value.
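As a second hedged illustration (again synthetic data, not taken from this repository), even with identical binarised maps the dataset-level number depends on how per-image scores are aggregated: averaging per-image F1 values and computing a single F1 over all pooled pixels are both common in the literature and can give noticeably different results.

```python
import numpy as np


def f1(pred: np.ndarray, gt: np.ndarray) -> float:
    """Pixel-level F1 between two boolean masks."""
    tp = np.sum(pred & gt)
    fp = np.sum(pred & ~gt)
    fn = np.sum(~pred & gt)
    return 2 * tp / (2 * tp + fp + fn + 1e-9)


rng = np.random.default_rng(1)

# One small, poorly localised image and one large, well localised image (synthetic).
masks = []
for size, error_rate in ((32, 0.4), (256, 0.02)):
    gt = rng.random((size, size)) < 0.2                   # ~20% forged pixels
    pred = gt ^ (rng.random((size, size)) < error_rate)   # flip a fraction of pixels
    masks.append((pred, gt))

macro = np.mean([f1(p, g) for p, g in masks])             # average of per-image F1
micro = f1(np.concatenate([p.ravel() for p, _ in masks]),
           np.concatenate([g.ravel() for _, g in masks])) # F1 over pooled pixels

print(f"average of per-image F1: {macro:.3f}")
print(f"F1 over pooled pixels  : {micro:.3f}")
```

The pooled score is dominated by the large image, while the per-image average weights both images equally, so the two aggregations diverge whenever image sizes or difficulty vary across the dataset.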

In this repository, the file test_docker/metrics.py contains the functions we used to compute the metrics.