grip-unina / TruFor


F1 score inconsistency #3

Closed cxs-ux closed 1 year ago

cxs-ux commented 1 year ago

You have done an excellent job, and I have a question. For the four datasets CASIAv1+, Coverage, Columbia, and NIST16, the F1 scores reported in the MVSS-Net paper are inconsistent with the corresponding values in your paper.

fabrizioguillaro commented 1 year ago

Thank you for your interest in our work! Your question brings up an interesting point. There are several reasons why our numbers do not always match the ones reported in the corresponding papers. First of all, we did not copy the numbers from their papers: to make the comparison fair, we used the exact same strategy to compute the metrics for every method. That is also why only methods with publicly available code are included in those tables.

Blindly taking the values from the publications would not be fair and would certainly lead to a wrong comparison, since different methods use different strategies to compute the metrics (e.g. the F1 score).

You may understand that it makes no sense to compare numbers computed with different strategies (many papers simply copy all these F1 values from one another, and this generates some confusion).
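To make this concrete, here is a minimal sketch (synthetic data, not code from this repository) of two common ways of turning a localization heatmap into an F1 score: thresholding at a fixed 0.5 versus picking the best threshold per image. All names and numbers below are purely illustrative.

```python
import numpy as np


def f1(pred: np.ndarray, gt: np.ndarray) -> float:
    """Pixel-level F1 between two boolean masks."""
    tp = np.sum(pred & gt)
    fp = np.sum(pred & ~gt)
    fn = np.sum(~pred & gt)
    return 2 * tp / (2 * tp + fp + fn + 1e-9)


rng = np.random.default_rng(0)
gt = np.zeros((256, 256), dtype=bool)
gt[64:128, 64:128] = True                          # synthetic forged region

# Synthetic anomaly heatmap: moderate response on the forgery, noise elsewhere.
heatmap = 0.3 * rng.random((256, 256))
heatmap[gt] += 0.3

f1_fixed = f1(heatmap > 0.5, gt)                   # strategy 1: fixed threshold 0.5
f1_best = max(f1(heatmap > t, gt)                  # strategy 2: best threshold per image
              for t in np.linspace(0.0, 1.0, 101))

print(f"F1 at fixed threshold 0.5: {f1_fixed:.3f}")
print(f"F1 at best threshold     : {f1_best:.3f}")
```

On this toy heatmap the two strategies already differ by roughly a factor of two, even though the prediction map and the ground truth are identical in both cases.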

In the specific case of MVSS-Net, the main differences concern how the F1 metric is computed and which variants of the datasets are used.

As you can imagine, changing the way the F1 metric is computed, or using different variants of the datasets, drastically changes the final value.
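As a second hedged illustration (again synthetic data, not taken from this repository), even with identical binarised maps the dataset-level number depends on how per-image scores are aggregated: averaging per-image F1 values and computing a single F1 over all pooled pixels are both common in the literature and can give noticeably different results.

```python
import numpy as np


def f1(pred: np.ndarray, gt: np.ndarray) -> float:
    """Pixel-level F1 between two boolean masks."""
    tp = np.sum(pred & gt)
    fp = np.sum(pred & ~gt)
    fn = np.sum(~pred & gt)
    return 2 * tp / (2 * tp + fp + fn + 1e-9)


rng = np.random.default_rng(1)

# One small, poorly localised image and one large, well localised image (synthetic).
masks = []
for size, error_rate in ((32, 0.4), (256, 0.02)):
    gt = rng.random((size, size)) < 0.2                   # ~20% forged pixels
    pred = gt ^ (rng.random((size, size)) < error_rate)   # flip a fraction of pixels
    masks.append((pred, gt))

macro = np.mean([f1(p, g) for p, g in masks])             # average of per-image F1
micro = f1(np.concatenate([p.ravel() for p, _ in masks]),
           np.concatenate([g.ravel() for _, g in masks])) # F1 over pooled pixels

print(f"average of per-image F1: {macro:.3f}")
print(f"F1 over pooled pixels  : {micro:.3f}")
```

The pooled score is dominated by the large image, while the per-image average weights both images equally, so the two aggregations diverge whenever image sizes or difficulty vary across the dataset.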

In this repository, the file test_docker/metrics.py contains the functions we used to compute the metrics.