Hi! Your work is very interesting, but I have some questions about the testing datasets.
May I ask how the pixel-level F1 performance for image forgery localization is evaluated in Table 1 of the paper? Since you mentioned in another issue that you used the same subset as SPAN and evaluated on only 160 images for the NIST dataset, how did you evaluate the CASIA and Coverage datasets? SPAN's practice is to pre-train on a synthetic dataset and then, for CASIA, use CASIAv2 as the train split and CASIAv1 as the test split; for the Coverage dataset, a 75:25 training-testing split is used.
Could you tell us the specific number of images in each test dataset? Thank you!
The code used to compute the metrics is in test_docker/metrics.py. We compute F1 on both the heatmap and the inverted heatmap and take the maximum. Note that in this code the ground truth has to be 0 for real pixels and 1 for fake pixels (careful: DSO-1 has inverted ground truths).
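For readers who don't want to dig into the repository right away, here is a minimal sketch of that max-F1 protocol (this is not the code from test_docker/metrics.py; the function name `max_f1` and the 0.5 threshold are illustrative assumptions):

```python
# Minimal sketch of the evaluation described above: pixel-level F1 is computed
# on both the thresholded heatmap and its inverted version, and the larger
# value is kept. Ground-truth convention: 0 = real pixel, 1 = fake pixel.
import numpy as np
from sklearn.metrics import f1_score

def max_f1(heatmap: np.ndarray, gt_mask: np.ndarray, threshold: float = 0.5) -> float:
    """heatmap: float scores in [0, 1]; gt_mask: binary map (1 = fake pixel)."""
    gt = gt_mask.flatten().astype(int)
    pred = (heatmap.flatten() > threshold).astype(int)
    f1_direct = f1_score(gt, pred, zero_division=0)      # heatmap as-is
    f1_inverted = f1_score(gt, 1 - pred, zero_division=0)  # inverted heatmap
    return max(f1_direct, f1_inverted)
```

The actual thresholding and any per-dataset handling (e.g. the inverted DSO-1 ground truths) should be checked against test_docker/metrics.py, which is the authoritative implementation.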
We did not change the training set based on the test set, and we did not fine-tune on their training splits. As for CASIA, CASIA v2 is in the training set and CASIA v1+ is in the test set. CASIA v1+ is a version of CASIA v1 made by MVSS-Net++ in which the real images are drawn from the COREL dataset, so that there is no overlap with CASIA v2 (seen in training). Every other test dataset is evaluated in full, since we did not use a split for fine-tuning (except for OpenForensics and NIST, where we use a subset due to computational constraints).
The number of test images for each dataset is reported in Table 1 of the supplemental material or Table 8 of the arXiv version.
I added the lists of the images used in testing to the folder test_docker/data_test/