shanface33 / AutoSplice_Dataset

AutoSplice: A Text-prompt Manipulated Image Dataset for Media Forensics, WMF@CVPR2023
https://openaccess.thecvf.com/content/CVPR2023W/WMF/papers/Jia_AutoSplice_A_Text-Prompt_Manipulated_Image_Dataset_for_Media_Forensics_CVPRW_2023_paper.pdf

Separation of training and test sets #2

Closed zjbthomas closed 1 year ago

zjbthomas commented 1 year ago

Many thanks for this interesting work and the sharing of the dataset to the public!

I am wondering how you separated the images in the dataset into training and test sets when fine-tuning the models, as described in your paper.

Specifically, I have two questions:

  1. When you say there is "no overlapping between the two (training and testing) sets", did you ensure that, for example, when an authentic image like 39406.jpg is placed in the training set, its corresponding forged images like 39406_0.jpg are also placed in the training set and never in the testing set? That is, do all images from the same original source appear in the same set?
  2. The numbers of authentic and forged images (taking Forged_JPEG100 as an example) are not the same. In addition, one authentic image (e.g., 39406.jpg) may have multiple forged versions (e.g., 39406_0.jpg and 39406_1.jpg). How did you handle this imbalance when building the training and test sets?

Thank you in advance and I look forward to your reply!

shanface33 commented 1 year ago

We appreciate your feedback! We divided the images into training and testing sets by randomly selecting approximately 80% of image IDs for training and the remaining 20% for testing.

Specifically,

  1. You have understood correctly. The division of the dataset is performed according to the image ID. This means that if a specific ID, like 39406, is assigned to the training set, all related images with that ID, such as Authentic/39406.jpg and Forged_JPEG75/39406_0.jpg, Forged_JPEG75/39406_1.jpg, and Forged_JPEG75/39406_2.jpg, will be included in the training set as well.
  2. First, the difference in the number of authentic and forged images resulted from the data-cleaning process. When certain authentic images produced low-quality forged images, we removed the forged images while retaining the authentic ones. Second, the imbalance comes from both the data-cleaning process and the multiple outputs generated by the DALL-E 2 model. Although we removed some forged images during data cleaning, we retained their authentic counterparts to mitigate the imbalance. In our experiments, we used the imbalanced dataset, with a forged/authentic ratio of approximately 1.6:1, for fine-tuning both detection tasks. To obtain a balanced dataset, one possible approach is to supplement it with additional authentic images sourced from the Visual News dataset.
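An ID-grouped split like the one described above can be sketched as follows. This is a hypothetical helper, not the authors' actual script; the directory layout (`Authentic/`, `Forged_JPEG75/`) and the `<id>_<k>.jpg` naming for forged images are taken from the filenames mentioned in this thread.

```python
import random

def split_by_image_id(authentic_files, forged_files, train_frac=0.8, seed=0):
    """Split file lists into train/test so that every file sharing an
    image ID lands in the same set (hypothetical sketch, not the
    authors' code).

    Image ID = the filename stem before any "_k" suffix, so
    "Authentic/39406.jpg" and "Forged_JPEG75/39406_0.jpg" both map
    to ID "39406".
    """
    def image_id(path):
        stem = path.rsplit("/", 1)[-1].rsplit(".", 1)[0]
        return stem.split("_")[0]

    # Randomly assign ~80% of IDs to training, the rest to testing.
    ids = sorted({image_id(f) for f in authentic_files})
    rng = random.Random(seed)
    rng.shuffle(ids)
    train_ids = set(ids[: int(len(ids) * train_frac)])

    def assign(files):
        train = [f for f in files if image_id(f) in train_ids]
        test = [f for f in files if image_id(f) not in train_ids]
        return train, test

    # Forged images follow their authentic source's ID, so there is
    # no source-level leakage between the two sets.
    return assign(authentic_files), assign(forged_files)
```

Splitting on IDs rather than on individual files is what guarantees the "no overlap" property from question 1: a forged image can never end up in the test set while its authentic source is in the training set.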

I hope this will help. Please feel free to let us know if you have further questions. Thanks!