Inflammatory or misleading "fake" news content has become increasingly common in recent years. Simultaneously, it has become easier than ever to use AI tools to generate photorealistic images depicting any scene imaginable. Combining these two -- AI-generated fake news content -- is particularly potent and dangerous. To combat the spread of AI-generated fake news, we propose the MiRAGeNews Dataset, a dataset of 12,500 high-quality real and AI-generated image-caption pairs from state-of-the-art generators. We find that our dataset poses a significant challenge to humans (60% F-1) and state-of-the-art multi-modal LLMs (< 24% F-1). Using our dataset, we train a multi-modal detector (MiRAGe) that improves by +5.1% F-1 over state-of-the-art baselines on image-caption pairs from out-of-domain image generators and news publishers. We release our code and data to aid future work on detecting AI-generated content.
The MiRAGeNews dataset contains a total of 15,000 pieces of real or AI-generated multimodal news (image-caption pairs): a training set of 10,000 pairs, a validation set of 2,500 pairs, and five test sets of 500 pairs each. Four of the test sets contain out-of-domain data from unseen news publishers and image generators, to evaluate a detector's generalization ability.
Download MiRAGeNews from Hugging Face:
from datasets import load_dataset
dataset = load_dataset("anson-huang/mirage-news")
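Once the download works, it can help to check which splits and columns actually arrived before writing any training code. The following is a minimal sketch, not part of the released code: `summarize_splits` and `load_and_summarize` are hypothetical helper names, and the sketch deliberately makes no assumption about the split or column names used on the Hub.

```python
def summarize_splits(dataset_dict):
    """Map each split name to (number of rows, list of column names).

    Works on any mapping whose values expose `num_rows` and
    `column_names`, as the splits returned by `load_dataset` do.
    """
    return {
        name: (split.num_rows, list(split.column_names))
        for name, split in dataset_dict.items()
    }


def load_and_summarize(repo_id="anson-huang/mirage-news"):
    """Download the dataset (requires network access) and summarize it."""
    # Imported lazily so that summarize_splits can be used without
    # the `datasets` package installed.
    from datasets import load_dataset

    return summarize_splits(load_dataset(repo_id))


def print_summary(summary):
    """Print one line per split, followed by the total row count."""
    for name, (rows, columns) in summary.items():
        print(f"{name}: {rows} rows, columns = {columns}")
    print("total rows:", sum(rows for rows, _ in summary.values()))
```

Calling `print_summary(load_and_summarize())` prints whatever splits the Hub repository exposes. If the hosted layout mirrors the description above, the row counts should add up to 10,000 + 2,500 + 5 × 500 = 15,000, but that is worth verifying rather than assuming.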
We will release three detectors for different modalities: MiRAGe-Img for Image-only Detection, MiRAGe-Txt for Text-only Detection, and MiRAGe for Multimodal Detection. Pretrained models and code will be available soon.
Our detectors are more robust on out-of-domain (OOD) data from unseen news publishers and image generators than state-of-the-art (SOTA) multimodal LLMs (MLLMs) and detectors.
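Robustness here is reported as F-1 on each test set. As a rough illustration of that comparison, the sketch below computes F-1 separately per test set from ground-truth and predicted labels. It is not the released evaluation code: the function names are hypothetical, and treating label `1` as "AI-generated" (the positive class) is an assumption made only for this example.

```python
def f1_score(y_true, y_pred, positive=1):
    """Binary F-1 for the `positive` class (assumed: 1 = AI-generated)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision or recall is zero (or undefined),
        # so F-1 is reported as 0.0 by convention.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def f1_per_test_set(results):
    """Compute F-1 for each test set.

    `results` maps a test-set name to a (y_true, y_pred) pair of
    equal-length label sequences.
    """
    return {name: f1_score(y_true, y_pred)
            for name, (y_true, y_pred) in results.items()}
```

Reporting one score per test set, rather than a single pooled score, keeps in-domain and out-of-domain performance visibly separate.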
This research is supported in part by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via the HIATUS Program contract #2022-22072200005. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of ODNI, IARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation therein.