FFAA: Multimodal Large Language Model based Explainable Open-World Face Forgery Analysis Assistant [Paper][Project Page]
Zhengchao Huang, Bin Xia, Zicheng Lin, Zhun Mou, Wenming Yang
Clone this repository and navigate to the FFAA folder:

```bash
git clone https://github.com/thu-huangzc/FFAA.git
cd FFAA
```
Install the package:

```bash
conda create -n ffaa python=3.9 -y
conda activate ffaa
pip install --upgrade pip  # enable PEP 660 support
pip install -e .
```
Install additional packages for training:

```bash
pip install -e ".[train]"
```

The training dataset has not been made public yet, so you do not need to install these packages for now.
To upgrade to the latest code base:

```bash
git pull
pip install -e .
```
FFAA training consists of two stages: (1) Fine-tuning the MLLM with hypothetical prompts: we add hypothetical prompts to the 20K FFA-VQA dataset that presume the face is either real or fake prior to analysis; fine-tuning the MLLM on this dataset enables it to generate answers under varying hypotheses. (2) Training MIDS with historical answers: we use the fine-tuned MLLM to collect answers from unused samples in the MA dataset and train MIDS on them.
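For illustration only, the sketch below shows how hypothetical prompts of this kind could be attached to a question. The actual prompt wording comes from the released FFA-VQA data; `build_prompts` and its text are assumptions.

```python
# Illustrative sketch: the real hypothetical prompts are defined in the FFA-VQA
# dataset; the wording below is an assumption used only to show the idea.
BASE_QUESTION = "Is this face real or fake? Analyze the face and explain your reasoning."

def build_prompts(question: str = BASE_QUESTION) -> dict:
    """Return one non-hypothetical prompt and two hypothetical variants."""
    return {
        "none": question,                                  # no prior assumption
        "real": "Suppose the face is real. " + question,   # presume real before analysis
        "fake": "Suppose the face is fake. " + question,   # presume fake before analysis
    }

if __name__ == "__main__":
    for tag, prompt in build_prompts().items():
        print(f"[{tag}] {prompt}")
```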
FFAA is trained on 2 RTX 3090 GPUs, each with 24 GB of memory.
Module | Global Batch Size | Learning rate | Epochs | LoRA Rank | LoRA alpha |
---|---|---|---|---|---|
LLaVA-v1.6-Mistral-7B | 16 | 1e-4 | 3 | 32 | 48 |
Module | Global Batch Size | Learning rate | Epochs | Weight decay |
---|---|---|---|---|
MIDS | 48 | 1e-4 | 2 | 1e-5 |
Other hyperparameter settings can be found in scripts/train.
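For reference, the LoRA settings in the table could be expressed with Hugging Face PEFT roughly as below; the authoritative configuration lives in scripts/train, and the target modules and dropout here are assumptions.

```python
from peft import LoraConfig

# Rank and alpha come from the table above; everything else is an assumed placeholder.
lora_config = LoraConfig(
    r=32,                                                     # LoRA rank
    lora_alpha=48,                                            # LoRA alpha
    lora_dropout=0.05,                                        # assumed, see scripts/train
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed attention projections
    task_type="CAUSAL_LM",
)
# The adapter is attached with peft.get_peft_model(base_mllm, lora_config) and trained
# for 3 epochs at a 1e-4 learning rate with a global batch size of 16.
```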
We use LLaVA as our base MLLM module; you can also choose other MLLMs as the backbone. In the paper, we select LLaVA-v1.6-Mistral-7B, and here are the available download links: Huggingface, Model Scope.
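For example, the checkpoint can be fetched with huggingface_hub; the repository id below is an assumption, so use the id from the download link above if it differs.

```python
from huggingface_hub import snapshot_download

# Assumed Hugging Face repository id for LLaVA-v1.6-Mistral-7B.
snapshot_download(
    repo_id="liuhaotian/llava-v1.6-mistral-7b",
    local_dir="checkpoints/llava-v1.6-mistral-7b",
)
```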
Download the 20K FFA-VQA dataset containing hypothetical prompts and place the folder in ./playground/
Training script with DeepSpeed: finetune_mistral_lora.sh
You can use the fine-tuned MLLM to collect historical answer data from the unused face images. For each image, we use one non-hypothetical prompt and two hypothetical prompts to obtain three answers. Since MIDS is designed to select the correct answer from multiple candidates, an image is not added to the dataset if all three answers agree.
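A minimal sketch of this filtering step, assuming a hypothetical answer_fn(image, prompt) wrapper around the fine-tuned MLLM and a crude verdict parser (neither is part of the released code):

```python
def parse_verdict(answer: str) -> str:
    """Crude illustrative parser: reduce a free-form answer to 'real' or 'fake'."""
    return "fake" if "fake" in answer.lower() else "real"

def collect_historical_answers(image_paths, prompts, answer_fn):
    """Keep an image only when the three answers do not all agree."""
    dataset = []
    for image_path in image_paths:
        answers = {tag: answer_fn(image_path, p) for tag, p in prompts.items()}
        verdicts = {parse_verdict(a) for a in answers.values()}
        if len(verdicts) > 1:  # disagreement makes the sample useful for training MIDS
            dataset.append({"image": str(image_path), "answers": answers})
    return dataset
```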
Alternatively, you can download the 90K Mistral-FFA-VQA dataset we provide.
After downloading the dataset, place the folder in ./playground/
Training script with DeepSpeed: train_mids_v1.sh
We evaluate models on OW-FFA-Bench, which consists of 6 generalization test sets.
First, download OW-FFA-Bench and the Multi-attack test set following benchmark.md.
Second, organize the folders in ./benchmark as follows:

```
benchmark/
    dfd/
        imgs/
        dfd.json
    dfdc/
    dpf/
    ma/
    mffdi/
    pgc/
    wfir/
```
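An optional sanity check of this layout (not part of the repository; it assumes every test set follows the imgs/ plus <name>.json pattern shown for dfd):

```python
from pathlib import Path

BENCHMARKS = ["dfd", "dfdc", "dpf", "ma", "mffdi", "pgc", "wfir"]

def check_benchmark_layout(root: str = "./benchmark") -> None:
    """Print which test-set folders are present with the expected files."""
    for name in BENCHMARKS:
        folder = Path(root) / name
        ok = (folder / "imgs").is_dir() and (folder / f"{name}.json").is_file()
        print(f"{name:6s} {'ok' if ok else 'MISSING'}")

if __name__ == "__main__":
    check_benchmark_layout()
```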
Third, run the eval scripts:

```bash
CUDA_VISIBLE_DEVICES=0 python eval_llava.py --benchmark BENCHMARK_NAME --model_name llava-mistral-7b
CUDA_VISIBLE_DEVICES=0 python eval.py --benchmark BENCHMARK_NAME --model mistral --generate_num 3 --eval_num -1
```
BENCHMARK_NAME can be one of dfd, dfdc, dpf, ma, mffdi, pgc, or wfir. Setting --eval_num to -1 means all test samples are evaluated. Last, the results will be saved in ./results/.
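To run the whole benchmark in one go, a small driver like the following could wrap the two commands above (an optional convenience sketch, not part of the repository):

```python
import os
import subprocess

BENCHMARKS = ["dfd", "dfdc", "dpf", "ma", "mffdi", "pgc", "wfir"]
env = {**os.environ, "CUDA_VISIBLE_DEVICES": "0"}

for name in BENCHMARKS:
    # Generate MLLM answers for the test set, then score them, as in the commands above.
    subprocess.run(
        ["python", "eval_llava.py", "--benchmark", name, "--model_name", "llava-mistral-7b"],
        check=True, env=env,
    )
    subprocess.run(
        ["python", "eval.py", "--benchmark", name, "--model", "mistral",
         "--generate_num", "3", "--eval_num", "-1"],
        check=True, env=env,
    )
```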
You can place the test images in ./playground/test_images. The prompts are shown in prompts.txt. Then change the path of the image you want to test in inference.py and run as follows:

```bash
CUDA_VISIBLE_DEVICES=0 python inference.py --crop 1 --visualize 1
```
--crop 1 means the face in the image will be automatically cropped. --visualize 1 means the heatmaps of MIDS will be visualized and saved in ./heatmaps/.
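For reference, automatic face cropping in the spirit of --crop 1 could be approximated with OpenCV's Haar cascade detector; this is only an illustrative sketch, not necessarily the detector used by inference.py.

```python
import cv2

def crop_largest_face(image_path: str, margin: float = 0.2):
    """Illustrative crop: detect the largest face and return it with a small margin."""
    image = cv2.imread(image_path)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    detector = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
    )
    faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return image  # fall back to the full image when no face is detected
    x, y, w, h = max(faces, key=lambda box: box[2] * box[3])  # largest detection
    dx, dy = int(w * margin), int(h * margin)
    return image[max(0, y - dy): y + h + dy, max(0, x - dx): x + w + dx]
```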
If you find FFAA useful for your research and applications, please cite using this BibTeX:
```bibtex
@article{huang2024ffaa,
  title={FFAA: Multimodal Large Language Model based Explainable Open-World Face Forgery Analysis Assistant},
  author={Huang, Zhengchao and Xia, Bin and Lin, Zicheng and Mou, Zhun and Yang, Wenming},
  journal={arXiv preprint arXiv:2408.10072},
  year={2024}
}
```