Please check whether this paper is about 'Voice Conversion' or not.
article info.
title: VoiceWukong: Benchmarking Deepfake Voice Detection
summary: With the rapid advancement of technologies like text-to-speech (TTS) and voice conversion (VC), detecting deepfake voices has become increasingly crucial. However, both academia and industry lack a comprehensive and intuitive benchmark for evaluating detectors. Existing datasets are limited in language diversity and lack many manipulations encountered in real-world production environments. To fill this gap, we propose VoiceWukong, a benchmark designed to evaluate the performance of deepfake voice detectors. To build the dataset, we first collected deepfake voices generated by 19 advanced and widely recognized commercial tools and 15 open-source tools. We then created 38 data variants covering six types of manipulations, constructing the evaluation dataset for deepfake voice detection. VoiceWukong thus includes 265,200 English and 148,200 Chinese deepfake voice samples. Using VoiceWukong, we evaluated 12 state-of-the-art detectors. AASIST2 achieved the best equal error rate (EER) of 13.50%, while all others exceeded 20%. Our findings reveal that these detectors face significant challenges in real-world applications, with dramatically declining performance. In addition, we conducted a user study with more than 300 participants. We compared the results with the performance of the 12 detectors and a multimodal large language model (MLLM), i.e., Qwen2-Audio: different detectors and humans exhibit varying identification capabilities for deepfake voices at different deception levels, while the MLLM demonstrates no detection ability at all. Furthermore, we provide a leaderboard for deepfake voice detection, publicly available at https://voicewukong.github.io.
id: http://arxiv.org/abs/2409.06348v1
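Note on the metric: the abstract ranks detectors by equal error rate (EER), the operating point at which the false acceptance rate (FAR, spoofed samples accepted as bona fide) equals the false rejection rate (FRR, bona fide samples rejected). As a minimal sketch of how such a number is typically computed from detector scores (the function name, score convention, and toy data below are illustrative assumptions, not taken from the paper):

```python
import numpy as np

def compute_eer(bonafide_scores: np.ndarray, spoof_scores: np.ndarray) -> float:
    """Estimate the EER: the threshold where FAR equals FRR.
    Assumes higher scores mean 'more likely bona fide'."""
    # Candidate decision thresholds: every observed score.
    thresholds = np.sort(np.concatenate([bonafide_scores, spoof_scores]))
    # FRR: fraction of bona fide samples scored below the threshold.
    frr = np.array([(bonafide_scores < t).mean() for t in thresholds])
    # FAR: fraction of spoofed samples scored at or above the threshold.
    far = np.array([(spoof_scores >= t).mean() for t in thresholds])
    # The EER lies where the two curves cross; report the midpoint there.
    idx = np.nanargmin(np.abs(far - frr))
    return float((far[idx] + frr[idx]) / 2)

# Toy usage with synthetic scores (illustrative only).
rng = np.random.default_rng(0)
bonafide = rng.normal(1.0, 1.0, 1000)
spoof = rng.normal(-1.0, 1.0, 1000)
print(f"EER: {compute_eer(bonafide, spoof):.2%}")
```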
judge
Write [vclab::confirmed] or [vclab::excluded] in a comment.