Create a benchmark dataset of Audio Deepfakes

Goal

To create a benchmark dataset of audio files to assist in the evaluation of deepfake detection tools.

Overview

During the first quarter after the DAU's launch, a trend has emerged: audio content manipulated with a variety of techniques. This includes video files whose audio track has been manipulated. Being able to reliably identify the manipulated portions of an audio file is therefore essential. The manipulation techniques noted so far are:

Splicing synthetically generated media into a natural audio recording
Overdubbing a video with mimicry (performed by a human, so no synthetic media is involved)
Using tools like ElevenLabs to generate synthetic media in a celebrity's voice from text

While work is underway on techniques to detect the various types of manipulation used in audio files received by the DAU, we want to create a standard benchmark dataset of audio files. The goal is for this dataset to be a useful tool for evaluating the performance of the various proprietary and open source tools that we might use in the project.
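As a rough illustration of how a labelled benchmark could be used to compare tools, here is a minimal sketch that scores one detector's predictions against the dataset's natural/synthetic labels. The function name `score_detector` and the choice of accuracy, precision and recall are assumptions made for illustration; the metrics to report have not been decided in this issue.

```python
def score_detector(labels, predictions):
    """Compare a detector's output with ground truth.

    labels, predictions: sequences of bools, True meaning "synthetic".
    Returns accuracy, precision and recall; a metric is None when its
    denominator is zero.
    """
    labels = list(labels)
    predictions = list(predictions)
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")

    tp = fp = tn = fn = 0
    for truth, guess in zip(labels, predictions):
        if truth and guess:
            tp += 1          # synthetic clip correctly flagged
        elif truth and not guess:
            fn += 1          # synthetic clip missed
        elif not truth and guess:
            fp += 1          # natural clip wrongly flagged
        else:
            tn += 1          # natural clip correctly passed

    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total if total else None,
        "precision": tp / (tp + fp) if (tp + fp) else None,
        "recall": tp / (tp + fn) if (tp + fn) else None,
    }
```

Running the same function over the output of each tool would give directly comparable numbers, provided every tool is evaluated on the same set of clips.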

Working Definitions

To avoid confusion, we will use the following definitions while working on this issue:

Natural Audio: A recording of a person made using a microphone and saved as a digital file
Synthetic Audio: Audio generated from scratch using techniques like generative AI and consumer apps like ElevenLabs
Audio Effects: The application of any DSP technique, such as stretching or slowing down, to a natural audio file
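To make the "Audio Effects" definition concrete, here is a minimal sketch of one such effect: slowing a clip down by resampling it with linear interpolation. It works on a plain list of sample values and uses only the standard library. The function name `slow_down` is hypothetical, and this is a deliberately naive technique chosen for illustration, not the method the project has settled on.

```python
def slow_down(samples, factor):
    """Stretch a sequence of audio samples by `factor` using linear interpolation.

    factor > 1 makes the clip longer (slower); factor < 1 makes it shorter.
    Played back at the original sample rate, this naive resampling also
    shifts the pitch, unlike a pitch-preserving time-stretch.
    """
    if factor <= 0:
        raise ValueError("factor must be positive")
    n_in = len(samples)
    n_out = int(n_in * factor)
    if n_in < 2 or n_out < 2:
        return list(samples)

    out = []
    for i in range(n_out):
        # Map output index i onto a fractional position in the input.
        pos = i * (n_in - 1) / (n_out - 1)
        lo = int(pos)
        hi = min(lo + 1, n_in - 1)
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out
```

A natural recording processed this way is still natural audio under the definitions above; it simply has an effect applied to it.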

Scope of the task

List about 10-15 public figures, split by language, accent and gender.
Get their audio recordings from publicly available repositories like YouTube.
Extract the audio and generate different versions of it where applicable, e.g. single sentence, long speech, monologue.
Automatically generate transcripts of their speech.
Convert the transcripts back to synthetic audio using open models and proprietary models. The dataset will include a column to mark how the synthetic media was generated.
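The step of generating different versions of a recording could be sketched as follows. This is a minimal example using Python's standard `wave` module, and it assumes the audio has already been extracted to an uncompressed WAV file. The function name `cut_clip` and the idea of cutting by start and end time are assumptions for illustration; transcription and synthesis are not covered here, since those tools are still to be chosen.

```python
import wave


def cut_clip(src_path, dst_path, start_s, end_s):
    """Copy the span [start_s, end_s) in seconds from one WAV file to another.

    Returns the number of audio frames written.
    """
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        rate = params.framerate
        # Convert times to frame indices and clamp them to the file length.
        start = max(0, int(start_s * rate))
        end = min(params.nframes, int(end_s * rate))
        if end <= start:
            raise ValueError("requested span is empty or outside the recording")
        src.setpos(start)
        frames = src.readframes(end - start)

    with wave.open(dst_path, "wb") as dst:
        # Keep the source format so only the duration changes.
        dst.setnchannels(params.nchannels)
        dst.setsampwidth(params.sampwidth)
        dst.setframerate(rate)
        dst.writeframes(frames)
    return end - start
```

Calling it several times on the same source with different time spans would yield the single-sentence, long-speech and monologue variants from one recording.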

Deliverable

An open dataset with the following columns:

Name of the celebrity
Language being spoken in the audio
Gender
Quality of the audio
Natural or Synthetic
If synthetic, the tool used
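One possible way to pin down these columns is a small record type plus a CSV writer, sketched below. The field names (`celebrity_name`, `audio_quality`, `synthesis_tool`, and so on) and the CSV format are assumptions for illustration; the issue does not fix a file format or column naming.

```python
import csv
from dataclasses import dataclass, asdict, fields
from typing import Optional


@dataclass
class ClipRecord:
    celebrity_name: str
    language: str
    gender: str
    audio_quality: str
    is_synthetic: bool
    synthesis_tool: Optional[str] = None  # filled in only for synthetic clips

    def __post_init__(self):
        # Enforce the "if synthetic, the tool used" rule in both directions.
        if self.is_synthetic and not self.synthesis_tool:
            raise ValueError("synthetic clips must name the tool used")
        if not self.is_synthetic and self.synthesis_tool:
            raise ValueError("natural clips must not name a synthesis tool")


def write_manifest(records, fh):
    """Write records as CSV rows to an open text file handle."""
    writer = csv.DictWriter(fh, fieldnames=[f.name for f in fields(ClipRecord)])
    writer.writeheader()
    for record in records:
        writer.writerow(asdict(record))
```

For example, `write_manifest([ClipRecord("Speaker A", "Hindi", "female", "high", False)], fh)` writes a header row followed by one row for a natural clip; the values shown are placeholders, not real dataset entries.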

Approach

Let's plan to work on this collaboratively. We can discuss:

which celebrity's data we are working on
which transcription tool we are using
which tool we are using to generate synthetic audio

Having a mix of techniques and transcription tools shouldn't hurt. But it would be nice if we keep sharing our progress here so we're not solving problems that we already have a working solution for.
