Goal
To create a benchmark dataset of audio files to assist in the evaluation of deepfake detection tools.
Overview
During the first quarter since the launch of the DAU, one trend that has emerged is the presence of various manipulation techniques in audio content, including video files whose audio track has been manipulated. Being able to reliably identify the manipulated portions of an audio file is therefore essential. The manipulation techniques noted so far are:
Splicing synthetically generated media into a natural audio recording
Overdubbing a video with mimicry by a human (and hence no synthetic media)
Using tools like ElevenLabs to generate synthetic speech in a celebrity's voice from text
While work is underway on techniques that can detect the various types of manipulation used in audio files received by the DAU, we want to create a standard benchmark dataset of audio files. The goal is for this dataset to be a useful tool for evaluating the performance of the various proprietary and open-source tools we might use in the project.
Working Definitions
To avoid confusion, we will use the following definitions while working on this issue:
Natural Audio: a recording of a person made using a microphone and saved to a digital file
Synthetic Audio: audio generated from scratch using techniques like generative AI and consumer apps like Midjourney, Canva, etc.
Audio Effects: the application of any DSP technique, such as stretching or slowing down, to a natural audio file (a short sketch follows this list)
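For concreteness, here is a minimal sketch of an "audio effect" as defined above: a DSP transformation (time-stretching) applied to a natural recording. The file names are hypothetical placeholders, and it assumes the librosa and soundfile packages are installed.

```python
import librosa
import soundfile as sf

# Load the natural recording at its original sample rate.
y, sr = librosa.load("natural_recording.wav", sr=None)

# Slow the audio down to 0.8x speed without changing its pitch.
y_slow = librosa.effects.time_stretch(y, rate=0.8)

sf.write("natural_recording_slowed.wav", y_slow, sr)
```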
Scope of the task
List about 10-15 public figures, covering a spread of languages, accents, and genders.
Get their audio recordings from publicly available repositories like YouTube.
Strip the audio and generate different versions of it where applicable, e.g. a single sentence, a long speech, a monologue.
Automatically generate transcripts of their speech.
Convert the transcripts back to synthetic audio using open and proprietary models. The dataset will include a column to mark how the synthetic media was generated. (A rough pipeline sketch follows this list.)
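Below is a rough sketch of the per-clip pipeline these steps describe, assuming yt-dlp for download, openai-whisper for transcription, and Coqui TTS for synthesis. These tool choices are examples rather than decisions, and the URL, model names, and file paths are hypothetical placeholders.

```python
import subprocess

import whisper            # pip install openai-whisper
from TTS.api import TTS   # pip install TTS (Coqui)

def fetch_audio(url: str, out_template: str) -> None:
    """Download a public recording and extract its audio track as WAV."""
    subprocess.run(
        ["yt-dlp", "-x", "--audio-format", "wav", "-o", out_template, url],
        check=True,
    )

def transcribe(wav_path: str) -> str:
    """Automatically generate a transcript with an open ASR model."""
    model = whisper.load_model("base")
    return model.transcribe(wav_path)["text"]

def synthesize(text: str, out_path: str) -> None:
    """Convert a transcript back into synthetic audio with an open TTS model."""
    tts = TTS("tts_models/en/ljspeech/tacotron2-DDC")
    tts.tts_to_file(text=text, file_path=out_path)

if __name__ == "__main__":
    fetch_audio("https://www.youtube.com/watch?v=<VIDEO_ID>", "natural.%(ext)s")
    transcript = transcribe("natural.wav")
    synthesize(transcript, "synthetic.wav")
```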
Deliverable
An open dataset with the following columns (a schema sketch follows the list):
Name of the celebrity
Language being spoken in the audio
Gender
Quality of the audio
Natural or Synthetic
If synthetic, the tool used
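A minimal sketch of this schema written as a CSV with Python's standard library; the column names and the sample row are illustrative only, and the final schema is up for discussion.

```python
import csv

FIELDS = [
    "celebrity_name",   # Name of the celebrity
    "language",         # Language being spoken in the audio
    "gender",           # Gender of the speaker
    "audio_quality",    # Quality of the audio, e.g. high/medium/low
    "label",            # "natural" or "synthetic"
    "generation_tool",  # If synthetic, the tool used; empty otherwise
]

sample_rows = [
    {"celebrity_name": "Jane Doe", "language": "Hindi", "gender": "female",
     "audio_quality": "high", "label": "synthetic",
     "generation_tool": "ElevenLabs"},
]

with open("benchmark_metadata.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(sample_rows)
```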
Approach
Let's plan to work on this collaboratively. We can discuss:
which celebrities' data we are working on;
which transcription tool we are using;
which tools we are using to generate synthetic audio.
Having a mix of techniques and transcription tools shouldn't hurt, but it would be good to keep sharing progress here so we don't end up re-solving problems that already have a working solution.