tattle-made / feluda

A configurable engine for analysing multi-lingual and multi-modal content.
https://tattle.co.in/products/feluda/
GNU General Public License v3.0

Create a benchmark dataset of Audio Deepfakes #365

Open dennyabrain opened 4 months ago

dennyabrain commented 4 months ago

Goal

To create a benchmark dataset for audio files to assist evaluation of deepfake detection tools.

Overview

During the first quarter of the DAU's launch, a trend that has emerged is the presence of various manipulation techniques in audio content. This also includes video files whose audio has been manipulated. As such, being able to reliably identify manipulated portions of an audio file is essential. The manipulation techniques noted so far are:

  1. Splicing synthetically generated media into a natural audio recording
  2. Overdubbing a video with human mimicry (and hence no synthetic media)
  3. Using tools like ElevenLabs to generate synthetic media in a celebrity's voice from text

While work is underway on techniques that can detect the various types of manipulation used in audio files received by the DAU, we want to create a standard benchmark dataset of audio files. The goal is for this dataset to be a useful tool for evaluating the performance of the various proprietary and open-source tools we might use in the project.

Working Definitions

To avoid confusion, we will use the following definitions while working on this issue:

  1. Natural Audio: A recording of a person made using a microphone and saved to a digital file
  2. Synthetic Audio: Audio generated from scratch using techniques like generative AI and consumer apps like Midjourney, Canva, etc.
  3. Audio Effects: The application of any DSP technique, such as stretching or slowing down, to a natural audio file
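To make the "audio effects" category concrete, here is a minimal sketch of one such DSP edit: a naive speed change applied to raw mono samples. This is an illustration only (nearest-neighbour resampling, pure Python), not a proposal for how the dataset's effects should actually be produced.

```python
# Naive speed-change "audio effect" on a mono signal.
# speed > 1.0 shortens the clip (faster playback); speed < 1.0 lengthens it.
def change_speed(samples: list[float], speed: float) -> list[float]:
    if speed <= 0:
        raise ValueError("speed must be positive")
    out_len = int(len(samples) / speed)
    # Nearest-neighbour resampling: map each output position back to a
    # source sample. Crude, but it shows a DSP edit applied to an
    # otherwise natural recording.
    return [samples[min(int(i * speed), len(samples) - 1)] for i in range(out_len)]

original = [0.0, 0.5, 1.0, 0.5, 0.0, -0.5, -1.0, -0.5]
slowed = change_speed(original, 0.5)   # half speed: twice as long
print(len(original), len(slowed))      # 8 16
```

The point is that an "effects" row in the benchmark is still natural audio at its core, which is exactly why it needs its own label.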

Scope of the task

  1. List about 10-15 public figures, spanning language, accent, and gender.
  2. Get their audio recordings from publicly available repositories like YouTube.
  3. Strip the audio and generate different versions of it (e.g. single sentence, long speech, monologue) where applicable.
  4. Automatically generate transcripts of their speech.
  5. Convert the transcripts back into synthetic audio using both open and proprietary models. The dataset will include a column recording how the synthetic media was generated.
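Step 3 (cutting different-length versions out of a source recording) can be sketched with just the standard-library `wave` module. The helper name `extract_clip` and the 16 kHz mono format are assumptions for illustration; real recordings from YouTube would first need to be converted to PCM WAV.

```python
import io
import wave

def extract_clip(src_wav: bytes, start_s: float, end_s: float) -> bytes:
    """Cut the [start_s, end_s) window out of a PCM WAV file."""
    with wave.open(io.BytesIO(src_wav), "rb") as w:
        params = w.getparams()
        rate = w.getframerate()
        w.setpos(int(start_s * rate))
        frames = w.readframes(int((end_s - start_s) * rate))
    buf = io.BytesIO()
    with wave.open(buf, "wb") as out:
        out.setparams(params)  # nframes is corrected automatically on close
        out.writeframes(frames)
    return buf.getvalue()

# Build a 2-second silent 16 kHz mono file as a stand-in for a real recording.
rate = 16000
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)       # 16-bit samples
    w.setframerate(rate)
    w.writeframes(b"\x00\x00" * rate * 2)

clip = extract_clip(buf.getvalue(), 0.5, 1.5)
with wave.open(io.BytesIO(clip), "rb") as w:
    print(w.getnframes() / w.getframerate())  # 1.0
```

The same helper could produce the "single sentence" and "long speech" variants from one source file by varying the window, keeping every clip traceable to its original recording.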

Deliverable

An open dataset with the following columns:

  1. Name of the celebrity
  2. Language spoken in the audio
  3. Gender
  4. Quality of the audio
  5. Natural or Synthetic
  6. If Synthetic, the tool used
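As a starting point for discussion, the columns above could be pinned down as a row schema plus a CSV export. The field names and example values here are assumptions, not a settled schema:

```python
import csv
import io
from dataclasses import dataclass, asdict, fields

# Hypothetical row layout mirroring the deliverable columns above.
@dataclass
class BenchmarkRow:
    celebrity: str
    language: str
    gender: str
    audio_quality: str          # e.g. "studio", "phone", "low-bitrate"
    label: str                  # "natural" or "synthetic"
    synthesis_tool: str = ""    # empty for natural audio

rows = [
    BenchmarkRow("Speaker A", "Hindi", "female", "studio", "natural"),
    BenchmarkRow("Speaker A", "Hindi", "female", "studio", "synthetic",
                 "open-tts-model"),
]

out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=[f.name for f in fields(BenchmarkRow)])
writer.writeheader()
writer.writerows(asdict(r) for r in rows)
print(out.getvalue().splitlines()[0])
# celebrity,language,gender,audio_quality,label,synthesis_tool
```

Fixing the header early would let contributors generate rows independently (per step 5's mix of open and proprietary models) and still merge them into one dataset.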

Approach

Let's plan to work on this collaboratively. We can discuss:

  1. which celebrity's data we are working on;
  2. which transcription tool we are using;
  3. which tool we are using to generate synthetic audio.

Having a mix of techniques and transcription tools shouldn't hurt. But it would be nice to keep sharing our progress here so we're not re-solving problems that already have a working solution.

dennyabrain commented 4 months ago

@swairshah has begun preliminary exploration here - https://github.com/swairshah/audio-research