Youtube Speech Data Generator

A python library to generate speech dataset. Youtube Speech Data Generator also takes care of almost all your speech data preprocessing needed to build a speech dataset along with their transcriptions making sure it follows a directory structure followed by most of the text-to-speech architectures.

Installation

Make sure ffmpeg is installed and is set to the system path.

$ pip install youtube-tts-data-generator

Minimal start for creating the dataset

from youtube_tts_data_generator import YTSpeechDataGenerator

# First create a YTSpeechDataGenerator instance:

generator = YTSpeechDataGenerator(dataset_name='elon')

# Now create a '.txt' file that contains a list of YouTube videos that contains speeches.
# NOTE - Make sure you choose videos with subtitles.

generator.prepare_dataset('links.txt')
# The above will take care about creating your dataset, creating a metadata file and trimming silence from the audios.

Usage

Initializing the generator: generator = YTSpeechDataGenerator(dataset_name='your_dataset',lang='en')
- Parameters:
- dataset_name:
  - The name of the dataset you'd like to give.
  - A directory structure like this will be created:
```
├───your_dataset
│   ├───txts
│   └───wavs
└───your_dataset_prep
├───concatenated
├───downloaded
└───split
```
- output_type:
  - The type of the metadata to be created after the dataset has been generated.
  - Supported types: csv/json
  - Default output type is set to csv
  - The csv file follows the format of LJ Speech Dataset
  - The json file follows this format:
```
{
"your_dataset1.wav": "This is an example text",
"your_dataset2.wav": "This is an another example text",
}
```
- keep_audio_extension:
  - Whether to keep the audio file extension in the metadata file
  - Default value is set to False
- lang:
  - The key for the target language in which the subtitles have to be downloaded.
  - Default value is set to en
  - Tip - check list of available languages and their keys using: generator.get_available_langs()
- sr:
  - Sample Rate to keep of the audios.
  - Default value is set to 22050
Methods:
- download():
- Downloads video files from YouTube along with their subtitles and saves them as wav files.
- Parameters:
  - links_txt:
  - Path to the '.txt' file that contains the urls for the videos.
- Usage of this method is optional. If you do not use this method, make sure to place all the audio and subtitle files in 'your_dataset_prep/downloaded' directory.
- Then, create a file called 'files.txt' and again place it under 'your_dataset_prep/downloaded'. 'files.txt' should follow the following format:
```
filename,subtitle,trim_min_begin,trim_min_end
audio.wav,subtitle.srt,0,0
audio2.wav,subtitle.vtt,5,6
```
- Create a '.txt' file that contains a list of YouTube videos that contains speeches.
- Example - generator.download('links.txt')
- split_audios():
- This method splits all the wav files into smaller chunks according to the duration of the text in the subtitles.
- Saves transcriptions as '.txt' file for each of the chunks.
- Example - generator.split_audios()
- concat_audios():
- Since the split audios are based on the duration of their subtitles, they might not be so long. This method joins the split files into recognizable ones.
- Parameters:
  - max_limit:
  - The upper limit of length of the audios that should be concated. The rest will be kept as they are.
  - The default value is set to 7
  - concat_count:
  - The number of consecutive audios that should be concated together.
  - The default value is set to 2
- Example - generator.concat_audios()
- finalize_dataset():
- Trims silence the joined audios since the data has been collected from YouTube and generates the final dataset after finishing all the preprocessing.
- Parameters:
  - min_audio_length:
  - The minumum length of the speech that should be kept. The rest will be ignored.
  - The default value is set set to 5.
  - max_audio_length:
  - The maximum length of the speech that should be kept. The rest will be ignored.
  - The default value is set set to 14.
- Example - generator.finalize_dataset(min_audio_length=6)
- get_available_langs():
- Get list of available languages in which the subtitles can be downloaded.
- Example - generator.get_available_langs()
- get_total_audio_length():
- Returns the total amount of preprocessed speech data collected by the generator.
- Example - generator.get_total_audio_length()
- prepare_dataset():
- A wrapper method for download(),split_audios(),concat_audios() and finalize_dataset().
- If you do not wish to use the above methods, you can directly call prepare_dataset(). It will handle all your data generation.
- Parameters:
  - links_txt:
  - Path to the '.txt' file that contains the urls for the videos.
  - sr:
  - Sample Rate to keep of the audios.
  - Default value is set to 22050
  - download_youtube_data:
  - Whether to download audios from YouTube.
  - Default value is True
  - max_concat_limit:
  - The upper limit of length of the audios that should be concated. The rest will be kept as they are.
  - The default value is set to 7
  - concat_count:
  - The number of consecutive audios that should be concated together.
  - The default value is set to 2
  - min_audio_length:
  - The minumum length of the speech that should be kept. The rest will be ignored.
  - The default value is set set to 5.
  - max_audio_length:
  - The maximum length of the speech that should be kept. The rest will be ignored.
  - The default value is set set to 14.
- Example - generator.prepare_dataset(links_txt='links.txt', download_youtube_data=True, min_audio_length=6)

Final dataset structure

Once the dataset has been created, the structure under 'your_dataset' directory should look like:

your_dataset
├───txts
│   ├───your_dataset1.txt
│   └───your_dataset2.txt
├───wavs
│    ├───your_dataset1.wav
│    └───your_dataset2.wav
└───metadata.csv/alignment.json

NOTE - audio.py is highly based on Real Time Voice Cloning

References

SRT to JSON

Read more about the library here

hetpandya / youtube_tts_data_generator