Silero-VAD Dataset

This dataset was created with the support of the Innovation Promotion Fund as part of the federal project "Artificial Intelligence" of the national program "Digital Economy of the Russian Federation".

The links below lead to .feather files containing Silero VAD labeling for open audio datasets, together with a brief description of each dataset and loading examples. The .feather files can be opened with the pandas library:

import pandas as pd
dataframe = pd.read_feather(PATH_TO_FEATHER_FILE)

Each .feather file with labeling contains the following columns:

speech_timings - the labeling of the audio. This is a list of dictionaries of the form {'start': START_SECOND, 'end': END_SECOND}, where START_SECOND and END_SECOND are the start and end times of a speech fragment in seconds. The list holds one dictionary per speech fragment found in the audio;
language - ISO code of the language of the audio.
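
As an illustration of the speech_timings format described above, the following minimal sketch sums the duration of all speech fragments for one audio file. The function name and the example values are hypothetical; only the {'start', 'end'} structure comes from the description above.

```python
def total_speech_seconds(speech_timings):
    """Sum the durations (in seconds) of all speech fragments in one labeling."""
    return sum(segment["end"] - segment["start"] for segment in speech_timings)


# Hypothetical labeling for a single audio file (values are made up).
example_timings = [
    {"start": 0.5, "end": 2.0},
    {"start": 3.0, "end": 4.5},
]

speech_seconds = total_speech_seconds(example_timings)
print(f"{len(example_timings)} fragments, {speech_seconds:.1f} s of speech")
```

On a loaded file the same function could be applied per row, for example dataframe["speech_timings"].apply(total_speech_seconds). Depending on the pandas and pyarrow versions, each cell may be an array rather than a list, which the function handles the same way because it only iterates.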

Columns containing information about loading audio files vary and are described for each dataset below.

All data is labeled at a temporal resolution of ~30 milliseconds (num_samples = 512).
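
The ~30 ms figure can be checked with a short calculation. The sketch below assumes a 16 kHz sampling rate, which is not stated here; with a different rate the window duration changes accordingly. The helper seconds_to_window is a hypothetical name introduced for illustration.

```python
NUM_SAMPLES = 512       # window size given in this README
SAMPLE_RATE = 16_000    # assumption: the sampling rate is not stated in this README

# Duration of one labeling window in milliseconds.
window_ms = NUM_SAMPLES * 1000 / SAMPLE_RATE
print(f"window duration: {window_ms} ms")


def seconds_to_window(t_seconds, num_samples=NUM_SAMPLES, sample_rate=SAMPLE_RATE):
    """Index of the labeling window that contains time t_seconds."""
    return int(t_seconds * sample_rate) // num_samples
```

Under this assumption, start and end times in speech_timings fall on a grid of roughly 30 ms steps, so boundaries finer than one window should not be expected.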

Name	Number of hours	Number of languages	Link	License	md5sum
Bible.is	53,138	1,596	URL	Unique	ea404eeaf2cd283b8223f63002be11f9
globalrecordings.net	9,743	6,171[^1]	URL	CC BY-NC-SA 4.0	3c5c0f31b0abd9fe94ddbe8b1e2eb326
VoxLingua107	6,628	107	URL	CC BY 4.0	5dfef33b4d091b6d399cfaf3d05f2140
Common Voice	30,329	120	URL	CC0	5e30a85126adf74a5fd1496e6ac8695d
MLS	50,709	8	URL	CC BY 4.0	a339d0e94bdf41bba3c003756254ac4e
Total	150,547	6,171+
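
The md5sum column can be used to verify a downloaded .feather file. A minimal sketch using only the Python standard library follows; the function names and the local file name in the usage note are hypothetical.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_md5(path, expected_hex):
    """Compare a file's MD5 digest with the expected value from the table."""
    return md5_of_file(path) == expected_hex.strip().lower()
```

For example, matches_md5("bible_is.feather", "ea404eeaf2cd283b8223f63002be11f9") should return True if the Bible.is file was downloaded intact, where "bible_is.feather" stands for whatever local name the file was saved under.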

Bible.is

Link to .feather file with labeling

The audio_link column contains links to specific audio files.

globalrecordings.net

Link to .feather file with labeling

The folder_link column contains links for downloading the .zip archive of a specific language. Note: the same archive link appears in multiple rows, because each archive may contain multiple audio files.
The audio_path column contains paths to specific audio files after unpacking the corresponding archive from the folder_link column.
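
Because archive links repeat across rows, it can help to group the audio paths by archive so that each archive is downloaded only once. The following is a minimal sketch; the function name and the example links and paths are hypothetical.

```python
from collections import defaultdict


def files_by_archive(rows):
    """Map each archive link to the list of audio paths it contains.

    rows is an iterable of (folder_link, audio_path) pairs.
    """
    grouped = defaultdict(list)
    for folder_link, audio_path in rows:
        grouped[folder_link].append(audio_path)
    return dict(grouped)


# Hypothetical rows; real links and paths come from the .feather file.
example_rows = [
    ("https://example.org/lang_a.zip", "lang_a/001.mp3"),
    ("https://example.org/lang_a.zip", "lang_a/002.mp3"),
    ("https://example.org/lang_b.zip", "lang_b/001.mp3"),
]

for link, paths in files_by_archive(example_rows).items():
    print(link, len(paths))
```

With a loaded file, the pairs could be produced with zip(dataframe["folder_link"], dataframe["audio_path"]).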

The number of unique ISO codes in this dataset does not match the actual number of languages represented, as some closely related languages may be encoded with the same ISO code.

VoxLingua107

Link to .feather file with labeling

The folder_link column contains links for downloading the .zip archive of a specific language. Note: the same archive link appears in multiple rows, because each archive may contain multiple audio files.
The audio_path column contains paths to specific audio files after unpacking the corresponding archive from the folder_link column.
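
A single audio file can also be read from a downloaded archive without unpacking everything. The sketch below uses the standard zipfile module and assumes that the audio_path value matches the member name inside the archive exactly, which should be verified against the real archives. The function name and the in-memory demo archive are hypothetical.

```python
import io
import zipfile


def read_audio_bytes(archive, audio_path):
    """Read one member of a .zip archive without unpacking the rest.

    archive may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(archive) as zf:
        return zf.read(audio_path)


# Demo with a small in-memory archive (contents are made up).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("lang/clip_001.wav", b"fake audio bytes")
buffer.seek(0)

data = read_audio_bytes(buffer, "lang/clip_001.wav")
print(len(data), "bytes read")
```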

Common Voice

Link to .feather file with labeling

This dataset cannot be downloaded via static links. To download it, follow the link, request access through the form provided there, and then download the archive for each available language. Note: the provided labeling is valid for version 16.1 of the original dataset.

The audio_path column contains unique names of .mp3 files obtained after downloading the corresponding dataset.
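
Since audio_path holds only file names, the labeling has to be matched against the files on disk after the archives are unpacked. One possible approach, sketched below, is to index every .mp3 file under the download directory by name. The function name and the directory layout are assumptions; the sketch also relies on the file names being unique across languages, as stated above.

```python
from pathlib import Path


def index_mp3_files(root):
    """Map each .mp3 file name found under root to its full path."""
    return {path.name: path for path in Path(root).rglob("*.mp3")}


# Usage sketch (the directory name is hypothetical):
#   index = index_mp3_files("cv-corpus-16.1")
#   dataframe["local_path"] = dataframe["audio_path"].map(index.get)
```

Rows whose file is not found locally get None in local_path, which makes missing downloads easy to spot.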

MLS

Link to .feather file with labeling

The folder_link column contains links for downloading the .zip archive of a specific language. Note: the same archive link appears in multiple rows, because each archive may contain multiple audio files.
The audio_path column contains paths to specific audio files after unpacking the corresponding archive from the folder_link column.

License

This dataset is distributed under the CC BY-NC-SA 4.0 license.

Citation

@misc{SileroVADDataset,
  author = {Silero Team},
  title = {Silero-VAD Dataset: a large public Internet-scale dataset for voice activity detection for 6000+ languages},
  year = {2024},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/snakers4/silero-vad/datasets/README.md}},
  email = {hello@silero.ai}
}

[^1]: The number of unique ISO codes in this dataset does not match the actual number of languages represented, as some closely related languages may be encoded with the same ISO code.
