Closed AmgadHasan closed 1 month ago
Here is a draft generated by llama-3-70B.
This dataset was created with the support of the Innovation Promotion Fund as part of the federal project "Artificial Intelligence" of the national program "Digital Economy of the Russian Federation".
The links below provide `.feather` files with open audio datasets labeled using Silero VAD, along with a brief description of each dataset and loading examples. `.feather` files can be opened with the `pandas` library:
import pandas as pd
dataframe = pd.read_feather(PATH_TO_FEATHER_FILE)
Each `.feather` file with labeling contains the following columns:
- `speech_timings`: the labeling of the audio. This is a list of dictionaries of the form `{'start': START_SECOND, 'end': END_SECOND}`, where `START_SECOND` and `END_SECOND` are the start and end times of speech in seconds. The number of dictionaries equals the number of speech fragments found in the audio;
- `language`: ISO code of the language of the audio.

Columns containing information about loading audio files vary and are described for each dataset below.
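As an illustration of the `speech_timings` structure described above, here is a minimal sketch that sums the speech duration for one audio file. The helper name `total_speech_seconds` and the example values are made up for illustration and are not part of the dataset.

```python
def total_speech_seconds(speech_timings):
    """Sum the durations of all speech segments for one audio file."""
    return sum(seg["end"] - seg["start"] for seg in speech_timings)


# Made-up example value for one row of the `speech_timings` column.
example_timings = [
    {"start": 0.5, "end": 2.0},
    {"start": 3.0, "end": 4.5},
]
example_total = total_speech_seconds(example_timings)
```

With a loaded DataFrame, something like `dataframe["speech_timings"].apply(total_speech_seconds)` should give per-file totals, assuming the column is read back as a sequence of dict-like items.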
All data is labeled at a time resolution of ~30 milliseconds (`num_samples` = 512).
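To make the resolution concrete, the sketch below converts the 512-sample window into milliseconds and turns second-based segments into per-window 0/1 labels. The 16 kHz sample rate is an assumption (the text does not state it), and the function names are hypothetical; with that rate a window lasts 32 ms, which is consistent with the "~30 milliseconds" figure.

```python
NUM_SAMPLES = 512      # window size quoted in the text
SAMPLE_RATE = 16_000   # assumption: the sample rate is not stated in the text


def frame_milliseconds(num_samples=NUM_SAMPLES, sample_rate=SAMPLE_RATE):
    """Duration of one labeling window in milliseconds."""
    return 1000 * num_samples / sample_rate


def to_frame_labels(speech_timings, audio_seconds,
                    num_samples=NUM_SAMPLES, sample_rate=SAMPLE_RATE):
    """Return a 0/1 label per window; 1 means the window overlaps speech."""
    total = round(audio_seconds * sample_rate)
    n_frames = -(-total // num_samples)  # ceiling division on integers
    labels = [0] * n_frames
    for seg in speech_timings:
        first = round(seg["start"] * sample_rate) // num_samples
        last = -(-round(seg["end"] * sample_rate) // num_samples)
        for i in range(first, min(last, n_frames)):
            labels[i] = 1
    return labels
```

Working in integer sample counts rather than fractional seconds avoids floating-point rounding surprises at window boundaries.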
Name | Number of hours | Number of languages | Link | License | md5sum |
---|---|---|---|---|---|
Bible.is | 53,138 | 1,596 | URL | Unique | ea404eeaf2cd283b8223f63002be11f9 |
globalrecordings.net | 9,743 | 6,171[^1] | URL | CC BY-NC-SA 4.0 | 3c5c0f31b0abd9fe94ddbe8b1e2eb326 |
VoxLingua107 | 6,628 | 107 | URL | CC BY 4.0 | 5dfef33b4d091b6d399cfaf3d05f2140 |
Common Voice | 30,329 | 120 | URL | CC0 | 5e30a85126adf74a5fd1496e6ac8695d |
MLS | 50,709 | 8 | URL | CC BY 4.0 | a339d0e94bdf41bba3c003756254ac4e |
Total | 150,547 | 6,171+ | | | |
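
The `md5sum` column in the table can be used to check a download. Below is a minimal sketch using only the standard library; it assumes the checksum refers to the `.feather` file itself, and the helper names are hypothetical.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_md5(path, expected_hex):
    """True if the file's MD5 equals the expected value (case-insensitive)."""
    return md5_of_file(path) == expected_hex.strip().lower()
```

For example, `matches_md5("bible_is.feather", "ea404eeaf2cd283b8223f63002be11f9")` would check the Bible.is file, where `bible_is.feather` is a placeholder for wherever the file was saved.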

Bible.is

Link to `.feather` file with labeling
- The `audio_link` column contains links to specific audio files.

globalrecordings.net

Link to `.feather`
file with labeling
- The `folder_link` column contains links to download `.zip` archives for specific languages. Note: archive links repeat across rows, because each archive may contain multiple audio files.
- The `audio_path` column contains paths to specific audio files after unpacking the corresponding archive from the `folder_link` column.

The number of unique ISO codes in this dataset does not match the actual number of languages represented, as some closely related languages may share the same ISO code.
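
Because the same archive link appears on many rows, it helps to deduplicate `folder_link` before downloading. The sketch below shows one way to do that with the standard library. The function names are hypothetical, and it assumes that `audio_path` is relative to the directory where the archive is unpacked, which the text does not confirm.

```python
import urllib.request
import zipfile
from pathlib import Path


def unique_links(folder_links):
    """Drop repeated archive links while keeping first-seen order."""
    return list(dict.fromkeys(folder_links))


def download_and_unpack(link, extract_root):
    """Download one .zip archive and unpack it under extract_root."""
    extract_root = Path(extract_root)
    extract_root.mkdir(parents=True, exist_ok=True)
    archive_path, _ = urllib.request.urlretrieve(link)  # saved to a temp file
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(extract_root)


def local_audio_path(extract_root, audio_path):
    """Assumption: audio_path is relative to the unpacked archive root."""
    return Path(extract_root) / audio_path
```

With a loaded DataFrame, `unique_links(dataframe["folder_link"])` would give the list of archives to fetch once each.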

VoxLingua107

Link to `.feather` file with labeling
- The `folder_link` column contains links to download `.zip` archives for specific languages. Note: archive links repeat across rows, because each archive may contain multiple audio files.
- The `audio_path` column contains paths to specific audio files after unpacking the corresponding archive from the `folder_link` column.

Common Voice

Link to `.feather`
file with labeling

This dataset cannot be downloaded via static links. To download it, follow the link, request access through the corresponding form, and then download the archives for each available language. Note: the provided labeling is valid for version 16.1 of the original dataset.
- The `audio_path` column contains the unique names of the `.mp3` files obtained after downloading the corresponding dataset.

MLS

Link to `.feather`
file with labeling
- The `folder_link` column contains links to download `.zip` archives for specific languages. Note: archive links repeat across rows, because each archive may contain multiple audio files.
- The `audio_path` column contains paths to specific audio files after unpacking the corresponding archive from the `folder_link` column.

This dataset is distributed under the CC BY-NC-SA 4.0 license.
@misc{Silero_VAD_Dataset,
author = {Silero Team},
title = {Silero-VAD Dataset: a large public Internet-scale dataset for voice activity detection for 6000+ languages},
year = {2024},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/snakers4/silero-vad/datasets/README.md}},
email = {hello@silero.ai}
}
[^1]: The number of unique ISO codes in this dataset does not match the actual number of languages represented, as some closely related languages may share the same ISO code.
Hi,
Thank you for releasing the dataset. I was going through the README but it's in Russian and I can't follow the instructions.
Can someone who speaks both Russian and English make an English version of the README? LLMs can help with translating it, but it's important to have a human review the translation, especially for technical topics.
Thx
https://github.com/snakers4/silero-vad/blob/master/datasets/README.md