snakers4 / silero-vad

Silero VAD: pre-trained enterprise-grade Voice Activity Detector
MIT License

English version of the dataset README #454

Closed AmgadHasan closed 1 month ago

AmgadHasan commented 1 month ago

Hi,

Thank you for releasing the dataset. I was going through the README but it's in Russian and I can't follow the instructions.

Can someone who speaks both Russian and English make an English version of the README? LLMs can help with translating it, but it's important to have a human review the translation, especially for technical topics.

Thx

https://github.com/snakers4/silero-vad/blob/master/datasets/README.md

AmgadHasan commented 1 month ago

Here is a draft generated by llama-3-70B.


Silero-VAD Dataset

This dataset was created with the support of the Innovation Promotion Fund as part of the federal project "Artificial Intelligence" of the national program "Digital Economy of the Russian Federation".

The links below point to .feather files containing Silero VAD labels for open audio datasets, along with a brief description of each dataset and loading examples. The .feather files can be opened with the pandas library:

import pandas as pd
dataframe = pd.read_feather(PATH_TO_FEATHER_FILE)

Each .feather file with labeling contains the following columns:

Columns containing information about loading audio files vary and are described for each dataset below.

All data is labeled at a time resolution of ~30 milliseconds (num_samples = 512).
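
As a quick sanity check on the ~30 ms figure, the window length in milliseconds is the number of samples divided by the sampling rate. The sketch below assumes 16 kHz audio, which this README does not state; `SAMPLE_RATE` and `frame_duration_ms` are illustrative names, not part of the dataset or of Silero VAD.

```python
# Assumption: audio is sampled at 16 kHz (not stated in the README).
SAMPLE_RATE = 16_000
NUM_SAMPLES = 512  # window size quoted in the README


def frame_duration_ms(num_samples: int, sample_rate: int) -> float:
    """Return the duration of one labeling window in milliseconds."""
    return 1000.0 * num_samples / sample_rate


print(frame_duration_ms(NUM_SAMPLES, SAMPLE_RATE))  # 32.0
```

Under that assumption each window covers 32 ms, which is consistent with the "~30 milliseconds" quoted above.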

| Name | Number of hours | Number of languages | Link | License | md5sum |
|---|---|---|---|---|---|
| Bible.is | 53,138 | 1,596 | URL | Unique | ea404eeaf2cd283b8223f63002be11f9 |
| globalrecordings.net | 9,743 | 6,171[^1] | URL | CC BY-NC-SA 4.0 | 3c5c0f31b0abd9fe94ddbe8b1e2eb326 |
| VoxLingua107 | 6,628 | 107 | URL | CC BY 4.0 | 5dfef33b4d091b6d399cfaf3d05f2140 |
| Common Voice | 30,329 | 120 | URL | CC0 | 5e30a85126adf74a5fd1496e6ac8695d |
| MLS | 50,709 | 8 | URL | CC BY 4.0 | a339d0e94bdf41bba3c003756254ac4e |
| Total | 150,547 | 6,171+ | | | |
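
The md5sum column can be used to check a download for corruption. The following is a minimal sketch, assuming the listed digests refer to the .feather files themselves; the local file name `bible_is.feather` and the helper names `md5_of_file` and `matches_md5` are hypothetical and chosen only for illustration.

```python
import hashlib
from pathlib import Path


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_md5(path: Path, expected: str) -> bool:
    """Compare a file's MD5 digest with the expected value, ignoring case."""
    return md5_of_file(path) == expected.strip().lower()


if __name__ == "__main__":
    # Hypothetical local file name; the digest is the Bible.is value from the table.
    feather_path = Path("bible_is.feather")
    expected = "ea404eeaf2cd283b8223f63002be11f9"
    if feather_path.exists():
        print("checksum OK" if matches_md5(feather_path, expected) else "checksum MISMATCH")
    else:
        print(f"{feather_path} not found; download it first")
```

Reading in chunks keeps memory use flat, which matters if the label files are large.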

Bible.is

Link to .feather file with labeling

globalrecordings.net

Link to .feather file with labeling

The number of unique ISO codes in this dataset does not match the actual number of languages represented, because some closely related languages may share the same ISO code.

VoxLingua107

Link to .feather file with labeling

Common Voice

Link to .feather file with labeling

This dataset cannot be downloaded via static links. To download it, follow the link, request access through the form there, and then download the archive for each available language. Note: the provided labeling is valid for version 16.1 of the original dataset.

MLS

Link to .feather file with labeling

License

This dataset is distributed under the CC BY-NC-SA 4.0 license.

Citation

@misc{Silero_VAD_Dataset,
  author = {Silero Team},
  title = {Silero-VAD Dataset: a large public Internet-scale dataset for voice activity detection for 6000+ languages},
  year = {2024},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/snakers4/silero-vad/datasets/README.md}},
  email = {hello@silero.ai}
}

[^1]: The number of unique ISO codes in this dataset does not match the actual number of languages represented, because some closely related languages may share the same ISO code.