lhotse-speech / lhotse

Tools for handling speech data in machine learning projects.
https://lhotse.readthedocs.io/en/latest/
Apache License 2.0

Add the Emilia corpus #1404

Closed csukuangfj closed 1 month ago

csukuangfj commented 1 month ago

The corpus can be downloaded from https://huggingface.co/datasets/amphion/Emilia-Dataset


Usage example

(py38) kuangfangjun:recipes$ mkdir ~/t5
(py38) kuangfangjun:recipes$ time lhotse prepare emilia --num-jobs 30 --lang de /star-fj/fangjun/data/tts/emilia/download/Amphion___Emilia /star-fj/fangjun/t5/
Processing /star-fj/fangjun/data/tts/emilia/download/Amphion___Emilia/raw/DE/DE_B00000.jsonl with 30 jobs: 166901it [00:41, 4024.66it/s]
Processing /star-fj/fangjun/data/tts/emilia/download/Amphion___Emilia/raw/DE/DE_B00001.jsonl with 30 jobs: 152274it [00:33, 4592.59it/s]
Processing /star-fj/fangjun/data/tts/emilia/download/Amphion___Emilia/raw/DE/DE_B00002.jsonl with 30 jobs: 167050it [00:49, 3391.75it/s]
Processing /star-fj/fangjun/data/tts/emilia/download/Amphion___Emilia/raw/DE/DE_B00003.jsonl with 30 jobs: 45364it [00:09, 4747.71it/s]
Processing /star-fj/fangjun/data/tts/emilia/download/Amphion___Emilia/raw/DE/DE_B00004.jsonl with 30 jobs: 48257it [00:35, 1378.46it/s]
Processing /star-fj/fangjun/data/tts/emilia/download/Amphion___Emilia/raw/DE/DE_B00005.jsonl with 30 jobs: 32706it [00:02, 13124.74it/s]
Processing /star-fj/fangjun/data/tts/emilia/download/Amphion___Emilia/raw/DE/DE_B00006.jsonl with 30 jobs: 22127it [00:05, 4183.53it/s]
Processing /star-fj/fangjun/data/tts/emilia/download/Amphion___Emilia/raw/DE/DE_B00007.jsonl with 30 jobs: 14712it [00:01, 13699.85it/s]
Processing /star-fj/fangjun/data/tts/emilia/download/Amphion___Emilia/raw/DE/DE_B00008.jsonl with 30 jobs: 3718it [00:00, 13919.69it/s]
Collecting futures: 100%|_________________________________________________________________________________| 653109/653109 [00:04<00:00, 144918.00it/s]

real    4m15.395s
user    4m30.221s
sys     0m57.310s

(Note: in the above example, only the audio files for DE_B00000.jsonl and DE_B00001.jsonl had been extracted to the corpus directory.)
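As a quick arithmetic check of the note, the item counts from the progress lines above can be summed. This is only a sketch: the numbers are copied from the log, and the dictionary name `shard_counts` is illustrative rather than anything used by lhotse. The first two shards together give the same number as the cut count that `lhotse cut describe` reports for the generated manifest, which appears consistent with only those two shards having audio on disk.

```python
# Item counts copied from the "Processing ... DE_B0000N.jsonl" progress lines.
shard_counts = {
    "DE_B00000": 166901,
    "DE_B00001": 152274,
    "DE_B00002": 167050,
    "DE_B00003": 45364,
    "DE_B00004": 48257,
    "DE_B00005": 32706,
    "DE_B00006": 22127,
    "DE_B00007": 14712,
    "DE_B00008": 3718,
}

# All entries seen while scanning the nine JSONL files.
total = sum(shard_counts.values())

# Entries belonging to the two shards whose audio was extracted.
extracted = shard_counts["DE_B00000"] + shard_counts["DE_B00001"]

print(total)      # 653109, the number of futures collected
print(extracted)  # 319175
```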

The generated files are:

(py38) kuangfangjun:recipes$ cd ~/t5
(py38) kuangfangjun:t5$ ls -lh
total 30M
-rw-r--r-- 1 kuangfangjun root 30M Oct 21 15:57 emilia_cuts_DE.jsonl.gz
(py38) kuangfangjun:t5$ lhotse cut describe ./emilia_cuts_DE.jsonl.gz

This prints:

Cut statistics:
╒═══════════════════════════╤═══════════╕
│ Cuts count:               │ 319175    │
├───────────────────────────┼───────────┤
│ Total duration (hh:mm:ss) │ 766:12:03 │
├───────────────────────────┼───────────┤
│ mean                      │ 8.6       │
├───────────────────────────┼───────────┤
│ std                       │ 5.0       │
├───────────────────────────┼───────────┤
│ min                       │ 3.0       │
├───────────────────────────┼───────────┤
│ 25%                       │ 4.7       │
├───────────────────────────┼───────────┤
│ 50%                       │ 7.2       │
├───────────────────────────┼───────────┤
│ 75%                       │ 11.3      │
├───────────────────────────┼───────────┤
│ 99%                       │ 24.2      │
├───────────────────────────┼───────────┤
│ 99.5%                     │ 28.0      │
├───────────────────────────┼───────────┤
│ 99.9%                     │ 29.6      │
├───────────────────────────┼───────────┤
│ max                       │ 30.0      │
├───────────────────────────┼───────────┤
│ Recordings available:     │ 319175    │
├───────────────────────────┼───────────┤
│ Features available:       │ 0         │
├───────────────────────────┼───────────┤
│ Supervisions available:   │ 319175    │
╘═══════════════════════════╧═══════════╛
SUPERVISION custom fields:
- dnsmos (in 319175 cuts)
Speech duration statistics:
╒══════════════════════════════╤═══════════╤══════════════════════╕
│ Total speech duration        │ 766:12:03 │ 100.00% of recording │
├──────────────────────────────┼───────────┼──────────────────────┤
│ Total speaking time duration │ 766:12:04 │ 100.00% of recording │
├──────────────────────────────┼───────────┼──────────────────────┤
│ Total silence duration       │ 00:00:01  │ 0.00% of recording   │
╘══════════════════════════════╧═══════════╧══════════════════════╛
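For readers who want to inspect the manifest without the lhotse CLI, here is a minimal standard-library sketch that counts cuts and sums durations, optionally keeping only cuts above a `dnsmos` threshold. It rests on an assumption about the on-disk layout that this page does not show: each line of the `.jsonl.gz` file is taken to be a JSON object with a top-level `duration` and a `supervisions` list whose items carry the score under `custom.dnsmos`. The function names are hypothetical and not part of lhotse.

```python
import gzip
import json


def format_hms(seconds):
    """Format seconds as hh:mm:ss, the layout of the 'Total duration' row."""
    total = int(round(seconds))
    hours, rem = divmod(total, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def summarize_manifest(path, min_dnsmos=None):
    """Return (cut count, total duration in seconds) for a gzipped JSONL manifest.

    ASSUMPTION: every line is a JSON object with a top-level "duration" and a
    "supervisions" list whose items may hold {"custom": {"dnsmos": <float>}}.
    """
    count = 0
    total = 0.0
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # tolerate blank lines
            cut = json.loads(line)
            if min_dnsmos is not None:
                scores = [
                    (sup.get("custom") or {}).get("dnsmos")
                    for sup in cut.get("supervisions", [])
                ]
                # Skip cuts with no score, or with any score under the threshold.
                if not scores or any(s is None or s < min_dnsmos for s in scores):
                    continue
            count += 1
            total += cut["duration"]
    return count, total
```

A call such as `summarize_manifest("emilia_cuts_DE.jsonl.gz", min_dnsmos=3.0)` would then return the number and total length of the cuts that pass the filter; the threshold 3.0 is an arbitrary illustration, not a recommended value. `format_hms` turns the returned seconds into the same hh:mm:ss layout used in the statistics above.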