lhotse-speech / lhotse

Tools for handling speech data in machine learning projects.
https://lhotse.readthedocs.io/en/latest/
Apache License 2.0

Add Chinese TTS dataset `baker`. #1304

Closed: csukuangfj closed this 5 months ago

csukuangfj commented 6 months ago

Example labelling file

000001  卡尔普#2陪外孙#1玩滑梯#4。
    ka2 er2 pu3 pei2 wai4 sun1 wan2 hua2 ti1
000002  假语村言#2别再#1拥抱我#4。
    jia2 yu3 cun1 yan2 bie2 zai4 yong1 bao4 wo3
000003  宝马#1配挂#1跛骡鞍#3,貂蝉#1怨枕#2董翁榻#4。
    bao2 ma3 pei4 gua4 bo3 luo2 an1 diao1 chan2 yuan4 zhen3 dong3 weng1 ta4
000004  邓小平#2与#1撒切尔#2会晤#4。
    deng4 xiao3 ping2 yu3 sa4 qie4 er3 hui4 wu4
000005  老虎#1幼崽#2与#1宠物犬#1玩耍#4。
    lao2 hu3 you4 zai3 yu2 chong3 wu4 quan3 wan2 shua3
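The labelling file above pairs each utterance line (ID plus text with `#N` prosody markers) with an indented pinyin line. A minimal sketch of how such a file could be parsed is shown here; the function name `parse_labels` and the regex for stripping the markers are assumptions for illustration, not the code from this PR.

```python
import re
from typing import Dict, Iterator, List

# Assumption: prosody boundaries are written as '#' followed by one digit.
PROSODY_MARK = re.compile(r"#\d")


def parse_labels(lines: List[str]) -> Iterator[Dict[str, str]]:
    """Yield one entry per (text line, pinyin line) pair."""
    # Drop blank lines, then walk the remaining lines two at a time.
    content = [ln for ln in lines if ln.strip()]
    for text_line, pinyin_line in zip(content[0::2], content[1::2]):
        # The ID and the text are separated by whitespace (tab or spaces).
        utt_id, text = text_line.strip().split(maxsplit=1)
        yield {
            "id": utt_id,
            "text": text,
            "pinyin": pinyin_line.strip(),
            "normalized_text": PROSODY_MARK.sub("", text),
        }


sample = [
    "000001  卡尔普#2陪外孙#1玩滑梯#4。",
    "    ka2 er2 pu3 pei2 wai4 sun1 wan2 hua2 ti1",
    "000002  假语村言#2别再#1拥抱我#4。",
    "    jia2 yu3 cun1 yan2 bie2 zai4 yong1 bao4 wo3",
]
entries = list(parse_labels(sample))
print(entries[0])
```

Stripping the markers gives the `normalized_text` value that appears in the supervision manifest, while `text` keeps the original prosody annotation.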

Example baker_zh_supervisions_all.jsonl

{"id": "000001", "recording_id": "000001", "start": 0.0, "duration": 2.66, "channel": 0, "text": "卡尔普#2陪外孙#1玩滑梯#4。", "language": "Chinese", "gender": "female", "custom": {"pinyin": "ka2 er2 pu3 pei2 wai4 sun1 wan2 hua2 ti1", "normalized_text": "卡尔普陪外孙玩滑梯。"}}
{"id": "000002", "recording_id": "000002", "start": 0.0, "duration": 2.86, "channel": 0, "text": "假语村言#2别再#1拥抱我#4。", "language": "Chinese", "gender": "female", "custom": {"pinyin": "jia2 yu3 cun1 yan2 bie2 zai4 yong1 bao4 wo3", "normalized_text": "假语村言别再拥抱我。"}}
{"id": "000003", "recording_id": "000003", "start": 0.0, "duration": 4.4, "channel": 0, "text": "宝马#1配挂#1跛骡鞍#3,貂蝉#1怨枕#2董翁榻#4。", "language": "Chinese", "gender": "female", "custom": {"pinyin": "bao2 ma3 pei4 gua4 bo3 luo2 an1 diao1 chan2 yuan4 zhen3 dong3 weng1 ta4", "normalized_text": "宝马配挂跛骡鞍,貂蝉怨枕董翁榻。"}}
{"id": "000004", "recording_id": "000004", "start": 0.0, "duration": 2.6, "channel": 0, "text": "邓小平#2与#1撒切尔#2会晤#4。", "language": "Chinese", "gender": "female", "custom": {"pinyin": "deng4 xiao3 ping2 yu3 sa4 qie4 er3 hui4 wu4", "normalized_text": "邓小平与撒切尔会晤。"}}
{"id": "000005", "recording_id": "000005", "start": 0.0, "duration": 3.09, "channel": 0, "text": "老虎#1幼崽#2与#1宠物犬#1玩耍#4。", "language": "Chinese", "gender": "female", "custom": {"pinyin": "lao2 hu3 you4 zai3 yu2 chong3 wu4 quan3 wan2 shua3", "normalized_text": "老虎幼崽与宠物犬玩耍。"}}

Example baker_zh_recordings_all.jsonl

{"id": "000001", "sources": [{"type": "file", "channels": [0], "source": "BZNSYP/Wave/000001.wav"}], "sampling_rate": 48000, "num_samples": 127680, "duration": 2.66, "channel_ids": [0]}
{"id": "000002", "sources": [{"type": "file", "channels": [0], "source": "BZNSYP/Wave/000002.wav"}], "sampling_rate": 48000, "num_samples": 137280, "duration": 2.86, "channel_ids": [0]}
{"id": "000003", "sources": [{"type": "file", "channels": [0], "source": "BZNSYP/Wave/000003.wav"}], "sampling_rate": 48000, "num_samples": 211200, "duration": 4.4, "channel_ids": [0]}
{"id": "000004", "sources": [{"type": "file", "channels": [0], "source": "BZNSYP/Wave/000004.wav"}], "sampling_rate": 48000, "num_samples": 124800, "duration": 2.6, "channel_ids": [0]}
{"id": "000005", "sources": [{"type": "file", "channels": [0], "source": "BZNSYP/Wave/000005.wav"}], "sampling_rate": 48000, "num_samples": 148320, "duration": 3.09, "channel_ids": [0]}

Cutset info

wc -l *
  10000 baker_zh_recordings_all.jsonl
  10000 baker_zh_supervisions_all.jsonl
  20000 total
lhotse cut simple -r ./baker_zh_recordings_all.jsonl.gz -s ./baker_zh_supervisions_all.jsonl.gz baker_zh_cuts.jsonl.gz
ls -lh
total 2.5M
-rw-r--r-- 1 kuangfangjun root 1.4M Mar 13 21:21 baker_zh_cuts.jsonl.gz
-rw-r--r-- 1 kuangfangjun root 127K Mar 13 21:19 baker_zh_recordings_all.jsonl.gz
-rw-r--r-- 1 kuangfangjun root 1.1M Mar 13 21:19 baker_zh_supervisions_all.jsonl.gz

lhotse cut describe ./baker_zh_cuts.jsonl.gz
Cut statistics:
╒═══════════════════════════╤══════════╕
│ Cuts count:               │ 10000    │
├───────────────────────────┼──────────┤
│ Total duration (hh:mm:ss) │ 11:51:21 │
├───────────────────────────┼──────────┤
│ mean                      │ 4.3      │
├───────────────────────────┼──────────┤
│ std                       │ 1.3      │
├───────────────────────────┼──────────┤
│ min                       │ 1.4      │
├───────────────────────────┼──────────┤
│ 25%                       │ 3.2      │
├───────────────────────────┼──────────┤
│ 50%                       │ 4.2      │
├───────────────────────────┼──────────┤
│ 75%                       │ 5.2      │
├───────────────────────────┼──────────┤
│ 99%                       │ 7.0      │
├───────────────────────────┼──────────┤
│ 99.5%                     │ 7.3      │
├───────────────────────────┼──────────┤
│ 99.9%                     │ 7.7      │
├───────────────────────────┼──────────┤
│ max                       │ 8.3      │
├───────────────────────────┼──────────┤
│ Recordings available:     │ 10000    │
├───────────────────────────┼──────────┤
│ Features available:       │ 0        │
├───────────────────────────┼──────────┤
│ Supervisions available:   │ 10000    │
╘═══════════════════════════╧══════════╛
SUPERVISION custom fields:
Speech duration statistics:
╒══════════════════════════════╤══════════╤══════════════════════╕
│ Total speech duration        │ 11:51:21 │ 100.00% of recording │
├──────────────────────────────┼──────────┼──────────────────────┤
│ Total speaking time duration │ 11:51:21 │ 100.00% of recording │
├──────────────────────────────┼──────────┼──────────────────────┤
│ Total silence duration       │ 00:00:01 │ 0.00% of recording   │
╘══════════════════════════════╧══════════╧══════════════════════╛
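The reported figures can be sanity-checked against each other: the total duration divided by the number of cuts should give the reported mean. This is a sketch of that arithmetic only; the helpers `parse_hms` and `format_hms` are hypothetical, and how `lhotse cut describe` rounds its values internally is not assumed here.

```python
def parse_hms(text: str) -> int:
    """Convert 'hh:mm:ss' into a number of seconds."""
    hours, minutes, secs = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + secs


def format_hms(total_seconds: float) -> str:
    """Convert a number of seconds into 'hh:mm:ss'."""
    seconds = int(round(total_seconds))
    hours, remainder = divmod(seconds, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


total_seconds = parse_hms("11:51:21")  # total duration from the table
num_cuts = 10000                       # cuts count from the table
mean_duration = total_seconds / num_cuts
print(total_seconds, round(mean_duration, 1))
```

With 42681 seconds over 10000 cuts, the mean is about 4.27 seconds, which agrees with the 4.3 shown in the table.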

pzelasko commented 5 months ago

Thanks, I missed this somehow.