This is great! I think we should start with mini_librispeech for the ASR one (but the order of doing ASR vs. other things is up to you guys.) @jimbozhang do you think you could commit to doing one of these?
I think I can make the mini_librispeech recipe, as a start.
Thanks!!
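For reference, a data preparation recipe in lhotse essentially boils down to scanning the corpus and producing recording and supervision manifests. Below is a rough sketch of what that could look like for mini_librispeech; the manifest classes (Recording, SupervisionSegment, RecordingSet, SupervisionSet) follow lhotse's API, but exact constructors and serialization methods may differ between versions, and the directory-scanning logic here is only illustrative.

```python
from pathlib import Path

from lhotse import Recording, RecordingSet, SupervisionSegment, SupervisionSet


def prepare_mini_librispeech(corpus_dir: str, output_dir: str):
    """Sketch of a data-prep recipe: build and save recording/supervision manifests."""
    corpus_dir, output_dir = Path(corpus_dir), Path(output_dir)

    recordings, supervisions = [], []
    # LibriSpeech-style layout: one *.trans.txt per chapter with "<utt-id> <text>" lines,
    # and one FLAC file per utterance next to it.
    for trans_path in corpus_dir.rglob("*.trans.txt"):
        for line in trans_path.read_text().splitlines():
            utt_id, text = line.split(maxsplit=1)
            speaker = utt_id.split("-")[0]
            recording = Recording.from_file(trans_path.parent / f"{utt_id}.flac")
            recordings.append(recording)
            supervisions.append(
                SupervisionSegment(
                    id=utt_id,
                    recording_id=recording.id,
                    start=0.0,
                    duration=recording.duration,
                    channel=0,
                    language="English",
                    speaker=speaker,
                    text=text,
                )
            )

    recording_set = RecordingSet.from_recordings(recordings)
    supervision_set = SupervisionSet.from_segments(supervisions)
    output_dir.mkdir(parents=True, exist_ok=True)
    recording_set.to_file(output_dir / "recordings.jsonl.gz")
    supervision_set.to_file(output_dir / "supervisions.jsonl.gz")
    return recording_set, supervision_set
```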
@pzelasko I am interested in making a data preparation example for speaker diarization. Would that fall under both source separation and speaker identification? Within Kaldi, the diarization recipes are essentially a combination of VAD and speaker identification.
I guess there are multiple ways to approach it. The "classical" pipeline indeed consists of two models: a VAD model and a speaker ID model. We have a VadDataset class, but not a SpeakerClassificationDataset/SpeakerIdentificationDataset yet. We also don't have recipes for any standard speaker ID corpora (e.g. VoxCeleb 1 and 2) yet - adding them is probably a good first step.
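For illustration, here is a minimal sketch of what such a SpeakerIdentificationDataset could look like. It assumes lhotse-style cuts with precomputed features (load_features()) and exactly one SupervisionSegment carrying a speaker field; the dataset class itself is hypothetical rather than part of the library.

```python
import torch
from torch.utils.data import Dataset


class SpeakerIdentificationDataset(Dataset):
    """Sketch only: maps each cut to (features, speaker label index)."""

    def __init__(self, cuts):
        self.cuts = list(cuts)
        # Build a stable speaker -> integer label mapping from the supervisions.
        speakers = sorted({cut.supervisions[0].speaker for cut in self.cuts})
        self.spk2idx = {spk: i for i, spk in enumerate(speakers)}

    def __len__(self):
        return len(self.cuts)

    def __getitem__(self, idx):
        cut = self.cuts[idx]
        feats = torch.from_numpy(cut.load_features())  # (num_frames, num_features)
        label = self.spk2idx[cut.supervisions[0].speaker]
        return feats, label
```

A DataLoader over it would still need a padding collate function, since cuts have different durations.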
On the other hand, PR #80 implements a DiarizationDataset loosely inspired by the TS-VAD paper (VAD + speaker ID in one step); it's still in progress though.
I will create a contributor guide very soon - if by then you're still interested, let us know.
Oh. I hadn't read that paper yet. It looks promising.
It would still be nice to have the "classical" pipeline implemented for comparison purposes, and as an example of how to port one's Kaldi speaker diarization datasets/recipes to lhotse + k2/NN architectures. I'll await the contribution guide.
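For comparison, the "classical" two-stage pipeline amounts to: run VAD to obtain speech segments, compute a speaker embedding per segment, and cluster the embeddings into speakers. Below is a rough, model-agnostic sketch; run_vad and embed_segment are placeholders for real models (e.g. an energy- or NN-based VAD and an x-vector extractor), not lhotse or Kaldi APIs, and the greedy clustering is only illustrative.

```python
import numpy as np


def run_vad(audio, sr):
    """Placeholder: return a list of (start_sec, end_sec) speech segments."""
    raise NotImplementedError  # plug in a real VAD model here


def embed_segment(audio, sr, start, end):
    """Placeholder: return a fixed-size speaker embedding for one segment."""
    raise NotImplementedError  # plug in a real speaker ID / x-vector model here


def diarize(audio, sr, sim_threshold=0.6):
    segments = run_vad(audio, sr)
    embeddings = [embed_segment(audio, sr, s, e) for s, e in segments]

    # Greedy clustering by cosine similarity to running cluster centroids.
    labels, centroids = [], []
    for emb in embeddings:
        emb = emb / (np.linalg.norm(emb) + 1e-8)
        sims = [float(np.dot(emb, c)) for c in centroids]
        if sims and max(sims) >= sim_threshold:
            k = int(np.argmax(sims))
            labels.append(k)
            # Update and renormalize the matched centroid.
            centroids[k] = centroids[k] + emb
            centroids[k] = centroids[k] / np.linalg.norm(centroids[k])
        else:
            labels.append(len(centroids))
            centroids.append(emb)

    # One speaker label per VAD segment: [(start, end, "spk0"), ...]
    return [(s, e, f"spk{k}") for (s, e), k in zip(segments, labels)]
```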
I'm closing this since we're tracking the corpora in the codebase and docs already. If anybody wants a new one added, please open a separate issue or submit a PR.
We should start creating example recipes for some datasets and tasks. I'll post an initial list here, and we can modify or extend it based on discussion. It's sorted by level of implementation difficulty.
- Source separation (see the sketch after this list)
- Speech enhancement - should be fairly simple to achieve with source separation in place; maybe it's even already possible.
- Speaker identification - could be simpler than ASR as the first Dataset to build, with supervision provided by SupervisionSegment.
- TTS
- ASR
- Wake word detection
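To make the source separation item above concrete, here is a rough sketch of a Dataset that pairs up single-speaker cuts and mixes their audio on the fly, returning (mixture, sources). It only assumes lhotse-style cuts exposing load_audio(); the class name, pairing scheme, and equal-gain mixing are illustrative, not part of the library.

```python
import numpy as np
import torch
from torch.utils.data import Dataset


class OnTheFlyMixtureDataset(Dataset):
    """Sketch only: mixes pairs of cuts into 2-speaker training examples."""

    def __init__(self, cuts):
        self.cuts = list(cuts)

    def __len__(self):
        return len(self.cuts)

    def __getitem__(self, idx):
        # Pair cut i with cut (i + 1) mod N; any smarter pairing scheme works too.
        a = self.cuts[idx].load_audio()[0]                          # first channel
        b = self.cuts[(idx + 1) % len(self.cuts)].load_audio()[0]
        n = min(len(a), len(b))                                     # truncate to the shorter source
        sources = np.stack([a[:n], b[:n]])                          # (num_sources, num_samples)
        mixture = sources.sum(axis=0)                               # equal-gain mixture
        return torch.from_numpy(mixture), torch.from_numpy(sources)
```

The same idea extends to speech enhancement by making one of the two sources a noise cut instead of a second speaker.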
Feel free to propose any other tasks and datasets.