This is great! I think we should start with mini_librispeech for the ASR one (but the order of doing ASR vs. other things is up to you guys.) @jimbozhang do you think you could commit to doing one of these?
I think I can make the mini_librispeech recipe, as a start.
Thanks!!
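For reference, a data preparation recipe in lhotse essentially boils down to scanning the corpus and producing recording and supervision manifests. Below is a rough sketch of what that could look like for mini_librispeech; the manifest classes (Recording, SupervisionSegment, RecordingSet, SupervisionSet) follow lhotse's API, but exact constructors and serialization methods may differ between versions, and the directory-scanning logic here is only illustrative.

```python
from pathlib import Path

from lhotse import Recording, RecordingSet, SupervisionSegment, SupervisionSet


def prepare_mini_librispeech(corpus_dir: str, output_dir: str):
    """Sketch of a data-prep recipe: build and save recording/supervision manifests."""
    corpus_dir, output_dir = Path(corpus_dir), Path(output_dir)

    recordings, supervisions = [], []
    # LibriSpeech-style layout: one *.trans.txt per chapter with "<utt-id> <text>" lines,
    # and one FLAC file per utterance next to it.
    for trans_path in corpus_dir.rglob("*.trans.txt"):
        for line in trans_path.read_text().splitlines():
            utt_id, text = line.split(maxsplit=1)
            speaker = utt_id.split("-")[0]
            recording = Recording.from_file(trans_path.parent / f"{utt_id}.flac")
            recordings.append(recording)
            supervisions.append(
                SupervisionSegment(
                    id=utt_id,
                    recording_id=recording.id,
                    start=0.0,
                    duration=recording.duration,
                    channel=0,
                    language="English",
                    speaker=speaker,
                    text=text,
                )
            )

    recording_set = RecordingSet.from_recordings(recordings)
    supervision_set = SupervisionSet.from_segments(supervisions)
    output_dir.mkdir(parents=True, exist_ok=True)
    recording_set.to_file(output_dir / "recordings.jsonl.gz")
    supervision_set.to_file(output_dir / "supervisions.jsonl.gz")
    return recording_set, supervision_set
```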
@pzelasko I am interested in making a data preparation example for speaker diarization. Would that fall under both source separation and speaker identification? Within Kaldi, the diarization recipes are essentially a combination of VAD and speaker identification.
I guess there are multiple ways to approach it. The "classical" pipeline indeed consists of two models: a VAD model and a speaker ID model. We have a VadDataset class, but not a SpeakerClassificationDataset/SpeakerIdentificationDataset yet. We also don't have recipes for any standard speaker ID corpora (e.g. VoxCeleb 1 and 2) yet - adding them is probably a good first step.
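For illustration, here is a minimal sketch of what such a SpeakerIdentificationDataset could look like. It assumes lhotse-style cuts with precomputed features (load_features()) and exactly one SupervisionSegment carrying a speaker field; the dataset class itself is hypothetical rather than part of the library.

```python
import torch
from torch.utils.data import Dataset


class SpeakerIdentificationDataset(Dataset):
    """Sketch only: maps each cut to (features, speaker label index)."""

    def __init__(self, cuts):
        self.cuts = list(cuts)
        # Build a stable speaker -> integer label mapping from the supervisions.
        speakers = sorted({cut.supervisions[0].speaker for cut in self.cuts})
        self.spk2idx = {spk: i for i, spk in enumerate(speakers)}

    def __len__(self):
        return len(self.cuts)

    def __getitem__(self, idx):
        cut = self.cuts[idx]
        feats = torch.from_numpy(cut.load_features())  # (num_frames, num_features)
        label = self.spk2idx[cut.supervisions[0].speaker]
        return feats, label
```

A DataLoader over it would still need a padding collate function, since cuts have different durations.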
On the other hand, PR #80 implements a DiarizationDataset loosely inspired by the TS-VAD paper (VAD + speaker ID in one step); it's still in progress though.
I will create a contributor guide very soon - if by then you're still interested, let us know.
Oh. I hadn't read that paper yet. It looks promising.
It would still be nice to have the "classical" pipeline implemented for comparison purposes, and as an example of how to port one's Kaldi speaker diarization datasets/recipes to lhotse + k2/NN architectures. I'll await the contribution guide.
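For comparison, the "classical" two-stage pipeline amounts to: run VAD to obtain speech segments, compute a speaker embedding per segment, and cluster the embeddings into speakers. Below is a rough, model-agnostic sketch; run_vad and embed_segment are placeholders for real models (e.g. an energy- or NN-based VAD and an x-vector extractor), not lhotse or Kaldi APIs, and the greedy clustering is only illustrative.

```python
import numpy as np


def run_vad(audio, sr):
    """Placeholder: return a list of (start_sec, end_sec) speech segments."""
    raise NotImplementedError  # plug in a real VAD model here


def embed_segment(audio, sr, start, end):
    """Placeholder: return a fixed-size speaker embedding for one segment."""
    raise NotImplementedError  # plug in a real speaker ID / x-vector model here


def diarize(audio, sr, sim_threshold=0.6):
    segments = run_vad(audio, sr)
    embeddings = [embed_segment(audio, sr, s, e) for s, e in segments]

    # Greedy clustering by cosine similarity to running cluster centroids.
    labels, centroids = [], []
    for emb in embeddings:
        emb = emb / (np.linalg.norm(emb) + 1e-8)
        sims = [float(np.dot(emb, c)) for c in centroids]
        if sims and max(sims) >= sim_threshold:
            k = int(np.argmax(sims))
            labels.append(k)
            # Update and renormalize the matched centroid.
            centroids[k] = centroids[k] + emb
            centroids[k] = centroids[k] / np.linalg.norm(centroids[k])
        else:
            labels.append(len(centroids))
            centroids.append(emb)

    # One speaker label per VAD segment: [(start, end, "spk0"), ...]
    return [(s, e, f"spk{k}") for (s, e), k in zip(segments, labels)]
```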
I'm closing this since we're tracking the corpora in the codebase and docs already. If anybody wants a new one added, please open a separate issue or submit a PR.
We should start creating example recipes for some datasets and tasks. I'll post an initial list here, and we can modify or extend it based on discussion. It's sorted by level of implementation difficulty.
- Source separation (see the sketch after this list)
- Speech enhancement - should be fairly simple to achieve with source separation in place; maybe it's even already possible.
- Speaker identification - could be simpler than ASR as the first Dataset to build, with supervision provided by SupervisionSegment.
- TTS
- ASR
- Wake word detection
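To make the source separation item above concrete, here is a rough sketch of a Dataset that pairs up single-speaker cuts and mixes their audio on the fly, returning (mixture, sources). It only assumes lhotse-style cuts exposing load_audio(); the class name, pairing scheme, and equal-gain mixing are illustrative, not part of the library.

```python
import numpy as np
import torch
from torch.utils.data import Dataset


class OnTheFlyMixtureDataset(Dataset):
    """Sketch only: mixes pairs of cuts into 2-speaker training examples."""

    def __init__(self, cuts):
        self.cuts = list(cuts)

    def __len__(self):
        return len(self.cuts)

    def __getitem__(self, idx):
        # Pair cut i with cut (i + 1) mod N; any smarter pairing scheme works too.
        a = self.cuts[idx].load_audio()[0]                          # first channel
        b = self.cuts[(idx + 1) % len(self.cuts)].load_audio()[0]
        n = min(len(a), len(b))                                     # truncate to the shorter source
        sources = np.stack([a[:n], b[:n]])                          # (num_sources, num_samples)
        mixture = sources.sum(axis=0)                               # equal-gain mixture
        return torch.from_numpy(mixture), torch.from_numpy(sources)
```

The same idea extends to speech enhancement by making one of the two sources a noise cut instead of a second speaker.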
Feel free to propose any other tasks and datasets.