pyannote / pyannote-audio

Neural building blocks for speaker diarization: speech activity detection, speaker change detection, overlapped speech detection, speaker embedding
http://pyannote.github.io
MIT License

How can I prepare dataset to reproduce pyannote/segmentation@2022.07 #1067

Closed · akuzeee closed this issue 1 year ago

akuzeee commented 1 year ago

Hi,

I would appreciate it if you could provide a procedure for preparing the dataset needed to reproduce pyannote/segmentation@2022.07.

In my understanding, the model was trained on a dataset built by concatenating three corpora, as mentioned in the paper: https://arxiv.org/pdf/2104.04045.pdf. I have found here how to prepare the AMI dataset, but have not found anything about the composite data.

hbredin commented 1 year ago

Thanks for your interest in my work.

More precisely, here are the datasets used for training:

- AMI, AISHELL-4, and VoxConverse 0.3 are freely available.
- REPERE and DIHARD3 are paid datasets.

By composite, all I meant was the union of their respective training sets (as described in the paper). I might prepare a script in the future to automate the data preparation, but this is definitely not at the top of my (long) TODO list.

Would you like to help contribute such a script? I'd be happy to help on the way!
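In the meantime, here is a minimal sketch of how the combined training set could be declared with pyannote.database's meta-protocol mechanism (the special `X` database), assuming each corpus is already registered with its own protocol in `database.yml`. The protocol names and subset choices below are placeholders, not the actual protocols used for the released model:

```yaml
# database.yml (sketch) -- the protocol names below are placeholders;
# replace them with the protocols you actually register for each corpus.
Protocols:
  X:
    SpeakerDiarization:
      Composite:
        train:
          AMI.SpeakerDiarization.only_words: [train]
          AISHELL.SpeakerDiarization.Full: [train]
          VoxConverse.SpeakerDiarization.VoxConverse: [train]
          REPERE.SpeakerDiarization.Phase2: [train]
          DIHARD.SpeakerDiarization.DIHARD3: [train]
        development:
          AMI.SpeakerDiarization.only_words: [development]
```

Once something like this is in place, `X.SpeakerDiarization.Composite` can be used wherever a regular protocol name is expected.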

akuzeee commented 1 year ago

Thank you for your very helpful response. I can't decide right away whether to purchase the paid datasets, so I'm not sure whether I can prepare such a script, but I will let you know if I make any progress.

hbredin commented 1 year ago

FYI, @FrenchKrab created a data preparation script for AISHELL-4. It is available here: https://github.com/FrenchKrab/aishell4-pyannote
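For anyone landing here later: once each corpus (or a combined meta-protocol such as the placeholder `X.SpeakerDiarization.Composite` sketched above) is exposed through pyannote.database, training a segmentation model looks roughly like the sketch below. This is only an illustration, not the exact recipe behind pyannote/segmentation@2022.07; the protocol name is assumed and the hyperparameters are illustrative.

```python
# Sketch only -- not the exact recipe behind pyannote/segmentation@2022.07.
# Assumes database.yml is discoverable (e.g. via the PYANNOTE_DATABASE_CONFIG
# environment variable) and that "X.SpeakerDiarization.Composite" is the
# placeholder meta-protocol combining the training sets listed above.
import pytorch_lightning as pl
from pyannote.database import FileFinder, get_protocol
from pyannote.audio.tasks import Segmentation
from pyannote.audio.models.segmentation import PyanNet

# resolve each file's audio path from the "Databases" section of database.yml
protocol = get_protocol(
    "X.SpeakerDiarization.Composite",
    preprocessors={"audio": FileFinder()},
)

# 5-second training chunks, as in the paper; other options left at their defaults
task = Segmentation(protocol, duration=5.0)
model = PyanNet(task=task)

trainer = pl.Trainer(accelerator="gpu", devices=1, max_epochs=100)
trainer.fit(model)
```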