kaldi-asr / kaldi

Is there any speaker diarization documentation and already trained model? #2523

Closed bwang482 closed 5 years ago

bwang482 commented 6 years ago

Hi there, thanks for Kaldi :)

I want to perform speaker diarization on a set of audio recordings. I believe Kaldi recently added a speaker diarization feature. I managed to find this link, but there is so little documentation that I have not been able to figure out how to use it. Also, is there a pre-trained model for English conversations that I can use off the shelf?

Thanks a lot!

anderleich commented 4 years ago

That's why I don't understand why specifying two speakers for recording1 gives the error I mentioned. If I update the reco2num_spk file to the following content, it works. However, it seems strange to say that each utterance has two speakers.


```
utt_0001 2
utt_0002 2
utt_0003 2
...
```

anderleich commented 4 years ago

Another thing to note is that the timestamps in the final RTTM differ from those in the segments file, so the two don't line up exactly. Is there a tool to map the RTTM information back onto the segments file?
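
For reference, here is a minimal sketch of such a mapping (this is not an official Kaldi tool; the paths and the utterance-id scheme are made up). Each SPEAKER line in the RTTM carries a recording-id, onset, duration, and speaker label, which can be rewritten as a segments entry:

```sh
# Rewrite RTTM SPEAKER lines as a Kaldi segments file.
# RTTM fields: $1=type, $2=recording-id, $4=onset, $5=duration, $8=speaker.
awk '$1 == "SPEAKER" {
  reco = $2; start = $4; end = $4 + $5; spk = $8;
  # Hypothetical sub-segment utterance-id, e.g. utt_0001-1-000012-000345
  utt = sprintf("%s-%s-%06d-%06d", reco, spk, start * 100, end * 100);
  print utt, reco, start, end;
}' exp/diarization/rttm > data/diarized/segments
```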

danpovey commented 4 years ago

I don't follow this stuff super closely. There could be a bug when you specify one speaker (because if there is only one speaker you don't really need diarization, so the code might not have been tested for that case).

anderleich commented 4 years ago

I didn't specify just one speaker; I'm trying to do two-speaker diarization. However, the recording-id seems to be the utterance-id, as set in the first column of the segments file.

danpovey commented 4 years ago

I think you are correct that the recording-id, as referred to in the diarization code, is really the utterance-id, and what are called "utterances" in that code may be sub-segments of utterances. The idea may be that, at the start, your utterances and recordings are one and the same.
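
Concretely, a hypothetical data directory before diarization might look like this, with the utterance-id and recording-id columns of the segments file coinciding; the diarization scripts then cut these rows into sub-segments with derived ids:

```
# segments: <utterance-id> <recording-id> <start> <end>
utt_0001 utt_0001 0.00 124.50
utt_0002 utt_0002 0.00 98.20
```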

anderleich commented 4 years ago

Well, thank you! I finally managed to cluster the speakers. However, the results are not good. I guess this model, http://kaldi-asr.org/models/m6, is based on telephone conversations. For a more general, open-domain scenario, should I train a model on my own data? If so, which is the most straightforward recipe? Thanks

danpovey commented 4 years ago

Mm. It's possible to adapt these systems while retaining the same x-vector extractor by retraining the PLDA on your own data, but you need speaker-labeled data. Sorry, I don't recall where an example of that would be.
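
For the record, here is a minimal sketch of that kind of PLDA retraining, loosely following the egs/callhome_diarization/v2 recipe (the directory names are hypothetical, and it assumes x-vectors have already been extracted for your speaker-labeled adaptation data):

```sh
# Train a new PLDA model on in-domain, speaker-labeled x-vectors,
# applying the same mean subtraction, LDA transform, and length
# normalization that are used at scoring time.
ivector-compute-plda ark:data/adapt/spk2utt \
  "ark:ivector-subtract-global-mean scp:exp/xvectors_adapt/xvector.scp ark:- \
     | transform-feats exp/xvectors_adapt/transform.mat ark:- ark:- \
     | ivector-normalize-length ark:- ark:- |" \
  exp/xvectors_adapt/plda
```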

00001101-xt commented 4 years ago

You only need to use one of those PLDA models for your system. Also, if you have enough in-domain training data, you'll have better results training a new PLDA model. If your data is wideband microphone data, you might even have better luck using a different x-vector system, such as this one: http://kaldi-asr.org/models/m7. It was developed for speaker recognition, but it should work just fine for diarization as well.

In the egs/callhome_diarization, we split the evaluation dataset into two halves so that we can use one half as a development set for the other half. Callhome is split into callhome1 and callhome2. We then train a PLDA backend (let's call it backend1) on callhome1, and tune the stopping threshold so that it minimizes the error on callhome1. Then backend1 is used to diarize callhome2. Next, we do the same thing for callhome2: backend2 is developed on callhome2, and evaluated on callhome1. The concatenation at the end is so that we can evaluate on the entire dataset. It doesn't matter that the two backends would assign different labels to different speakers, since they diarized different recordings.
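
As a concrete illustration of that two-fold scheme (a sketch; the directory names are made up), the per-fold system RTTMs are simply concatenated and scored once against the reference for the whole dataset:

```sh
# backend1 was tuned on callhome1 and used to diarize callhome2;
# backend2 the other way around. Concatenate both outputs and score
# them together against the full reference RTTM.
cat exp/backend1/callhome2_rttm exp/backend2/callhome1_rttm \
  > exp/fullset_rttm
md-eval.pl -1 -c 0.25 -r data/callhome/fullref.rttm -s exp/fullset_rttm
```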

Regarding the short segment, I think the issue is that your SAD has determined that there's a speech segment from 24.99 to 25.43 and a separate speech segment starting at 25.51. It might be a good idea to smooth these SAD decisions earlier in the pipeline (e.g., in your SAD system itself) to avoid having adjacent segments with small gaps between them. Increasing the min-segment threshold might cause the diarization system to throw out this segment, but to me it seems preferable to keep it, and just merge it with the adjacent segment. But this stuff requires a lot of tuning to get right, and it's hard to say what the optimal strategy is without playing with the data myself.
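
For what it's worth, here is a minimal sketch of that kind of gap merging on a Kaldi segments file (the 0.1 s threshold and the paths are made up, and a real pipeline would also need to regenerate utt2spk and other derived files):

```sh
# Merge adjacent SAD segments on the same recording whenever the gap
# between them is at most max_gap seconds; merged segments keep the
# utterance-id of their first member.
sort -k2,2 -k3,3n data/sad/segments | awk -v max_gap=0.1 '
  {
    if ($2 == reco && $3 - end <= max_gap) {
      end = $4;    # small gap: extend the current segment
    } else {
      if (reco != "") print utt, reco, start, end;
      utt = $1; reco = $2; start = $3; end = $4;
    }
  }
  END { if (reco != "") print utt, reco, start, end }
' > data/sad_smoothed/segments
```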

By the way, what is this "nasa_telescopes" dataset you're using?

@david-ryan-snyder Hi David, if I want to train an 8 kHz SID model without much data of that kind, is downsampling the wideband data a possible solution?

Thanks in advance.
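
(A hedged note on the downsampling the question refers to: it can be done on the fly in wav.scp via sox, as in this hypothetical entry, so the wideband corpus is read as 8 kHz audio.)

```
utt_0001 sox /corpus/utt_0001_16k.wav -t wav - rate 8000 |
```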

talal-sen commented 3 years ago

Hi, I am still new to Kaldi. I would like to perform diarization on some speech samples from my own dataset, which have no speaker labels available, so I would have to listen and compare against what the diarization outputs. I have a question about this:

a) Does it make sense to use a pre-trained model, such as the callhome_v2 model, given that the recording conditions, dialect, and possibly the language may differ? Or are we assuming that the pre-trained model has learned generalizable features (x-vectors) and can therefore work well even on an unseen dataset?

Thanks in advance

maham7621 commented 3 years ago

Hi, I am working on speaker diarization and I have gotten as far as the clustering step; the rttm file is created successfully. After that, when I try to score the PLDA clustering with this code:

```sh
if [ $stage -le 10 ]; then
  mkdir -p $nnet_dir/results
  cat $nnet_dir/xvectors_train/plda_scores_num_speakers/rttm \
    | md-eval.pl -1 -c 0.25 \
        -r $nnet_dir/xvectors_train/plda_scores_num_speakers/rttm \
        -s - 2> $nnet_dir/results/num_spk.log \
    > $nnet_dir/results/DER_num_spk.txt
  der=$(grep -oP 'DIARIZATION\ ERROR\ =\ \K[0-9]+([.][0-9]+)?' \
    $nnet_dir/results/DER_num_spk.txt)
  echo "Using the oracle number of speakers, DER: $der%"
fi
```

it gives me the output "Using the oracle number of speakers, DER: 0.00%". What is the reason for this zero, or is it fine?

mmaciej2 commented 3 years ago

@maham7621 You are using the md-eval.pl scoring script to score the rttm against itself. They are the same file, so there is 0% error between them.
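
A minimal sketch of the intended usage, with a hand-labeled reference RTTM (the reference path here is hypothetical):

```sh
# -r takes the ground-truth reference RTTM and -s the system output;
# scoring a file against itself always yields 0% DER.
md-eval.pl -1 -c 0.25 \
  -r data/eval/ref.rttm \
  -s $nnet_dir/xvectors_train/plda_scores_num_speakers/rttm \
  2> $nnet_dir/results/num_spk.log > $nnet_dir/results/DER_num_spk.txt
grep -oP 'DIARIZATION\ ERROR\ =\ \K[0-9]+([.][0-9]+)?' \
  $nnet_dir/results/DER_num_spk.txt
```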