kaldi-asr / kaldi

kaldi-asr/kaldi is the official location of the Kaldi project.
http://kaldi-asr.org

Is there any speaker diarization documentation and already trained model? #2523

Closed bwang482 closed 5 years ago

bwang482 commented 6 years ago

Hi there, thanks for Kaldi :)

I want to perform speaker diarization on a set of audio recordings. I believe Kaldi recently added the speaker diarization feature. I have managed to find this link, but I have not been able to figure out how to use it, since there is very little documentation. Also, may I ask whether there is any already-trained model for conversations in English that I can use off-the-shelf, please?

Thanks a lot!

iacoshoria commented 6 years ago

Any updates on this? I'm also looking for proper documentation/usage examples. I'm mostly interested in some hints on how to get the pre-trained model running.

david-ryan-snyder commented 6 years ago

@bluemonk482 there is a pretrained model at http://kaldi-asr.org/models/m6.

We agree that there needs to be better documentation for this. We're discussing how best to do this.

@iacoshoria For now, the best usage example for the pretrained model is the recipe that generated it: https://github.com/kaldi-asr/kaldi/blob/master/egs/callhome_diarization/v2/run.sh . Have you looked at this recipe? Could you tell us what you've tried and where (e.g., which stage) you're getting lost?

iacoshoria commented 6 years ago

@david-ryan-snyder I've looked at the recipe, the issue I'm facing is that I don't have access to the LDC corpora dataset, and the recipe is somewhat bound to that specific data.

Since the x-vector DNN in the model is already trained, I'm currently trying to run the recipe against my own test data (random audio recordings with multiple speakers).

I've found that the only stages I need to run are computing features and extracting and clustering x-vectors, but I'm having difficulty getting the data into the required format.

Are there any examples/hints for getting something similar to work? I would greatly appreciate it.

david-ryan-snyder commented 6 years ago

@iacoshoria the recipe is not bound to this dataset. We are talking about making a diarization recipe based on some freely available dataset (such as AMI), but that will probably not be in the main branch of Kaldi any time soon (@mmaciej2, do you have an easy to follow AMI recipe somewhere you can point to?).

I recently ran diarization on the "Speakers in the Wild" dataset. I'll show you a few lines of each of the files I created in order to diarize it. I will then go over the steps needed to diarize a generic dataset. Hopefully you can generalize this to your own.

wav.scp
This file is easy to create if you are familiar with Kaldi data preparation. It's in the form of <recording-id> <wav-file>. In my directory, it happens to look like this:

aahtm sox -t flac /export/common/data/corpora/NIST/SRE/sitw_database.v4/dev/audio/aahtm.flac -t wav -r 16k -b 16 - channels 1 |
aaoao sox -t flac /export/common/data/corpora/NIST/SRE/sitw_database.v4/dev/audio/aaoao.flac -t wav -r 16k -b 16 - channels 1 |
abvwc sox -t flac /export/common/data/corpora/NIST/SRE/sitw_database.v4/dev/audio/abvwc.flac -t wav -r 16k -b 16 - channels 1 |
adfcc sox -t flac /export/common/data/corpora/NIST/SRE/sitw_database.v4/dev/audio/adfcc.flac -t wav -r 16k -b 16 - channels 1 |
adzuy sox -t flac /export/common/data/corpora/NIST/SRE/sitw_database.v4/dev/audio/adzuy.flac -t wav -r 16k -b 16 - channels 1 |

segments
This will be the hardest file to generate. It's of the form <segment-id> <recording-id> <start-time> <end-time>. The egs/callhome_diarization recipe is concerned only with speaker diarization, and we assume that this file has already been created by a speech activity detection (SAD) system. However, in general, you will need to create this file yourself. You have several options. The easiest option (which will probably be perfectly adequate on relatively clean audio) is to use the energy-based SAD (e.g., run sid/compute_vad_decision.sh) to create frame-level speech/nonspeech decisions. Contiguous speech frames are then combined into the segments you need using diarization/vad_to_segments.sh (see the sketch below). Another option is to use an off-the-shelf SAD system. @vimalmanohar uploaded a model here: http://kaldi-asr.org/models/m4. Bear in mind you'd need to create a separate set of features for this system.
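
For reference, a minimal sketch of the energy-based SAD route might look like the following (the exact options are assumptions; check the scripts' usage messages and egs/callhome_diarization/v2/run.sh):

sid/compute_vad_decision.sh --nj 40 --cmd "$train_cmd_intel" \
  data/$name exp/make_vad $vaddir
diarization/vad_to_segments.sh --nj 40 --cmd "$train_cmd_intel" \
  data/$name data/${name}_segmented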

Regardless of what you use to compute segments, the file should look something like this:

aahtm_0000 aahtm 0.03 2.17
aahtm_0001 aahtm 3.15 5.33
aahtm_0002 aahtm 6.79 7.78
aahtm_0003 aahtm 8.15 10.15
aahtm_0004 aahtm 10.48 13.45

utt2spk
A file of the form <segment-id> <recording-id>, which you can create by just running awk '{print $1, $2}' segments > utt2spk

aahtm_0000 aahtm
aahtm_0001 aahtm
aahtm_0002 aahtm
aahtm_0003 aahtm
aahtm_0004 aahtm

spk2utt
This file is created from the utt2spk file by running utils/utt2spk_to_spk2utt.pl utt2spk > spk2utt. Or you can run utils/fix_data_dir.sh on your directory.

aahtm aahtm_0000 aahtm_0001 aahtm_0002 aahtm_0003 aahtm_0004 aahtm_0005 aahtm_0006 aahtm_0007 aahtm_0008 aahtm_0009 aahtm_0010 aahtm_0011 aahtm_0012 aahtm_0013 aahtm_0014 aahtm_0015 aahtm_0016 aahtm_0017 aahtm_0018 aahtm_0019 aahtm_0020 aahtm_0021 aahtm_0022 aahtm_0023 aahtm_0024 aahtm_0025 aahtm_0026 aahtm_0027 aahtm_0028 aahtm_0029

Now that your data is prepared, I'll try to walk you through the remaining steps. I'm using the variable $name in place of your dataset, so you might be able to just copy and paste these lines of code, and set the $name variable to whatever your dataset is called.

Make Features
You'll need to run steps/make_mfcc.sh using the appropriate MFCC configuration given by the model you're using (it'll be in the conf directory).

steps/make_mfcc.sh --mfcc-config conf/mfcc.conf --nj 60 \
  --cmd "$train_cmd_intel" --write-utt2num-frames true \
  data/$name exp/make_mfcc $mfccdir
utils/fix_data_dir.sh data/$name

Then run local/nnet3/xvector/prepare_feats.sh to apply sliding window CMVN to the data and dump it to disk.

local/nnet3/xvector/prepare_feats.sh --nj 60 --cmd "$train_cmd_intel" \
  data/$name data/${name}_cmn exp/${name}_cmn

Copy the segments file to the new mean-normalized data directory:

cp data/$name/segments data/${name}_cmn/
utils/fix_data_dir.sh data/${name}_cmn

Extract Embeddings
Now we're going to extract embeddings from subsegments of the segments file. I usually use the configuration below, where the embeddings are extracted from an (at most) 1.5 second window, with a 50% overlap between subsegments (since the period is 0.75 of a second).

diarization/nnet3/xvector/extract_xvectors.sh --cmd "$train_cmd_intel --mem 5G" \
  --nj 60 --window 1.5 --period 0.75 --apply-cmn false \
  --min-segment 0.5 $nnet_dir \
  data/${name}_cmn $nnet_dir/xvectors_${name}

Perform PLDA scoring
Now that we have many embeddings extracted from short overlapping windows of the recordings, we use PLDA to compute their pair-wise similarity. If you use a pretrained model, look in the README to see where a PLDA model lives.

diarization/nnet3/xvector/score_plda.sh --cmd "$train_cmd_intel --mem 4G" \
  --target-energy 0.9 --nj 20 $nnet_dir/xvectors_plda/ \
  $nnet_dir/xvectors_$name \
  $nnet_dir/xvectors_$name/plda_scores

Cluster Speakers
The previous step gave you a matrix of pair-wise similarity scores between the segments. This next step performs agglomerative hierarchical clustering to partition the recording into speech belonging to different speakers.

If you know how many speakers are in your recordings (say it's summed-channel telephone speech, so you can assume there are probably 2 speakers), you can supply a file called reco2num_spk to the option --reco2num-spk. This is a file of the form <recording-id> <number-of-speakers-in-that-recording>.

diarization/cluster.sh --cmd "$train_cmd_intel --mem 4G" --nj 20 \
  --reco2num-spk data/$name/reco2num_spk \
  $nnet_dir/xvectors_$name/plda_scores \
  $nnet_dir/xvectors_$name/plda_scores_num_speakers
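
For illustration, with the recording IDs from the wav.scp above, a reco2num_spk file would look something like this (the speaker counts here are made up):

aahtm 2
aaoao 2
abvwc 3
adfcc 2
adzuy 2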

Or, you may not know how many speakers are in a recording (say it's a random online video). Then you'll need to specify a threshold at which to stop clustering. E.g., once the pair-wise similarity of the embeddings drops below this threshold, stop clustering. You might obtain this by finding a threshold that minimizes the diarization error rate (DER) on a development set. But, this won't be possible if you don't have segment-level labels for a dataset. If you don't have these labels, @dpovey suggested clustering a set of in-domain data, and tuning the threshold until it gives you the average number of speakers per recording that you expect (e.g., you might expect that there's on average 2 speakers per recording, but sometimes more or less).

diarization/cluster.sh --cmd "$train_cmd_intel --mem 4G" --nj 40 \
  --threshold $threshold \
  $nnet_dir/xvectors_$name/plda_scores \
  $nnet_dir/xvectors_$name/plda_scores_threshold_${threshold}

Diarized Speech
If everything went well, you should have a file called rttm in the directory $nnet_dir/xvectors_$name/plda_scores_threshold_${threshold}/. The 2nd column is the recording ID, the 4th column is the start time of a segment, and the 5th is its duration. The 8th column is the speaker label assigned to that segment.

SPEAKER mfcny 0  86.200 16.400 <NA> <NA> 1 <NA> <NA>
SPEAKER mfcny 0 103.050  5.830 <NA> <NA> 1 <NA> <NA>
SPEAKER mfcny 0 109.230  4.270 <NA> <NA> 1 <NA> <NA>
SPEAKER mfcny 0 113.760  8.625 <NA> <NA> 1 <NA> <NA>
SPEAKER mfcny 0 122.385  4.525 <NA> <NA> 2 <NA> <NA>
SPEAKER mfcny 0 127.230  6.230 <NA> <NA> 2 <NA> <NA>
SPEAKER mfcny 0 133.820  0.850 <NA> <NA> 2 <NA> <NA>
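
As a quick sanity check, you can turn the RTTM into a simple "recording start end speaker" listing with a one-liner like this (a sketch assuming the column layout above, where column 4 is the start time and column 5 the duration):

awk '{printf("%s %.2f %.2f speaker%s\n", $2, $4, $4+$5, $8)}' rttm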

mmaciej2 commented 6 years ago

@david-ryan-snyder I do not currently have an easy-to-follow AMI recipe. I was in the process of reworking it to be fairly simple and use VoxCeleb training data when I got distracted by more-pressing work.

There is an AMI recipe here: https://github.com/mmaciej2/kaldi/tree/ami-diarization/egs/ami_diarizaiton/v1 I won't guarantee anything, but I think it does run end-to-end and produce very mediocre results (mainly due to AMI alone being insufficient for training). I think David's above description of input files is probably the best resource here on how to get a system running, though.

iacoshoria commented 6 years ago

@david-ryan-snyder Thank you for this very comprehensive guide in getting things to work! It was really helpful.

I started off on the wrong foot right in the data preparation step, since I needed a version of the data that doesn't know anything about speech segments in order to run the segmentation step. That meant putting just a simple mapping from recording_id to recording_id in both utt2spk and spk2utt.
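
For anyone else doing the same thing, that dummy mapping just looks like this (the recording IDs here are placeholders):

rec1 rec1
rec2 rec2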

One more thing: I'm using your pre-trained model, which has the PLDA models split between the two data halves, callhome_1 and callhome_2. I ran the PLDA scoring against both models, and I get similar results for both of them, but different labels for each speaker.

In the recipe there's a final step that combines the results from the two halves and evaluates them together; should that be the case here as well? https://github.com/kaldi-asr/kaldi/blob/master/egs/callhome_diarization/v2/run.sh#L339-L352

Below are snippets from the two result sets:

SPEAKER nasa_telescopes 0   0.000   7.000 <NA> <NA> 5 <NA> <NA>
SPEAKER nasa_telescopes 0   7.080   9.090 <NA> <NA> 5 <NA> <NA>
SPEAKER nasa_telescopes 0  16.380   4.150 <NA> <NA> 5 <NA> <NA>
SPEAKER nasa_telescopes 0  20.590   4.360 <NA> <NA> 5 <NA> <NA>
SPEAKER nasa_telescopes 0  24.990   0.440 <NA> <NA> 1 <NA> <NA>
SPEAKER nasa_telescopes 0  25.510  26.360 <NA> <NA> 5 <NA> <NA>
SPEAKER nasa_telescopes 0   0.000   7.000 <NA> <NA> 2 <NA> <NA>
SPEAKER nasa_telescopes 0   7.080   8.625 <NA> <NA> 2 <NA> <NA>
SPEAKER nasa_telescopes 0  15.705   0.465 <NA> <NA> 6 <NA> <NA>
SPEAKER nasa_telescopes 0  16.380   4.150 <NA> <NA> 2 <NA> <NA>
SPEAKER nasa_telescopes 0  20.590   4.360 <NA> <NA> 2 <NA> <NA>
SPEAKER nasa_telescopes 0  24.990   0.440 <NA> <NA> 4 <NA> <NA>
SPEAKER nasa_telescopes 0  25.510  26.360 <NA> <NA> 2 <NA> <NA>

So, my question is: Should I only run the evaluation against a single PLDA model?

Also, there are a few false positives, such as the following (from the first example):

SPEAKER nasa_telescopes 0  24.990   0.440 <NA> <NA> 1 <NA> <NA>

Mostly short, under half a second. Should I increase the window/min-segment threshold, to filter out such entries?

david-ryan-snyder commented 6 years ago

You only need to use one of those PLDA models for your system. Also, if you have enough in-domain training data, you'll have better results training a new PLDA model. If your data is wideband microphone data, you might even have better luck using a different x-vector system, such as this one: http://kaldi-asr.org/models/m7. It was developed for speaker recognition, but it should work just fine for diarization as well.
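
If you do train an in-domain PLDA model, a rough sketch along the lines of the callhome recipe would be the following, where $plda_xvec_dir is a placeholder for a directory of x-vectors extracted from labeled in-domain data, with a spk2utt and whitening transform (transform.mat) already prepared; see egs/callhome_diarization/v2/run.sh for the exact commands:

"$train_cmd_intel" $plda_xvec_dir/log/plda.log \
  ivector-compute-plda ark:$plda_xvec_dir/spk2utt \
  "ark:ivector-subtract-global-mean scp:$plda_xvec_dir/xvector.scp ark:- | transform-vec $plda_xvec_dir/transform.mat ark:- ark:- | ivector-normalize-length ark:- ark:- |" \
  $plda_xvec_dir/plda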

In the egs/callhome_diarization, we split the evaluation dataset into two halves so that we can use one half as a development set for the other half. Callhome is split into callhome1 and callhome2. We then train a PLDA backend (let's call it backend1) on callhome1, and tune the stopping threshold so that it minimizes the error on callhome1. Then backend1 is used to diarize callhome2. Next, we do the same thing for callhome2: backend2 is developed on callhome2, and evaluated on callhome1. The concatenation at the end is so that we can evaluate on the entire dataset. It doesn't matter that the two backends would assign different labels to different speakers, since they diarized different recordings.

Regarding the short segment, I think the issue is that your SAD has determined that there's a speech segment from 24.99 to 25.43 and a separate speech segment starting at 25.51. It might be a good idea to smooth these SAD decisions earlier in the pipeline (e.g., in your SAD system itself) to avoid having adjacent segments with small gaps between them. Increasing the min-segment threshold might cause the diarization system to throw out this segment, but to me it seems preferable to keep it, and just merge it with the adjacent segment. But this stuff requires a lot of tuning to get right, and it's hard to say what the optimal strategy is without playing with the data myself.
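
As a very rough illustration of that kind of smoothing (not a Kaldi script; it assumes the segments file is sorted by recording and start time, the field order is <segment-id> <recording-id> <start> <end>, and the 0.5 second gap is an arbitrary choice), you could merge adjacent segments of the same recording like this:

awk -v gap=0.5 '
  $2 == reco && $3 - end < gap { end = $4; next }   # gap is small: extend the current merged segment
  reco != "" { printf("%s-%07d-%07d %s %.2f %.2f\n", reco, int(start*100+0.5), int(end*100+0.5), reco, start, end) }
  { reco = $2; start = $3; end = $4 }               # start a new merged segment
  END { if (reco != "") printf("%s-%07d-%07d %s %.2f %.2f\n", reco, int(start*100+0.5), int(end*100+0.5), reco, start, end) }
' segments > segments_merged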

By the way, what is this "nasa_telescopes" dataset you're using?

iacoshoria commented 6 years ago

Thank you for the suggested approach. I will try switching the SAD system altogether, seeing that the noise comes in early, at the segmentation step.

The data is some random video I found that has clean segments of voice from different speakers (e.g., https://youtu.be/UkaNtpmoVI0?t=4250). I believe this video was one of the best-case scenarios I could find to test the diarization on. But I still need to do some processing to get it into the required format for the PLDA model (1 channel, 8k sample rate). And seeing that I don't know in advance how many speakers are in the recording, I need a good, generic way to determine the threshold that minimizes the diarization error rate.

Thank you for your support!

akshatdewan commented 5 years ago

Hi there! I am trying to test the pre-trained model (http://kaldi-asr.org/models/m7) on 16 kHz speech audio, but I am getting an error when I run extract_xvectors.sh, so I ran nnet3-copy in isolation and got the following message:

 dewan@dewan-desktop:/data/s2t/speech_tools/kaldi/egs/callhome_diarization/v2$ nnet3-copy --nnet-config=models/0007_voxceleb_v2_1a/exp/xvector_nnet_1a/extract.config models/0007_voxceleb_v2_1a/exp/xvector_nnet_1a/final.raw test
nnet3-copy --nnet-config=models/0007_voxceleb_v2_1a/exp/xvector_nnet_1a/extract.config models/0007_voxceleb_v2_1a/exp/xvector_nnet_1a/final.raw test 
ERROR (nnet3-copy[5.3.112~1-c52ee]:Read():nnet-component-itf.cc:473) Expected token </RectifiedLinearComponent>, got <OderivRms>

[ Stack-Trace: ]

kaldi::MessageLogger::HandleMessage(kaldi::LogMessageEnvelope const&, char const*)
kaldi::MessageLogger::~MessageLogger()
kaldi::nnet3::NonlinearComponent::Read(std::istream&, bool)
kaldi::nnet3::Component::ReadNew(std::istream&, bool)
kaldi::nnet3::Nnet::Read(std::istream&, bool)
void kaldi::ReadKaldiObject<kaldi::nnet3::Nnet>(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, kaldi::nnet3::Nnet*)
main
__libc_start_main
_start

Do you have an idea as to what might be causing this error? When I run the same command on the models in http://kaldi-asr.org/models/m6, it works like a charm.

dewan@dewan-desktop:/data/s2t/speech_tools/kaldi/egs/callhome_diarization/v2$ nnet3-copy --nnet-config=models/0006_callhome_diarization_v2_1a/exp/xvector_nnet_1a/extract.config models/0006_callhome_diarization_v2_1a/exp/xvector_nnet_1a/final.raw  test
nnet3-copy --nnet-config=models/0006_callhome_diarization_v2_1a/exp/xvector_nnet_1a/extract.config models/0006_callhome_diarization_v2_1a/exp/xvector_nnet_1a/final.raw test 
WARNING (nnet3-copy[5.3.112~1-c52ee]:Check():nnet-nnet.cc:789) Node tdnn6.relu is never used to compute any output.
WARNING (nnet3-copy[5.3.112~1-c52ee]:Check():nnet-nnet.cc:789) Node tdnn6.batchnorm is never used to compute any output.
WARNING (nnet3-copy[5.3.112~1-c52ee]:Check():nnet-nnet.cc:789) Node output.affine is never used to compute any output.
WARNING (nnet3-copy[5.3.112~1-c52ee]:Check():nnet-nnet.cc:789) Node output.log-softmax is never used to compute any output.
LOG (nnet3-copy[5.3.112~1-c52ee]:main():nnet3-copy.cc:114) Copied raw neural net from models/0006_callhome_diarization_v2_1a/exp/xvector_nnet_1a/final.raw to test

Many thanks!

david-ryan-snyder commented 5 years ago

I just downloaded the model, and nnet3-copy works using a newer version of Kaldi.

Could you create a new branch with the latest changes from upstream, and see if you still have this issue there?

akshatdewan commented 5 years ago

Many thanks! I took the latest kaldi master branch version (8e30fddb300a87e7c79ef2c0b9c731a8a9fd23f0) and recompiled everything. It works fine now.

I know that OP has already asked this question, but I wanted to add something. I am using https://github.com/wiseman/py-webrtcvad followed by https://github.com/kaldi-asr/kaldi/blob/master/egs/wsj/s5/utils/segmentation.pl to create speech segments, and none of my segments are shorter than 10 seconds (and there are no small gaps between segments), but in my speaker segmentation I am getting several very short "sub-segments". Can you suggest a solution? (I tried playing with different thresholds - 1, e-1, e-2, e-3 - and different numbers of speakers - 2, 3, 4 - but I get similar results every time.)

segments

ENG_2018_09_24_AM_800kbps_NCH_v2-00000000-00001080 ENG_2018_09_24_AM_800kbps_NCH_v2 0.00 10.80
ENG_2018_09_24_AM_800kbps_NCH_v2-00001080-00002355 ENG_2018_09_24_AM_800kbps_NCH_v2 10.80 23.55
ENG_2018_09_24_AM_800kbps_NCH_v2-00002355-00003702 ENG_2018_09_24_AM_800kbps_NCH_v2 23.55 37.02
ENG_2018_09_24_AM_800kbps_NCH_v2-00003702-00004767 ENG_2018_09_24_AM_800kbps_NCH_v2 37.02 47.67
ENG_2018_09_24_AM_800kbps_NCH_v2-00004767-00006264 ENG_2018_09_24_AM_800kbps_NCH_v2 47.67 62.64
ENG_2018_09_24_AM_800kbps_NCH_v2-00006264-00007673 ENG_2018_09_24_AM_800kbps_NCH_v2 62.64 76.74
ENG_2018_09_24_AM_800kbps_NCH_v2-00007673-00009147 ENG_2018_09_24_AM_800kbps_NCH_v2 76.74 91.47
ENG_2018_09_24_AM_800kbps_NCH_v2-00009147-00009567 ENG_2018_09_24_AM_800kbps_NCH_v2 91.47 95.67
ENG_2018_09_24_AM_800kbps_NCH_v2-00009567-00010758 ENG_2018_09_24_AM_800kbps_NCH_v2 95.67 107.58
ENG_2018_09_24_AM_800kbps_NCH_v2-00013422-00014891 ENG_2018_09_24_AM_800kbps_NCH_v2 134.22 148.92
ENG_2018_09_24_AM_800kbps_NCH_v2-00014891-00016383 ENG_2018_09_24_AM_800kbps_NCH_v2 148.92 163.83
ENG_2018_09_24_AM_800kbps_NCH_v2-00016383-00017862 ENG_2018_09_24_AM_800kbps_NCH_v2 163.83 178.62
ENG_2018_09_24_AM_800kbps_NCH_v2-00017862-00019263 ENG_2018_09_24_AM_800kbps_NCH_v2 178.62 192.63
ENG_2018_09_24_AM_800kbps_NCH_v2-00019263-00020601 ENG_2018_09_24_AM_800kbps_NCH_v2 192.63 206.01
ENG_2018_09_24_AM_800kbps_NCH_v2-00020601-00022008 ENG_2018_09_24_AM_800kbps_NCH_v2 206.01 220.08
ENG_2018_09_24_AM_800kbps_NCH_v2-00022008-00023373 ENG_2018_09_24_AM_800kbps_NCH_v2 220.08 233.73
ENG_2018_09_24_AM_800kbps_NCH_v2-00023373-00024828 ENG_2018_09_24_AM_800kbps_NCH_v2 233.73 248.28
ENG_2018_09_24_AM_800kbps_NCH_v2-00024828-00026316 ENG_2018_09_24_AM_800kbps_NCH_v2 248.28 263.16
ENG_2018_09_24_AM_800kbps_NCH_v2-00026316-00027786 ENG_2018_09_24_AM_800kbps_NCH_v2 263.16 277.86
ENG_2018_09_24_AM_800kbps_NCH_v2-00027786-00029163 ENG_2018_09_24_AM_800kbps_NCH_v2 277.86 291.63
ENG_2018_09_24_AM_800kbps_NCH_v2-00029163-00030464 ENG_2018_09_24_AM_800kbps_NCH_v2 291.63 304.65
ENG_2018_09_24_AM_800kbps_NCH_v2-00030464-00031827 ENG_2018_09_24_AM_800kbps_NCH_v2 304.65 318.27
ENG_2018_09_24_AM_800kbps_NCH_v2-00031827-00033117 ENG_2018_09_24_AM_800kbps_NCH_v2 318.27 331.17
ENG_2018_09_24_AM_800kbps_NCH_v2-00033117-00034605 ENG_2018_09_24_AM_800kbps_NCH_v2 331.17 346.05
ENG_2018_09_24_AM_800kbps_NCH_v2-00034605-00036036 ENG_2018_09_24_AM_800kbps_NCH_v2 346.05 360.36

rttm

SPEAKER ENG_2018_09_24_AM_800kbps_NCH_v2 0   0.000   2.625 <NA> <NA> 29 <NA> <NA>
SPEAKER ENG_2018_09_24_AM_800kbps_NCH_v2 0   2.625   1.500 <NA> <NA> 6 <NA> <NA>
SPEAKER ENG_2018_09_24_AM_800kbps_NCH_v2 0   4.125  17.550 <NA> <NA> 33 <NA> <NA>
SPEAKER ENG_2018_09_24_AM_800kbps_NCH_v2 0  21.675   1.875 <NA> <NA> 15 <NA> <NA>
SPEAKER ENG_2018_09_24_AM_800kbps_NCH_v2 0  23.550  23.595 <NA> <NA> 33 <NA> <NA>
SPEAKER ENG_2018_09_24_AM_800kbps_NCH_v2 0  47.145   0.525 <NA> <NA> 29 <NA> <NA>
SPEAKER ENG_2018_09_24_AM_800kbps_NCH_v2 0  47.670  13.875 <NA> <NA> 33 <NA> <NA>
SPEAKER ENG_2018_09_24_AM_800kbps_NCH_v2 0  61.545   1.095 <NA> <NA> 15 <NA> <NA>
SPEAKER ENG_2018_09_24_AM_800kbps_NCH_v2 0  62.640  36.405 <NA> <NA> 33 <NA> <NA>
SPEAKER ENG_2018_09_24_AM_800kbps_NCH_v2 0  99.045   8.535 <NA> <NA> 29 <NA> <NA>
SPEAKER ENG_2018_09_24_AM_800kbps_NCH_v2 0 134.220   7.875 <NA> <NA> 29 <NA> <NA>
SPEAKER ENG_2018_09_24_AM_800kbps_NCH_v2 0 142.095  77.040 <NA> <NA> 28 <NA> <NA>
SPEAKER ENG_2018_09_24_AM_800kbps_NCH_v2 0 219.135   0.945 <NA> <NA> 29 <NA> <NA>
SPEAKER ENG_2018_09_24_AM_800kbps_NCH_v2 0 220.080   4.875 <NA> <NA> 28 <NA> <NA>
SPEAKER ENG_2018_09_24_AM_800kbps_NCH_v2 0 224.955   2.250 <NA> <NA> 29 <NA> <NA>
SPEAKER ENG_2018_09_24_AM_800kbps_NCH_v2 0 227.205   6.000 <NA> <NA> 28 <NA> <NA>
SPEAKER ENG_2018_09_24_AM_800kbps_NCH_v2 0 233.205   0.525 <NA> <NA> 29 <NA> <NA>
SPEAKER ENG_2018_09_24_AM_800kbps_NCH_v2 0 233.730  13.875 <NA> <NA> 28 <NA> <NA>
SPEAKER ENG_2018_09_24_AM_800kbps_NCH_v2 0 247.605   0.675 <NA> <NA> 29 <NA> <NA>
SPEAKER ENG_2018_09_24_AM_800kbps_NCH_v2 0 248.280  26.505 <NA> <NA> 28 <NA> <NA>
SPEAKER ENG_2018_09_24_AM_800kbps_NCH_v2 0 274.785   0.750 <NA> <NA> 29 <NA> <NA>
SPEAKER ENG_2018_09_24_AM_800kbps_NCH_v2 0 275.535  26.220 <NA> <NA> 28 <NA> <NA>
SPEAKER ENG_2018_09_24_AM_800kbps_NCH_v2 0 301.755   0.750 <NA> <NA> 29 <NA> <NA>
SPEAKER ENG_2018_09_24_AM_800kbps_NCH_v2 0 302.505   1.500 <NA> <NA> 28 <NA> <NA>
SPEAKER ENG_2018_09_24_AM_800kbps_NCH_v2 0 304.005   0.645 <NA> <NA> 23 <NA> <NA>
SPEAKER ENG_2018_09_24_AM_800kbps_NCH_v2 0 304.650  22.995 <NA> <NA> 28 <NA> <NA>
SPEAKER ENG_2018_09_24_AM_800kbps_NCH_v2 0 327.645   1.500 <NA> <NA> 29 <NA> <NA>
SPEAKER ENG_2018_09_24_AM_800kbps_NCH_v2 0 329.145   1.500 <NA> <NA> 28 <NA> <NA>
SPEAKER ENG_2018_09_24_AM_800kbps_NCH_v2 0 330.645   0.525 <NA> <NA> 29 <NA> <NA>
SPEAKER ENG_2018_09_24_AM_800kbps_NCH_v2 0 331.170   7.125 <NA> <NA> 28 <NA> <NA>
SPEAKER ENG_2018_09_24_AM_800kbps_NCH_v2 0 338.295   3.750 <NA> <NA> 29 <NA> <NA>
SPEAKER ENG_2018_09_24_AM_800kbps_NCH_v2 0 342.045  14.880 <NA> <NA> 28 <NA> <NA>
SPEAKER ENG_2018_09_24_AM_800kbps_NCH_v2 0 356.925   0.750 <NA> <NA> 29 <NA> <NA>

david-ryan-snyder commented 5 years ago

You'll need to post the actual error/warning for me to look into it more. From your description, it sounds like it's only a warning, not an error.

mmaciej2 commented 5 years ago

@akshatdewan,

It looks like your VAD is probably not giving good output. You should check its output before passing it to segmentation.pl, to make sure it is reasonable in the first place. It's unlikely that the very long speech segments with few silence regions are a bug in the segmentation.pl script; it is more likely an issue with the VAD (e.g., bad parameters or just a data mismatch of some kind).

As for the bad diarization output, it's hard to say what is going wrong there given the problem with the initial segmentation. It might also be a data mismatch that is degrading performance. It could also be that, since the segmentation is bad, there are a lot of "bad" (nonspeech?) regions that are getting put into the clustering algorithm and overwhelming the speaker clusters.

akshatdewan commented 5 years ago

Hi, many thanks for your input! I had a very high silence-proportion of 0.3, which resulted in no inter-segment gaps and a lot of non-speech in the segments. I have tried this again with a lower silence-proportion of 0.05 and I see improvements.

pbirsinger commented 5 years ago

Hi, super useful thread - thanks all! I've actually gotten all of @david-ryan-snyder's steps to technically run, but my diarization output is very wrong, and I suspect it has to do with my initial segments file.

My questions are:

1) Why does v2/run.sh re-compute the segments with compute_vad_decision and then vad_to_segments? Aren't the segments already provided as input (computing features with make_mfcc seems to require them), or are they mocked up as @iacoshoria describes, with just a simple mapping from recording_id to recording_id?

2) I tried @iacoshoria's method of using a simple mapping of recording_id to recording_id for utt2spk and spk2utt, then running make_mfcc, compute_vad, and then vad_to_segments, and get this for segments:

marks1-000000-001486 marks1 0.00 14.86
marks1-001486-002774 marks1 14.86 27.74
marks1-002774-004060 marks1 27.74 40.6
marks1-004060-005493 marks1 40.60 54.93
marks1-005493-006968 marks1 54.93 69.68
marks1-006968-008457 marks1 69.68 84.57
marks1-008457-009779 marks1 84.57 97.79
marks1-009779-011134 marks1 97.79 111.34
marks1-011134-012464 marks1 111.34 124.64
marks1-012464-013824 marks1 124.64 138.24
marks1-013824-015316 marks1 138.24 153.16
marks1-015316-016727 marks1 153.16 167.27
marks1-016727-018178 marks1 167.27 181.78
marks1-018178-019678 marks1 181.78 196.78
marks1-019678-020881 marks1 196.78 208.81
marks1-020881-021237 marks1 208.81 212.37

These segments are definitely wrong, as they clearly miss some speaker changes. Is this the right process for getting segments? Where could it be going wrong?

3) Alternatively, I tried to use http://kaldi-asr.org/models/m4 as @david-ryan-snyder suggested to create the segments file with the following command:

(venv3) ~/podible/kaldi/egs/callhome_diarization/v2 (diarization|✚2…)$ steps/segmentation/detect_speech_activity.sh --nj 1 --cmd "$train_cmd" \
        --extra-left-context 79 --extra-right-context 21 \
        --extra-left-context-initial 0 --extra-right-context-final 0 \
        --frames-per-chunk 150 --mfcc-config "${callhome_dir}"/0004/conf/mfcc_hires.conf \
          "${callhome_dir}"/v2/data/"${name}" \
          "${callhome_dir}"/0004/exp/segmentation_1a/tdnn_stats_asr_sad_1a \
          "${callhome_dir}"/v2/data/mfcc \
          "${callhome_dir}"/0004/exp/segmentation_1a/tdnn_stats_asr_sad_1a \
          "${callhome_dir}"/v2/data/"${name}"
--nj 1 --cmd run.pl --extra-left-context 79 --extra-right-context 21 --extra-left-context-initial 0 --extra-right-context-final 0 --frames-per-chunk 150 --mfcc-config /Users/pbirsinger/podible/kaldi/egs/callhome_diarization/0004/conf/mfcc_hires.conf /Users/pbirsinger/podible/kaldi/egs/callhome_diarization/v2/data/marks_ex4 /Users/pbirsinger/podible/kaldi/egs/callhome_diarization/0004/exp/segmentation_1a/tdnn_stats_asr_sad_1a /Users/pbirsinger/podible/kaldi/egs/callhome_diarization/v2/data/mfcc /Users/pbirsinger/podible/kaldi/egs/callhome_diarization/0004/exp/segmentation_1a/tdnn_stats_asr_sad_1a /Users/pbirsinger/podible/kaldi/egs/callhome_diarization/v2/data/marks_ex4
rm: data/marks_ex4_whole_hires: No such file or directory
utils/data/convert_data_dir_to_whole.sh: Data directory already does not contain segments. So just copying it.
utils/copy_data_dir.sh: copied data from /Users/pbirsinger/podible/kaldi/egs/callhome_diarization/v2/data/marks_ex4 to data/marks_ex4_whole_hires
utils/validate_data_dir.sh: Successfully validated data-directory data/marks_ex4_whole_hires
fix_data_dir.sh: kept all        1 utterances.
fix_data_dir.sh: old files are kept in data/marks_ex4_whole_hires/.backup
steps/make_mfcc.sh --mfcc-config /Users/pbirsinger/podible/kaldi/egs/callhome_diarization/0004/conf/mfcc_hires.conf --nj 1 --cmd run.pl --write-utt2num-frames true data/marks_ex4_whole_hires exp/make_hires/marks_ex4 /Users/pbirsinger/podible/kaldi/egs/callhome_diarization/v2/data/mfcc
steps/make_mfcc.sh: moving data/marks_ex4_whole_hires/feats.scp to data/marks_ex4_whole_hires/.backup
utils/validate_data_dir.sh: Successfully validated data-directory data/marks_ex4_whole_hires
steps/make_mfcc.sh: [info]: no segments file exists: assuming wav.scp indexed by utterance.
Succeeded creating MFCC features for marks_ex4_whole_hires
steps/compute_cmvn_stats.sh data/marks_ex4_whole_hires exp/make_hires/marks_ex4 /Users/pbirsinger/podible/kaldi/egs/callhome_diarization/v2/data/mfcc
Succeeded creating CMVN stats for marks_ex4_whole_hires
fix_data_dir.sh: kept all        1 utterances.
fix_data_dir.sh: old files are kept in data/marks_ex4_whole_hires/.backup
readlink: illegal option -- f
usage: readlink [-n] [file ...]
readlink: illegal option -- f
usage: readlink [-n] [file ...]
steps/nnet3/compute_output.sh --nj 1 --cmd run.pl --iter final --extra-left-context 79 --extra-right-context 21 --extra-left-context-initial 0 --extra-right-context-final 0 --frames-per-chunk 150 --apply-exp true --frame-subsampling-factor 1 data/marks_ex4_whole_hires /Users/pbirsinger/podible/kaldi/egs/callhome_diarization/0004/exp/segmentation_1a/tdnn_stats_asr_sad_1a /Users/pbirsinger/podible/kaldi/egs/callhome_diarization/0004/exp/segmentation_1a/tdnn_stats_asr_sad_1a/sad_marks_ex4_whole
utils/data/get_utt2dur.sh: segments file does not exist so getting durations from wave files
utils/data/get_utt2dur.sh: could not get utterance lengths from sphere-file headers, using wav-to-duration
in get utt2dur nj========== 1
utils/data/get_utt2dur.sh: computed data/marks_ex4_whole_hires/utt2dur
feat-to-len 'scp:head -n 10 data/marks_ex4_whole_hires/feats.scp|' ark,t:-
utils/data/get_utt2dur.sh: data/marks_ex4_whole_hires/utt2dur already exists with the expected length.  We won't recompute it.
utils/data/subsegment_data_dir.sh: note: frame shift is 0.01 [affects feats.scp]
utils/data/get_utt2num_frames.sh: data/marks_ex4_whole_hires/utt2num_frames already present!
Fixed row_end for marks1-0020865-0021236 from 21236 to 21235-1
Fixed row_end for marks1-0020865-0021236 from 21236 to 21235-1
utils/data/subsegment_data_dir.sh: subsegmented data from data/marks_ex4_whole_hires to /Users/pbirsinger/podible/kaldi/egs/callhome_diarization/v2/data/marks_ex4_seg
cp: /Users/pbirsinger/podible/kaldi/egs/callhome_diarization/v2/data/marks_ex4/stm: No such file or directory
cp: /Users/pbirsinger/podible/kaldi/egs/callhome_diarization/v2/data/marks_ex4/reco2file_and_channel: No such file or directory
cp: /Users/pbirsinger/podible/kaldi/egs/callhome_diarization/v2/data/marks_ex4/glm: No such file or directory
fix_data_dir.sh: kept all       88 utterances.
fix_data_dir.sh: old files are kept in /Users/pbirsinger/podible/kaldi/egs/callhome_diarization/v2/data/marks_ex4_seg/.backup
steps/segmentation/detect_speech_activity.sh: Created output segmented kaldi data directory in /Users/pbirsinger/podible/kaldi/egs/callhome_diarization/v2/data/marks_ex4_seg

This gives me an entirely different segments file, equally or more wrong:

marks1-0000000-0000084 marks1 0.00 0.84
marks1-0000102-0000185 marks1 1.02 1.85
marks1-0000185-0000273 marks1 1.85 2.73
marks1-0000273-0000367 marks1 2.73 3.67
marks1-0000371-0000499 marks1 3.71 4.99
marks1-0000613-0000691 marks1 6.13 6.91
marks1-0000726-0000877 marks1 7.26 8.78
marks1-0000877-0001126 marks1 8.78 11.26
marks1-0001126-0001280 marks1 11.26 12.8
marks1-0001280-0001487 marks1 12.80 14.87
marks1-0001487-0001653 marks1 14.87 16.53
marks1-0001653-0001925 marks1 16.53 19.25
marks1-0001937-0002265 marks1 19.37 22.65
marks1-0002265-0002502 marks1 22.65 25.02
marks1-0002502-0002769 marks1 25.02 27.69
marks1-0002769-0002990 marks1 27.69 29.9
marks1-0003049-0003327 marks1 30.49 33.27
marks1-0003327-0003483 marks1 33.27 34.83
marks1-0003496-0003576 marks1 34.96 35.76
marks1-0003622-0003754 marks1 36.23 37.54
marks1-0003754-0003851 marks1 37.54 38.51
marks1-0003851-0003961 marks1 38.51 39.61
marks1-0003961-0004069 marks1 39.61 40.69
marks1-0004069-0004414 marks1 40.69 44.14
marks1-0004466-0004596 marks1 44.66 45.96
marks1-0004627-0004775 marks1 46.27 47.75
marks1-0004863-0005073 marks1 48.63 50.73
marks1-0005172-0005299 marks1 51.72 52.99
marks1-0005304-0005385 marks1 53.04 53.85
marks1-0005397-0005497 marks1 53.97 54.97
marks1-0005497-0005565 marks1 54.97 55.65
marks1-0005569-0005790 marks1 55.69 57.9
marks1-0005839-0006043 marks1 58.39 60.43
marks1-0006111-0006213 marks1 61.11 62.13
marks1-0006213-0006725 marks1 62.13 67.25
marks1-0006725-0006881 marks1 67.25 68.81
marks1-0006881-0007112 marks1 68.81 71.12
marks1-0007112-0007170 marks1 71.12 71.7
marks1-0007170-0007280 marks1 71.70 72.8
marks1-0007280-0007344 marks1 72.80 73.44
marks1-0007414-0007775 marks1 74.14 77.75
marks1-0007817-0007917 marks1 78.18 79.17
marks1-0007917-0008399 marks1 79.17 83.99
marks1-0008399-0008465 marks1 83.99 84.65
marks1-0008465-0008675 marks1 84.65 86.75
marks1-0008675-0008815 marks1 86.75 88.15
marks1-0008815-0008871 marks1 88.15 88.71
marks1-0008871-0008925 marks1 88.71 89.25
marks1-0008925-0009188 marks1 89.25 91.88
marks1-0009188-0009263 marks1 91.88 92.63
marks1-0009263-0009336 marks1 92.63 93.36
marks1-0009354-0009638 marks1 93.54 96.38
marks1-0009638-0009769 marks1 96.38 97.69
marks1-0009769-0010017 marks1 97.69 100.17
marks1-0010017-0010180 marks1 100.17 101.8
marks1-0010180-0010684 marks1 101.80 106.84
marks1-0010725-0011009 marks1 107.25 110.09
marks1-0011009-0011080 marks1 110.09 110.8
marks1-0011126-0011709 marks1 111.26 117.09
marks1-0011718-0012352 marks1 117.18 123.52
marks1-0012486-0012941 marks1 124.86 129.41
marks1-0012941-0013219 marks1 129.41 132.2
marks1-0013219-0013335 marks1 132.20 133.35
marks1-0013339-0013828 marks1 133.39 138.28
marks1-0013828-0014100 marks1 138.28 141
marks1-0014163-0014472 marks1 141.63 144.72
marks1-0014610-0014769 marks1 146.10 147.7
marks1-0014769-0014917 marks1 147.70 149.18
marks1-0014917-0015241 marks1 149.18 152.41
marks1-0015258-0015328 marks1 152.58 153.28
marks1-0015328-0015797 marks1 153.28 157.98
marks1-0015859-0016207 marks1 158.59 162.07
marks1-0016207-0016508 marks1 162.07 165.08
marks1-0016508-0016733 marks1 165.08 167.33
marks1-0016733-0016832 marks1 167.33 168.32
marks1-0016943-0017740 marks1 169.43 177.4
marks1-0017746-0017834 marks1 177.46 178.34
marks1-0017834-0018170 marks1 178.34 181.71
marks1-0018170-0018467 marks1 181.71 184.67
marks1-0018467-0018559 marks1 184.67 185.59
marks1-0018559-0019141 marks1 185.59 191.41
marks1-0019141-0019276 marks1 191.41 192.76
marks1-0019276-0019603 marks1 192.76 196.03
marks1-0019603-0020112 marks1 196.03 201.12
marks1-0020124-0020205 marks1 201.24 202.05
marks1-0020260-0020666 marks1 202.60 206.66
marks1-0020666-0020865 marks1 206.66 208.65
marks1-0020865-0021236 marks1 208.65 212.37

Where is this going wrong?

Thank you very much in advance!

mmaciej2 commented 5 years ago

Hi @pbirsinger,

For question 1, there is no initial segmentation. The make_mfcc script does not require initial segments. If there is no segments file available, it treats each recording in the wav.scp as one giant segment and computes features for the full recording.

We do compute_vad_decision followed by vad_to_segments to create segmentation using a very naive speech activity detection method, i.e. just looking at the energy in each frame. It will not produce particularly good results unless the recording is very high-quality, i.e. strong signal and no noise, and even then the parameters might need to be tuned to get proper output.

It's also worth noting that this initial segmentation is supposed to be speech-activity-detection–style segmentation, not speech-recognition–style segmentation, so it should not be detecting speaker changes if there is no appreciable silence between the speakers' utterances.

As for question 3, it's not immediately clear what is going wrong, especially without taking a closer look at everything. Perhaps there is some kind of mismatch? Have you tried loading the SAD labels into something like Audacity to view while listening to the file? That might help you narrow it down to whether or not it's doing what it is supposed to, just extremely poorly, or if it's not even doing something appropriate at all (e.g. does it ever label silence where there is speech, are there ever silence regions that are properly labeled as silence, etc.).
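
One convenient way to do that (just a sketch; Audacity's label format is tab-separated start, end, and label, which you can then load via File -> Import -> Labels) is to convert your segments file into an Audacity label track:

awk '{printf("%.2f\t%.2f\t%s\n", $3, $4, $1)}' segments > sad_labels.txt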

pbirsinger commented 5 years ago

Hi @mmaciej2 - really appreciate the fast response!

After some further experimentation, the latter segments file built from the m4 model seems plausible. Silences do seem to match up, although there sure are a lot of smaller segments (is this a problem?).

However, when proceeding with the commands with default values that @david-ryan-snyder posted, the rttm output file is definitely off:

SPEAKER marks1 0   0.000   0.840 <NA> <NA> 2 <NA> <NA>
SPEAKER marks1 0   1.020   2.650 <NA> <NA> 2 <NA> <NA>
SPEAKER marks1 0   3.710   1.280 <NA> <NA> 2 <NA> <NA>
SPEAKER marks1 0   6.130   0.780 <NA> <NA> 2 <NA> <NA>
SPEAKER marks1 0   7.260  11.990 <NA> <NA> 2 <NA> <NA>
SPEAKER marks1 0  19.370  10.530 <NA> <NA> 2 <NA> <NA>
SPEAKER marks1 0  30.490   4.340 <NA> <NA> 2 <NA> <NA>
SPEAKER marks1 0  34.960   0.800 <NA> <NA> 2 <NA> <NA>
SPEAKER marks1 0  36.230   7.910 <NA> <NA> 2 <NA> <NA>
SPEAKER marks1 0  44.660   1.300 <NA> <NA> 2 <NA> <NA>
SPEAKER marks1 0  46.270   1.480 <NA> <NA> 2 <NA> <NA>
SPEAKER marks1 0  48.630   2.100 <NA> <NA> 2 <NA> <NA>
SPEAKER marks1 0  51.720   1.270 <NA> <NA> 2 <NA> <NA>
SPEAKER marks1 0  53.040   0.810 <NA> <NA> 2 <NA> <NA>
SPEAKER marks1 0  53.970   1.680 <NA> <NA> 2 <NA> <NA>
SPEAKER marks1 0  55.690   2.210 <NA> <NA> 2 <NA> <NA>
SPEAKER marks1 0  58.390   2.040 <NA> <NA> 2 <NA> <NA>
SPEAKER marks1 0  61.110  11.690 <NA> <NA> 2 <NA> <NA>
SPEAKER marks1 0  72.800   0.640 <NA> <NA> 1 <NA> <NA>
SPEAKER marks1 0  74.140   3.610 <NA> <NA> 2 <NA> <NA>
SPEAKER marks1 0  78.180  15.180 <NA> <NA> 2 <NA> <NA>
SPEAKER marks1 0  93.540  13.300 <NA> <NA> 2 <NA> <NA>
SPEAKER marks1 0 107.250   3.550 <NA> <NA> 2 <NA> <NA>
SPEAKER marks1 0 111.260   5.830 <NA> <NA> 2 <NA> <NA>
SPEAKER marks1 0 117.180   6.340 <NA> <NA> 2 <NA> <NA>
SPEAKER marks1 0 124.860   8.490 <NA> <NA> 2 <NA> <NA>
SPEAKER marks1 0 133.390   7.610 <NA> <NA> 2 <NA> <NA>
SPEAKER marks1 0 141.630   3.090 <NA> <NA> 2 <NA> <NA>
SPEAKER marks1 0 146.100   6.310 <NA> <NA> 2 <NA> <NA>
SPEAKER marks1 0 152.580   5.400 <NA> <NA> 2 <NA> <NA>
SPEAKER marks1 0 158.590   9.730 <NA> <NA> 2 <NA> <NA>
SPEAKER marks1 0 169.430   7.970 <NA> <NA> 2 <NA> <NA>
SPEAKER marks1 0 177.460  23.660 <NA> <NA> 2 <NA> <NA>
SPEAKER marks1 0 201.240   0.810 <NA> <NA> 2 <NA> <NA>
SPEAKER marks1 0 202.600   9.770 <NA> <NA> 2 <NA> <NA>

when the actual result should be:


audio=marks-16.wav lna=a_1 start-time=0.152 end-time=30.352 speaker=speaker_1
audio=marks-16.wav lna=a_2 start-time=30.352 end-time=61.112 speaker=speaker_2
audio=marks-16.wav lna=a_3 start-time=61.112 end-time=108.132 speaker=speaker_1
audio=marks-16.wav lna=a_4 start-time=108.132 end-time=157.904 speaker=speaker_2
audio=marks-16.wav lna=a_5 start-time=157.904 end-time=212.36 speaker=speaker_1

I also set number of speakers to 2 in reco2num_spk.

I'm a bit at a loss to figure out where the process is going wrong now - any ideas?

Thanks!

mmaciej2 commented 5 years ago

@pbirsinger,

One thing you should do is combine segments that are connected, if the speech activity detection system produces them. I believe the m4 ASpIRE model will split up long speech segments, which is undesirable behavior for diarization, though I'm not 100% sure it will do that.

Short segments are not inherently undesirable, but they can be problematic. If a segment is shorter than the sliding window used in xvector extraction, it will result in a less-reliable embedding, in addition to having a slight mismatch on how many frames went into the embedding. At the same time, if there is a short turn, the sliding window extraction might miss the speaker change, while if the speech activity detection segments it out, there's a chance you'll catch it. There's a bit of a trade-off going on, but in general I'd lean toward suggesting having longer segments. You can either tune the speech activity detection system, or you can even do something more simple like merging segments if the silence between them is shorter than some threshold.

As for the diarization output being off, again, it's hard to say what's going on. I'd again recommend looking at the labels along with the audio and seeing if you can figure out why almost everything is being attributed to the same speaker, i.e. whether there is anything special about the parts that weren't. It's very possible that there is something like a laugh, which gets marked as speech by the speech activity detection system, but ends up being very dissimilar from the rest of the recording.

This can illustrate one of the downsides of using a cluster stop criterion of the "correct" number of speakers as opposed to using a tuned threshold. Since the system is based on a similarity metric, it's possible that the difference between the two speakers ends up being smaller than the difference between a speaker and something like a laugh by that speaker. As a result, if you cluster to 2 speakers specifically, you're asking it to segment into the two most dissimilar categories, which is not the two speakers. In contrast, if you are clustering according to a threshold tuned to approximate the boundary between same- and different-speaker speech, it is more likely to find 3 clusters, which despite being incorrect, results in more accurate output. Now, I'm not saying that that is what is happening with your setup, but something like that can definitely happen and would manifest in recordings that are almost entirely attributed to a single speaker.

itaipee commented 5 years ago

Thanks @david-ryan-snyder, this was very helpful. I have a question about the PLDA scoring: where does "$nnet_dir/xvectors_plda/" come from?
It is not in the original run.sh script, nor does it exist in the model.

Perform PLDA scoring
Now that we have many embeddings extracted from short overlapping windows of the recordings, we use PLDA to compute their pair-wise similarity. If you use a pretrained model, look in the README to see where a PLDA model lives.

diarization/nnet3/xvector/score_plda.sh --cmd "$train_cmd_intel --mem 4G" \
  --target-energy 0.9 --nj 20 $nnet_dir/xvectors_plda/ \
  $nnet_dir/xvectors_$name \
  $nnet_dir/xvectors_$name/plda_scores

david-ryan-snyder commented 5 years ago

The usage message for this script says that the arguments are <plda-dir> <xvector-dir> <output-dir>.

So the first argument is the directory containing the PLDA model. The second argument is the directory containing the x-vectors we're going to compare, and the last argument is the output directory, where the score matrices are written.

itaipee commented 5 years ago

The usage message for this script says that the arguments are <plda-dir> <xvector-dir> <output-dir>.

So the first argument is the directory containing the PLDA model. The second argument is the directory containing the x-vectors we're going to compare, and the last argument is the output directory, where the score matrices are written.

So if I use the model from http://kaldi-asr.org/models/m6 , I should use $nnet_dir/xvectors_callhome1 or $nnet_dir/xvectors_callhome2 ?

mmaciej2 commented 5 years ago

@itaipee, you can use either the $nnet_dir/xvectors_callhome1 or $nnet_dir/xvectors_callhome2 directory for the PLDA. In the callhome recipe, since we do not have a held-out set to whiten with, we divide the evaluation set in two. We whiten on callhome1 to score callhome2 and vice-versa in order to score the full set fairly. In theory the two PLDA models should be comparable.

SNSOHANI commented 5 years ago

Hello there! I am new to speech technology and speech processing. I have to build a speaker diarization system. Can someone kindly guide me on where to begin?

mmaciej2 commented 5 years ago

@SNSOHANI, our basic diarization systems are in egs/callhome_diarization. The v1 directory uses i-vectors, the v2 directory uses x-vectors (two methods of extracting speaker-identification vectors, the latter using a neural network). Both directories refer to a paper at the top of the run.sh file. I would read those papers and take a look at the run.sh scripts to try to get an understanding of how these things work.

SNSOHANI commented 5 years ago

@mmaciej2 Thank you very much for the prompt response!

liuyue94 commented 5 years ago

Hello, I know there is a Callhome Diarization Xvector Model (http://kaldi-asr.org/models/m6). I downloaded this model, but I don't know how to use it. I want to know how to use the model directly to test some data so as to check its effect. Thanks very much!

jardnzm commented 5 years ago

Hi, I ran with the pretrained model and it works well when I do not specify the number of speakers. However, when I use reco2num_spk to set the number of speakers to 3, I get an error:

diarization/cluster.sh --cmd run.pl --mem 4G --nj 1 --reco2num-spk data/reco2num_spk 0006_callhome_diarization_v2_1a/exp/xvector_nnet_1a//xvectors_name/plda_scores 0006_callhome_diarization_v2_1a/exp/xvector_nnet_1a//xvectors_name/plda_scores_num_speakers
diarization/cluster.sh: clustering scores
bash: line 1: 36018 Abort trap: 6 ( agglomerative-cluster --threshold=0.5 --read-costs=false --reco2num-spk-rspecifier=ark,t:data/reco2num_spk --max-spk-fraction=1.0 --first-pass-max-utterances=32767 "scp:utils/filter_scp.pl 0006_callhome_diarization_v2_1a/exp/xvector_nnet_1a//xvectors_name/plda_scores_num_speakers/tmp/split1/1/spk2utt 0006_callhome_diarization_v2_1a/exp/xvector_nnet_1a//xvectors_name/plda_scores/scores.scp |" ark,t:0006_callhome_diarization_v2_1a/exp/xvector_nnet_1a//xvectors_name/plda_scores_num_speakers/tmp/split1/1/spk2utt ark,t:0006_callhome_diarization_v2_1a/exp/xvector_nnet_1a//xvectors_name/plda_scores_num_speakers/labels.1 ) 2>> 0006_callhome_diarization_v2_1a/exp/xvector_nnet_1a//xvectors_name/plda_scores_num_speakers/log/agglomerative_cluster.1.log >> 0006_callhome_diarization_v2_1a/exp/xvector_nnet_1a//xvectors_name/plda_scores_num_speakers/log/agglomerative_cluster.1.log
run.pl: job failed, log is in 0006_callhome_diarization_v2_1a/exp/xvector_nnet_1a//xvectors_name/plda_scores_num_speakers/log/agglomerative_cluster.1.log

My wav.scp is:

data/test.wav data/test.wav

My reco2num_spk is:

data/test.wav 3

The log error is:

ERROR (agglomerative-cluster[5.5.337~1-35f96]:Value():util/kaldi-table-inl.h:2402) Value() called but no such key data/test.wav in archive data/reco2num_spk

could you please help me with that?

danpovey commented 5 years ago

Probably there is a mismatch in the recording-ids between two different data sources.
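
For example (a sketch with a made-up recording-id), the key in the first column of reco2num_spk has to be exactly the recording-id that appears in wav.scp and is carried through to the scores:

# wav.scp
test data/test.wav
# reco2num_spk
test 3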

uygnef commented 5 years ago

Hi there, I get a plda_scores/scores.1.ark file and it seems to be a binary file. How can I convert it to a readable format? Is it a matrix showing the pairwise similarity of all segments? Here is part of the plda_scores/scores.1.ark file:

f96b18a5441007a ^@BFM ^D¼^@^@^@^D¼^@^@^@^Y¶^EBĪzAÖ?üA¶^F<8f>AIìîAý1<85>A:¿âÀ^_äKÁf³{Á<96>z÷Aî7FAªP¿Áêì¤ÁÅögÁw<99>ÍÁ<9e>£«ÁâY<95>Áw^X<98>Á³B¡ÁÏÿ{Áñå<9e>A^T?é@æyûÀÓd!Á Ìw@99Æ@Ä,<86>À#^YTÁ^R#áA^EªEAgÃôÀ©z%Áº¾ºA·/<82>A:<93>µA<84>^^QAÞÙLÁ¯A<91>Á<8e><9e>

mmaciej2 commented 5 years ago

You should read this page on Kaldi I/O mechanisms: http://kaldi-asr.org/doc/io.html. I believe the copy-matrix binary should be usable for converting it to a readable format, as, unless I'm mistaken, the PLDA scores are in fact similarity matrices as you suspect.
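
For example, something like this should dump a scores archive as text (a sketch; the ,t modifier requests text output):

copy-matrix ark:plda_scores/scores.1.ark ark,t:- | head
copy-matrix scp:plda_scores/scores.scp ark,t:plda_scores/scores.txt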

danpovey commented 5 years ago

Regarding $nnet_dir... if you are not sure, for any script, where something is supposed to come from, do git grep that-script-name from the top level of Kaldi and look for example scripts that use it.
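
For example, from the top level of the Kaldi source tree, something like:

git grep -l extract_xvectors.sh egs/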

On Mon, Jul 29, 2019 at 2:01 AM Jee Wen Jie notifications@github.com wrote:

Thanks for the guide Daniel. I could follow it up till the Extract Embeddings. How should I prepare $nnet_dir? extract_xvectors.sh checks for final.raw, min_chunk_size and max_chunk_size and I'm not sure how to get them.

anderleich commented 5 years ago

Is there a problem if recordings (actual wav files) have been segmented by speaker, so that in each wav file there is just one speaker?

danpovey commented 5 years ago

If you have that then I'd say you don't need to do diarization because the work has already been done.

Perhaps what you need is just voice activity detection.

anderleich commented 5 years ago

@danpovey even for training? They are part of the same audio which I've segmented by speaker to have shorter files. I want to use them for speaker diarization training. Should I merge them back?

david-ryan-snyder commented 5 years ago

This should be fine for training the x-vector DNN, unless the segments are really short. If they're less than 3 seconds, you may need to merge them together in some way, since we usually train the x-vector DNN on segments of that length. You also need to ensure that all segments from the same speaker share the same class label. Whatever works for training the x-vector DNN should also be fine for PLDA training.

Our diarization system doesn't require multi-speaker audio to train on. It's trained as if it will be used for single-speaker speaker recognition. We utilize it for diarizing multi-speaker recordings by extracting embeddings from short segments (and we generally assume one speaker per segment) and then cluster them to identify where the different speakers appear.

roshansh-cmu commented 5 years ago

# Started at Tue Sep 10 22:57:48 EDT 2019
agglomerative-cluster --threshold=0.5 --read-costs=false --reco2num-spk-rspecifier=ark,t:data/iphone_data/reco2num_spk --max-spk-fraction=1.0 --first-pass-max-utterances=32767 'scp:utils/filter_scp.pl exp/xvector_nnet_1a/xvectors_iphone_data/plda_scores_num_speakers/tmp/split20/10/spk2utt exp/xvector_nnet_1a/xvectors_iphone_data/plda_scores/scores.scp |' ark,t:exp/xvector_nnet_1a/xvectors_iphone_data/plda_scores_num_speakers/tmp/split20/10/spk2utt ark,t:exp/xvector_nnet_1a/xvectors_iphone_data/plda_scores_num_speakers/labels.10
ERROR (agglomerative-cluster[5.5.347~1-8b54]:Value():util/kaldi-table-inl.h:2402) Value() called but no such key 2_d_S1 in archive data/iphone_data/reco2num_spk

[ Stack-Trace: ]
kaldi::MessageLogger::LogMessage() const
kaldi::MessageLogger::LogAndThrow::operator=(kaldi::MessageLogger const&)
kaldi::RandomAccessTableReaderUnsortedArchiveImpl >::Value(std::__cxx11::basic_string, std::allocator > const&)
main
__libc_start_main
_start

kaldi::KaldiFatalError# Accounting: time=53 threads=1
# Ended (code 255) at Tue Sep 10 22:58:41 EDT 2019, elapsed time 53 seconds

My reco2num_spk file has entries of the form <recording-id> <number-of-speakers>. I was wondering why spk2utt is used for any comparison at all in the clustering step. The usage message of the agglomerative-cluster binary says that reco2utt is required, not spk2utt. Please clarify which is to be used.

My speaker IDs are of the form 2_d_S1, which I believe is acceptable; S1 is a speaker tag in my data. I would appreciate help in trying to debug the issue. All the files are in the correct Kaldi formats.

EDIT: I got it to work by creating a reco2num_spk file with entries of the form <speaker-id> 1 for all speakers. But I don't think this is the right way to do it, as all segments end up labelled 1.

danpovey commented 5 years ago

In general, Kaldi programs are not supposed to call Value() without checking that the key exists in the table. It should be rewritten to check that, probably print a warning if the key is not present, and probably accumulate a count of how many keys were missing, exiting with an error status at the end if it was more than half. If you have time to make a PR, that would be good.


roshansh-cmu commented 5 years ago

@danpovey , Thanks for the feedback.

  1. As far as I can see, the PLDA scores and spk2utt files use "speaker_ID" keys, whereas the reco2num_spk file has "recording_ID" keys, which don't match what is expected, and that causes my error. It would be good to understand whether recording IDs are what is truly expected.
  2. I tried replacing the reco2num_spk file with a spk2num_spk file, which has entries of the form <speaker-id> 1. This produced diarization outputs all mapped to the label 1, but the key error was resolved. That means agglomerative-cluster.cc expects speaker-ID keys, which I am not sure how to supply.
  3. Another thing to note is that I was able to get it to work with an arbitrary, unknown number of speakers, but I am having issues with the reco2num_spk version, where I can deterministically specify how many speakers are in the conversation.
  4. Unfortunately, I don't have experience working with C++ to make a PR, and furthermore, I am not sure I understand the code fully. If you believe the code is structured correctly and there is an explanation for the apparent key discrepancy, I would be happy to take another look and at the very least pin down the issue.

@david-ryan-snyder , would appreciate your feedback too, if any.

mmaciej2 commented 5 years ago

@RSSharma246,

There seem to be some basic misunderstandings here, and I want to try to clear them up before you lead yourself further down an incorrect path. It's hard to figure out what is going wrong with your setup, but hopefully I can give you the tools to figure it out yourself.

First of all, I'd recommend looking at the usage documentation for agglomerative-cluster and the diarization/cluster.sh script that is running it. It explains how these things work and what format the inputs should be in.

I think the misunderstanding is that there is nothing inherently special about reco2num_spk, spk2utt, etc. They are essentially just tables in text format. They are named certain ways to make things more easily understandable, but, for example, there is no difference between reco2num_spk and utt2num_spk besides that we choose to call the keys in reco2num_spk recordings and the keys in utt2num_spk utterances. So, don't get too caught up in what the names of the files are while debugging. The names can be useful, but it's what the files contain that's important. Read the documentation of the higher-level scripts (for example diarization/cluster.sh) and the C++ programs they call (for example agglomerative-cluster.cc) and make sure you understand what the tables are being used for, so you can verify that their contents are consistent with what they should be.

But, to more specifically address your questions/comments: From the usage message of agglomerative-cluster, the "scores" archive has recording IDs as keys and square score matrices as values (which are the PLDA scores). The second argument (which happens to be named spk2utt in the recipe for a reason I cannot recall) is the mapping from recordings to utterances, so the keys are recording IDs and the values are lists of utterances in that recording, which are actually the labels of the rows/columns of the score matrices. And reco2num_spk contains the a priori information of how many speakers are present within a recording, so the keys are again recording IDs and the values are the number of speakers present within each recording.

Regardless of what the names of the files are (for example that the reco2utt filename is called spk2utt), they should contain the above information. You should check to make sure that the files contain the right things. If they don't, then you need to figure out what created them, because that is probably where the bug came from, not agglomerative-cluster.
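
To make the expected contents concrete, here is a minimal single-recording sketch (the name rec1 and the subsegment IDs are invented for illustration, not taken from this thread):

# The second argument to agglomerative-cluster (named spk2utt in the recipe) is
# really a reco2utt map: <recording-id> <subsegment-id> <subsegment-id> ...
cat > reco2utt <<EOF
rec1 rec1-0000000-0000150 rec1-0000075-0000225 rec1-0000150-0000300
EOF

# reco2num_spk is keyed by the same recording IDs: <recording-id> <num-speakers>.
# If every recording in wav.scp has, say, 2 speakers, one way to build it is:
awk '{print $1, 2}' wav.scp > reco2num_spk

# The scores archive then maps each recording ID to a square matrix whose rows
# and columns correspond, in order, to the subsegments listed in reco2utt.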

mmaciej2 commented 5 years ago

@RSSharma246, That is good to hear. I can see how the weird naming conventions can be problematic for someone trying to understand the code. I can't remember the exact reasons why it is that way, but the diarization/extract_ivectors.sh script does have some explanation for the weird naming conventions. Perhaps this is something that someone should look into changing. I will not have the time anytime soon, but maybe @HuangZiliAndy or perhaps @david-ryan-snyder can take a look.

roshansh-cmu commented 5 years ago

@mmaciej2 @danpovey - I did look at all the files again, and I think there is a gap somewhere, either in my understanding or in the current setup. In this post I am using "standard" Kaldi file names so as not to confuse anyone further. This is what is happening:

  1. The x-vectors are computed per utterance, which is right.
  2. The PLDA scores are computed per speaker, so speaker ID is the key for the PLDA scores.
  3. Then, in diarization/cluster.sh, the first step calls agglomerative-cluster.cc with spk2utt, reco2num_spk, the PLDA scores scp, and the output label file as parameters. According to the documentation for agglomerative-cluster.cc, it expects not spk2utt but a mapping from recording IDs to utterance IDs. But the PLDA score keys are speaker IDs, not recording IDs. Therefore I see this key-mismatch error against the reco2num_spk file.

Two possibilities:

  1. The PLDA scores should be computed per recording rather than per speaker, which means there is a bug in the PLDA scoring script diarization/nnet3/xvector/score_plda.sh.
  2. The PLDA scores should be computed per speaker (which makes more sense, I think), which means there is a natural key mismatch in the agglomerative-cluster.cc step of diarization/cluster.sh, since we provide the recording-ID-to-number-of-speakers map as reco2num_spk and spk2utt, which maps speaker IDs to utterance IDs.

Which of the two cases is true? Understanding that should help fix my problems.

mmaciej2 commented 5 years ago

@RSSharma246, In a diarization system, nothing is done on a per-speaker basis. If you knew the speaker labels, you wouldn't need to perform diarization. The PLDA scores archive contains all pairwise scores between subsegments within a recording, so they can then be partitioned according to their similarity score to identify the speaker labels.
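
If it helps to see this concretely, each entry in that archive is an N-by-N matrix, where N is the number of subsegments in the recording; you can dump one in text form with copy-feats (the path below is illustrative, not from this thread):

# Print the first few pairwise PLDA score matrices as text. Each key is a
# recording ID; rows/columns follow the subsegment order in the reco2utt file.
copy-feats scp:exp/xvector_nnet_1a/xvectors_mydata/plda_scores/scores.scp ark,t:- | head -n 20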

roshansh-cmu commented 5 years ago

Issue resolved!

dev-sajal commented 4 years ago

@iacoshoria the recipe is not bound to this dataset. We are talking about making a diarization recipe based on some freely available dataset (such as AMI), but that will probably not be in the main branch of Kaldi any time soon (@mmaciej2, do you have an easy to follow AMI recipe somewhere you can point to?).

I recently ran diarization on the "Speakers in the Wild" dataset. I'll show you a few lines of each of the files I created in order to diarize it. I will then go over the steps needed to diarize a generic dataset. Hopefully you can generalize this to your own.

wav.scp This file is easy to create if you are familiar with Kaldi data preparation. It's in the form of <recording-id> <wav-file>. In my directory, it happens to look like this: aahtm sox -t flac /export/common/data/corpora/NIST/SRE/sitw_database.v4/dev/audio/aahtm.flac -t wav -r 16k -b 16 - channels 1 | aaoao sox -t flac /export/common/data/corpora/NIST/SRE/sitw_database.v4/dev/audio/aaoao.flac -t wav -r 16k -b 16 - channels 1 | abvwc sox -t flac /export/common/data/corpora/NIST/SRE/sitw_database.v4/dev/audio/abvwc.flac -t wav -r 16k -b 16 - channels 1 | adfcc sox -t flac /export/common/data/corpora/NIST/SRE/sitw_database.v4/dev/audio/adfcc.flac -t wav -r 16k -b 16 - channels 1 | adzuy sox -t flac /export/common/data/corpora/NIST/SRE/sitw_database.v4/dev/audio/adzuy.flac -t wav -r 16k -b 16 - channels 1 |

segments This will be the hardest file to generate. It's of the form <segment-id> <recording-id> <start-time> <end-time>. The egs/callhome_diarization recipe is concerned only with speaker diarization and we assume that this file has already been created by a speech activity detection system (SAD). However, in general, you will need to create this file yourself. You have several options. The easiest option (which will probably be perfectly adequate on relatively clean audio) is to use the energy-based SAD (e.g., run sid/compute_vad_decision.sh) to create frame-level speech/nonspeech decisions. Contiguous speech frames are then combined to create the segments you need using diarization/vad_to_segments.sh. Another option is to use an off the shelf SAD system. @vimalmanohar uploaded a model here: http://kaldi-asr.org/models/m4. Bear in mind you'd need to create a separate set of features for this system.

Regardless of what you use to compute segments, the file should look something like this:

aahtm_0000 aahtm 0.03 2.17
aahtm_0001 aahtm 3.15 5.33
aahtm_0002 aahtm 6.79 7.78
aahtm_0003 aahtm 8.15 10.15
aahtm_0004 aahtm 10.48 13.45

utt2spk A file of the form <segment-id> <recording-id>, which you can create by just running awk '{print $1, $2}' segments > utt2spk

aahtm_0000 aahtm
aahtm_0001 aahtm
aahtm_0002 aahtm
aahtm_0003 aahtm
aahtm_0004 aahtm

spk2utt This file is created from the utt2spk file by running utils/utt2spk_to_spk2utt.pl utt2spk > spk2utt. Or you can run utils/fix_data_dir.sh on your directory.

aahtm aahtm_0000 aahtm_0001 aahtm_0002 aahtm_0003 aahtm_0004 aahtm_0005 aahtm_0006 aahtm_0007 aahtm_0008 aahtm_0009 aahtm_0010 aahtm_0011 aahtm_0012 aahtm_0013 aahtm_0014 aahtm_0015 aahtm_0016 aahtm_0017 aahtm_0018 aahtm_0019 aahtm_0020 aahtm_0021 aahtm_0022 aahtm_0023 aahtm_0024 aahtm_0025 aahtm_0026 aahtm_0027 aahtm_0028 aahtm_0029

Now that your data is prepared, I'll try to walk you through the remaining steps. I'm using the variable $name in place of your dataset, so you might be able to just copy and paste these lines of code, and set the $name variable to whatever your dataset is called.

Make Features You'll need to run steps/make_mfcc.sh using the appropriate MFCC configuration given by the model you're using (it'll be in the conf directory).

steps/make_mfcc.sh --mfcc-config conf/mfcc.conf --nj 60 \
  --cmd "$train_cmd_intel" --write-utt2num-frames true \
  data/$name exp/make_mfcc $mfccdir
utils/fix_data_dir.sh data/$name

Then run local/nnet3/xvector/prepare_feats.sh to apply sliding-window CMVN to the data and dump it to disk.

local/nnet3/xvector/prepare_feats.sh --nj 60 --cmd "$train_cmd_intel" \
  data/$name data/${name}_cmn exp/${name}_cmn

Copy the segments file to the new mean-normalized data directory:

cp data/$name/segments data/${name}_cmn/
utils/fix_data_dir.sh data/${name}_cmn

Extract Embeddings Now we're going to extract embeddings from subsegments of the segments file. I usually use the configuration below, where the embeddings are extracted from an (at most) 1.5 second window, with a 50% overlap between subsegments (since the period is 0.75 of a second).

diarization/nnet3/xvector/extract_xvectors.sh --cmd "$train_cmd_intel --mem 5G" \
  --nj 60 --window 1.5 --period 0.75 --apply-cmn false \
  --min-segment 0.5 $nnet_dir \
  data/${name}_cmn $nnet_dir/xvectors_${name}

Perform PLDA scoring Now that we have many embeddings extracted from short overlapping windows of the recordings, we use PLDA to compute their pair-wise similarity. If you use a pretrained model, look in the README to see where a PLDA model lives.

diarization/nnet3/xvector/score_plda.sh --cmd "$train_cmd_intel --mem 4G" \
  --target-energy 0.9 --nj 20 $nnet_dir/xvectors_plda/ \
  $nnet_dir/xvectors_$name \
  $nnet_dir/xvectors_$name/plda_scores

Cluster Speakers The previous step gave you a matrix of similarity scores between each pair of segments. This next step performs agglomerative hierarchical clustering to partition the recording into speech belonging to different speakers.

If you know how many speakers are in your recordings (say it's summed-channel telephone speech, so you can assume there are probably 2 speakers), you can supply a file called reco2num_spk via the option --reco2num-spk. This is a file of the form <recording-id> <number-of-speakers-in-that-recording>.

diarization/cluster.sh --cmd "$train_cmd_intel --mem 4G" --nj 20 \
  --reco2num-spk data/$name/reco2num_spk \
  $nnet_dir/xvectors_$name/plda_scores \
  $nnet_dir/xvectors_$name/plda_scores_num_speakers

Or, you may not know how many speakers are in a recording (say it's a random online video). Then you'll need to specify a threshold at which to stop clustering: once the pair-wise similarity of the embeddings drops below this threshold, clustering stops. You might obtain this by finding the threshold that minimizes the diarization error rate (DER) on a development set. But this won't be possible if you don't have segment-level labels for a dataset. If you don't have these labels, @dpovey suggested clustering a set of in-domain data and tuning the threshold until it gives you the average number of speakers per recording that you expect (e.g., you might expect that there are on average 2 speakers per recording, but sometimes more or fewer).

diarization/cluster.sh --cmd "$train_cmd_intel --mem 4G" --nj 40 \
  --threshold $threshold \
  $nnet_dir/xvectors_$name/plda_scores \
  $nnet_dir/xvectors_$name/plda_scores_threshold_${threshold}

Diarized Speech If everything went well, you should have a file called rttm in the directory $nnet_dir/xvectors_$name/plda_scores_threshold_${threshold}/. The 2nd column is the recording ID, the 4th column is the start time of a segment, the 5th is its duration, and the 8th column is the speaker label assigned to that segment.

SPEAKER mfcny 0 86.200 16.400 <NA> <NA> 1 <NA> <NA>
SPEAKER mfcny 0 103.050 5.830 <NA> <NA> 1 <NA> <NA>
SPEAKER mfcny 0 109.230 4.270 <NA> <NA> 1 <NA> <NA>
SPEAKER mfcny 0 113.760 8.625 <NA> <NA> 1 <NA> <NA>
SPEAKER mfcny 0 122.385 4.525 <NA> <NA> 2 <NA> <NA>
SPEAKER mfcny 0 127.230 6.230 <NA> <NA> 2 <NA> <NA>
SPEAKER mfcny 0 133.820 0.850 <NA> <NA> 2 <NA> <NA>

I started off with your guide and created a wav.scp file. I then wanted to create a segments file, so I called compute_vad_decision.sh with just the wav.scp, and it failed with an error about a missing feats.scp. How can I generate this file, given that there are no raw MFCC features included with the pretrained model? BTW, I am using the SRE16 pretrained model (the 3rd one on the Kaldi models page).

david-ryan-snyder commented 4 years ago


You have to generate the features (which are MFCCs) yourself. They're never distributed as part of a pretrained model.
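
A minimal sketch of that step (the script names are the standard ones from egs/callhome_diarization; the data directory, number of jobs, and conf/mfcc.conf are assumptions to adapt to your setup):

# Compute MFCCs for your own data directory; pretrained models ship only the
# network and PLDA, never features.
steps/make_mfcc.sh --mfcc-config conf/mfcc.conf --nj 20 --cmd "$train_cmd" \
  --write-utt2num-frames true data/$name exp/make_mfcc mfcc
utils/fix_data_dir.sh data/$name

# Energy-based speech activity detection, then convert the frame-level
# decisions into a segments file.
sid/compute_vad_decision.sh --nj 20 --cmd "$train_cmd" data/$name exp/make_vad mfcc
diarization/vad_to_segments.sh --nj 20 --cmd "$train_cmd" data/$name data/${name}_segmented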

serendipity24 commented 4 years ago

Hi, I am training a diarization model using custom data. I am using the emr.ai corpus and the callhome_diarization/v2 recipe to train it. As a mock run through the script, I have supplied a portion of the emr corpus under the sre/ name. Here the subsegments are not being generated (the files are blank). Are there any particular parameters that I need to take care of, depending on the durations in the segments file?

Thank you.

anderleich commented 4 years ago

With just one recording (wav file), I create the reco2num_spk file as:

recording1 2

and the wav.scp file as:

recording1 /absolute/path/to/file

I get an error when clustering:

ERROR (agglomerative-cluster[5.5.794~1-2b62]:Value():util/kaldi-table-inl.h:2402) Value() called but no such key recoding1-0000000-0000659 in archive /XXXXXXXX/0003_sre16_v2_1a/data/reco2num_spk

Any clues?

danpovey commented 4 years ago

The scores should, it seems, be indexed by recording, not by utterance ID, so an entry in the scores archive should, I think, be something like a recording ID followed by a matrix of scores, #utts by #utts. Yours seems to be indexed by utterance.
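
One quick way to check this (a sketch; the paths below are placeholders for your own experiment directories) is to compare the keys on both sides:

# The keys of the PLDA scores archive should be recording IDs (e.g. recording1);
# keys like recording1-0000000-0000659 mean the scores were produced per subsegment.
# Whatever they are, they must match the keys in reco2num_spk.
cut -d' ' -f1 exp/xvectors_mydata/plda_scores/scores.scp | sort | head
cut -d' ' -f1 data/mydata/reco2num_spk | sort | head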


anderleich commented 4 years ago

I don't understand what you mean by a recording. The actual wav file? These are my files:

wav.scp:
recording1 /absolute/path/to/wav

segments:
utt_0001 recording1 0.00 1.23
utt_0002 recording1 1.45 4.56
...

reco2num_spk:
recording1 2

utt2spk:
utt_0001 utt_0001
utt_0002 utt_0002

I guess I should obtain the speaker for each utterance (utt_0001, utt_0002...)

danpovey commented 4 years ago

I think it creates utterance IDs within those recordings, which may have time marks in them. But yes, recording1 is the recording, and recording1-xxxxxxx-xxxxxx would be the utterance. I am not super familiar with the diarization scripts.
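
For the data preparation shown above, a minimal sketch (data/mydata is a placeholder directory): utt2spk should map every segment to its recording, as in the walkthrough quoted earlier in the thread, so that the extracted x-vectors and PLDA scores end up keyed by recording:

# Map each segment to its recording, then derive spk2utt and validate the dir.
awk '{print $1, $2}' data/mydata/segments > data/mydata/utt2spk
utils/utt2spk_to_spk2utt.pl data/mydata/utt2spk > data/mydata/spk2utt
utils/fix_data_dir.sh data/mydata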
