tango4j / Auto-Tuning-Spectral-Clustering

This repo is for the SPL paper "Auto-Tuning Spectral Clustering for Speaker Diarization Using Normalized Maximum Eigengap"
MIT License

Auto-Tuning Spectral Clustering for Speaker Diarization Using Normalized Maximum Eigengap

@article{park2019auto,
  title={Auto-Tuning Spectral Clustering for Speaker Diarization Using Normalized Maximum Eigengap},
  author={Park, Tae Jin and Han, Kyu J and Kumar, Manoj and Narayanan, Shrikanth},
  journal={IEEE Signal Processing Letters},
  year={2019},
  publisher={IEEE}
}

Features of Auto-tuning NME-SC method

Auto-tuning NME-SC proposed method -

Performance Table

Track 1: Oracle VAD

| System | CALLHOME | CHAES-eval | CH109 | RT03(SW) | AMI |
|---|---|---|---|---|---|
| Kaldi PLDA + AHC [1] | 8.39% | 24.27% | 9.72% | 1.73% | - |
| Spectral Clustering COS+B-SC [2] | 8.78% | 4.4% | 2.25% | 0.88% | - |
| Auto-Tuning COS+NME-SC [2] | 7.29% | 2.48% | 2.63% | 2.21% | - |
| Auto-Tuning COS+NME-SC Sparse-Search-20 [2] | 7.24% | 2.48% | 2.00% | 0.92% | 4.21% |

Track 2: System VAD

| System | CALLHOME | CHAES-eval | CH109 | RT03(SW) |
|---|---|---|---|---|
| Kaldi PLDA + AHC [1] | 6.64% (12.96%) | 1.45% (5.52%) | 2.6% (6.89%) | 0.99% (3.53%) |
| Spectral Clustering COS+B-SC [2] | 6.91% (13.23%) | 1.00% (5.07%) | 1.46% (5.75%) | 0.56% (3.1%) |
| Auto-Tuning COS+NME-SC [2] | 5.41% (11.73%) | 0.97% (5.04%) | 1.32% (5.61%) | 0.59% (3.13%) |
| Auto-Tuning COS+NME-SC Sparse-Search-20 [2] | 5.55% (11.87%) | 1.00% (5.06%) | 1.42% (5.72%) | 0.58% (3.13%) |

Datasets

CALLHOME NIST SRE 2000 (LDC2001S97): The most popular diarization dataset.
CHAES-eval: CALLHOME American English Subset (CHAES) (LDC97S42). An English corpus for speaker diarization, split into train/valid/eval sets.
CH-109 (LDC97S42): Sessions with 2 speakers in CHAES. Usually tested by providing the number of speakers.
RT03(SW) (LDC2007S10): Switchboard part of the RT03 dataset.

Reference

[1] PLDA + AHC, Callhome Diarization Xvector Model
[2] Tae Jin Park et al., Auto-Tuning Spectral Clustering for Speaker Diarization Using Normalized Maximum Eigengap, IEEE Signal Processing Letters, 2019

Getting Started

TLDR; One-click demo script

source run_demo_clustering.sh

Prerequisites

Installing

virtualenv must be installed on your machine first. Install it with the following command:

sudo pip3 install virtualenv 

Once virtualenv is installed, run the "install_venv.sh" script to create a virtual environment.

source install_venv.sh

This command will create a folder named "env_nmesc".

Usage Example

You need to prepare the following:

  1. Segmentation files in Kaldi style format:

ex) segments

iaaa-00000-00327-00000000-00000150 iaaa 0 1.5
iaaa-00000-00327-00000075-00000225 iaaa 0.75 2.25
iaaa-00000-00327-00000150-00000300 iaaa 1.5 3
...
iafq-00000-00272-00000000-00000150 iafq 0 1.5
iafq-00000-00272-00000075-00000225 iafq 0.75 2.25
iafq-00000-00272-00000150-00000272 iafq 1.5 2.72
  2. Affinity matrix files in Kaldi scp/ark format: Each affinity matrix file should be an N by N square matrix.
  3. Speaker embedding files: If you don't have affinity matrices, you can calculate cosine similarity ark files using ./sc_utils/score_embedding.sh. See the run_demo_clustering.sh file for how to calculate cosine similarity files. (You can choose scp/ark or npy.)
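
A quick sanity check on the segments file can catch malformed lines before clustering. The sketch below is not part of this repo: `parse_segments` is a hypothetical helper that assumes each line follows the `<segment_id> <recording_id> <start> <end>` layout shown in the example above, with times in seconds.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def parse_segments(lines: Iterable[str]) -> Dict[str, List[Tuple[str, float, float]]]:
    """Group Kaldi-style segment lines by recording ID.

    Hypothetical helper: each non-empty line is assumed to be
        <segment_id> <recording_id> <start_sec> <end_sec>
    """
    by_recording = defaultdict(list)
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 4:
            raise ValueError(f"expected 4 fields, got {len(fields)}: {line!r}")
        seg_id, rec_id, start, end = fields
        start, end = float(start), float(end)
        if end <= start:
            raise ValueError(f"segment {seg_id} has non-positive duration")
        by_recording[rec_id].append((seg_id, start, end))
    return dict(by_recording)


example = """\
iaaa-00000-00327-00000000-00000150 iaaa 0 1.5
iaaa-00000-00327-00000075-00000225 iaaa 0.75 2.25
iafq-00000-00272-00000000-00000150 iafq 0 1.5
"""
segments = parse_segments(example.splitlines())
# Number of segments found per recording
print({rec: len(segs) for rec, segs in segments.items()})
```

Grouping by recording ID makes it easy to compare the segment count of each session against the size of its affinity matrix.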

Running the python code with arguments:

python spectral_opt.py --distance_score_file $DISTANCE_SCORE_FILE \
                       --threshold $threshold \
                       --score-metric $score_metric \
                       --max_speaker $max_speaker \
                       --spt_est_thres $spt_est_thres \
                       --segment_file_input_path $SEGMENT_FILE_INPUT_PATH \
                       --spk_labels_out_path $SPK_LABELS_OUT_PATH \
                       --reco2num_spk $reco2num_spk 

Arguments:

If you want to use .npy NumPy files as affinity matrices:

DISTANCE_SCORE_FILE=$PWD/sample_CH_xvector/cos_scores/scores.txt

Two options are available:  

(1) scores.scp: Kaldi style scp file that contains the absolute path to each .ark file and its byte offset. Space-separated \<utt_id\> and \<path\>.

ex) scores.scp

iaaa /path/sample_CH_xvector/cos_scores/scores.1.ark:5
iafq /path/sample_CH_xvector/cos_scores/scores.1.ark:23129
...


(2) scores.txt: List of <utt_id> and the absolute path to .npy files.  
ex) scores.txt

iaaa /path/sample_CH_xvector/cos_scores/iaaa.npy
iafq /path/sample_CH_xvector/cos_scores/iafq.npy
...
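
To illustrate the scores.txt layout, here is a minimal sketch of reading such a list and loading each .npy affinity matrix. It is not the loader used by spectral_opt.py: `load_npy_affinities` is a hypothetical name, and the tiny identity matrix and temporary paths exist only so that the example runs on its own.

```python
import os
import tempfile

import numpy as np


def load_npy_affinities(list_path):
    """Read '<utt_id> <path>' lines and load each .npy file (hypothetical helper)."""
    affinities = {}
    with open(list_path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            utt_id, npy_path = line.split(maxsplit=1)
            mat = np.load(npy_path)
            # Each affinity matrix is expected to be an N by N square matrix.
            if mat.ndim != 2 or mat.shape[0] != mat.shape[1]:
                raise ValueError(f"{utt_id}: expected N x N matrix, got {mat.shape}")
            affinities[utt_id] = mat
    return affinities


# Build a throwaway scores.txt with one 3 x 3 matrix (illustrative data only).
with tempfile.TemporaryDirectory() as tmp:
    npy_path = os.path.join(tmp, "iaaa.npy")
    np.save(npy_path, np.eye(3))
    list_path = os.path.join(tmp, "scores.txt")
    with open(list_path, "w") as f:
        f.write(f"iaaa {npy_path}\n")
    loaded = load_npy_affinities(list_path)

print(loaded["iaaa"].shape)
```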

* **score-metric**: Use 'cos' for an affinity matrix based on cosine similarity.  
ex) 
```bash
score_metric='cos'
```

Or you can use NMESC, as proposed in the paper, to estimate the threshold.

spt_est_thres='NMESC'
threshold='None'

Or you can specify a different threshold for each utterance.

spt_est_thres="thres_utts.txt"
threshold='None'

thres_utts.txt has the following format:
<utt_id> <threshold>  

ex) thres_utts.txt

iaaa 0.105
iafq 0.215
...


* **segment_file_input_path**: "segments" file in Kaldi format. This file is also needed to generate the RTTM file and calculate DER.
```bash
segment_file_input_path=$PWD/sample_CH_xvector/xvector_embeddings/segments
```

ex) segments

iaaa-00000-00327-00000000-00000150 iaaa 0 1.5
iaaa-00000-00327-00000075-00000225 iaaa 0.75 2.25
iaaa-00000-00327-00000150-00000300 iaaa 1.5 3
...
iafq-00000-00272-00000000-00000150 iafq 0 1.5
iafq-00000-00272-00000075-00000225 iafq 0.75 2.25
iafq-00000-00272-00000150-00000272 iafq 1.5 2.72

Cosine similarity calculator script

Running the python code for cosine similarity calculation:

data_dir=$PWD/sample_CH_xvector
pushd $PWD/sc_utils
text_yellow_info "Starting Script: affinity_score.py"
./score_embedding.sh --cmd "run.pl --mem 5G" \
                     --score-metric $score_metric \
                      $data_dir/xvector_embeddings \
                      $data_dir/cos_scores 
popd
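
For readers who want to see what a cosine similarity affinity matrix looks like, the following is a small NumPy sketch. It is not the implementation behind score_embedding.sh: `cosine_affinity` is a hypothetical function, and the three 2-dimensional embeddings are toy values chosen for illustration.

```python
import numpy as np


def cosine_affinity(embeddings):
    """Return the N x N cosine-similarity matrix for N row-vector embeddings."""
    emb = np.asarray(embeddings, dtype=float)
    norms = np.linalg.norm(emb, axis=1, keepdims=True)
    unit = emb / np.clip(norms, 1e-12, None)  # guard against zero-length vectors
    return unit @ unit.T


# Toy embeddings: one row per segment (illustrative values only).
emb = np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]])
affinity = cosine_affinity(emb)
print(np.round(affinity, 3))
```

The result is symmetric with ones on the diagonal, which matches the N by N square shape expected for each affinity matrix file.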

Expected output result of one-click script

$ source run_demo_clustering.sh 
=== [INFO] The python_envfolder exists: /.../Auto-Tuning-Spectral-Clustering/env_nmesc 
=== [INFO] Cosine similariy scores exist: /.../Auto-Tuning-Spectral-Clustering/sample_CH_xvector/cos_scores 
=== [INFO] Running Spectral Clustering with .npy input... 
=== [INFO] .scp file and .ark files were provided
Scanning eig_ratio of length [19] mat size [76] ...
1  score_metric: cos  affinity matrix pruning - threshold: 0.105  key: iaaa Est # spk: 2  Max # spk: 8  MAT size :  (76, 76)
Scanning eig_ratio of length [15] mat size [62] ...
2  score_metric: cos  affinity matrix pruning - threshold: 0.194  key: iafq Est # spk: 2  Max # spk: 8  MAT size :  (62, 62)
Method: Spectral Clustering has been finished 
=== [INFO] Computing RTTM 
=== [INFO] RTTM calculation was successful. 
=== [INFO] NMESC auto-tuning | Total Err. (DER) -[ 0.32 % ] Speaker Err. [ 0.32 % ] 
=== [INFO] .scp file and .ark files were provided
1  score_metric: cos  affinity matrix pruning - threshold: 0.050  key: iaaa Est # spk: 2  Max # spk: 8  MAT size :  (76, 76)
2  score_metric: cos  affinity matrix pruning - threshold: 0.050  key: iafq Est # spk: 5  Max # spk: 8  MAT size :  (62, 62)
Method: Spectral Clustering has been finished 
=== [INFO] Computing RTTM 
=== [INFO] RTTM calculation was successful. 
=== [INFO] Threshold 0.05 | Total Err. (DER) -[ 20.57 % ] Speaker Err. [ 20.57 % ] 
Loading reco2num_spk file:  reco2num_spk
=== [INFO] .scp file and .ark files were provided
1  score_metric: cos  Rank based pruning - RP threshold: 0.0500  key: iaaa  Given Number of Speakers (reco2num_spk): 2  MAT size :  (76, 76)
2  score_metric: cos  Rank based pruning - RP threshold: 0.0500  key: iafq  Given Number of Speakers (reco2num_spk): 2  MAT size :  (62, 62)
Method: Spectral Clustering has been finished 
=== [INFO] Computing RTTM 
=== [INFO] RTTM calculation was successful. 
=== [INFO] Known Num. Spk. | Total Err. (DER) -[ 0.15 % ] Speaker Err. [ 0.15 % ] 

Authors

Tae Jin Park: inctrljinee@gmail.com, tango4j@gmail.com
Kyu J. Han
Manoj Kumar
Shrikanth Narayanan