google / uis-rnn

This is the library for the Unbounded Interleaved-State Recurrent Neural Network (UIS-RNN) algorithm, corresponding to the paper Fully Supervised Speaker Diarization.
https://arxiv.org/abs/1810.04719
Apache License 2.0

Performance degrades for multi-person meetings #29

Closed PES2g closed 5 years ago

PES2g commented 5 years ago

In our experiments, the model's performance is fine for conversational telephone speech, but it degrades severely in multi-person meeting scenarios such as ICSI, where the confusion error can reach 30%. The paper only reports DER on NIST SRE 2000 CALLHOME. Since you used ICSI as part of the training set, did you test the model's performance on ICSI?

wq2012 commented 5 years ago

We didn't run any evaluations on ICSI, because we didn't find any benchmark on this dataset, so there was no good baseline to compare against.

About the poor performance you are seeing on ICSI, here are a few possible reasons I have in mind:

  1. Quality of the embeddings. The diarization performance significantly depends on the quality of speaker embeddings. If you want to see good performance on ICSI, your speaker embedding training set should contain similar data (acoustic environment, microphone, accents, etc.).
  2. Training of UIS-RNN. The training data of UIS-RNN should also contain some data similar to ICSI. If all your training data are significantly different from ICSI, UIS-RNN is expected to fail, because it is supervised. UIS-RNN is mainly intended for in-domain training and deployment.
  3. Some hyperparameters may need to be slightly changed to perform well on a new dataset. The default parameters perform well only on CALLHOME.
  4. The currently open-sourced UIS-RNN is an incomplete version, due to https://github.com/google/uis-rnn/issues/4. We are still working on this.
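To give a feel for point 3, here is a hedged sketch of how one hyperparameter, the CRP concentration `crp_alpha`, trades off between assigning a segment to an existing speaker and creating a new one. This is the plain Chinese Restaurant Process prior only; the actual UIS-RNN combines a distance-dependent variant with the RNN's observation likelihood, so this is an illustration, not the library's code.

```python
def crp_prior(counts, crp_alpha):
    """Prior probabilities over (existing speakers..., new speaker).

    counts: number of segments already assigned to each existing speaker.
    crp_alpha: concentration parameter; larger values favor new speakers.
    """
    total = sum(counts) + crp_alpha
    existing = [c / total for c in counts]
    new_speaker = crp_alpha / total
    return existing + [new_speaker]

# With 2 speakers seen (8 and 2 segments), a small alpha makes a new
# speaker unlikely; raising alpha makes new speakers more probable.
print(crp_prior([8, 2], crp_alpha=1.0))  # new-speaker prob = 1/11
print(crp_prior([8, 2], crp_alpha=5.0))  # new-speaker prob = 5/15
```

On a meeting corpus like ICSI, with more concurrent speakers than CALLHOME, a default tuned for two-speaker telephone calls may under-generate new speakers, which is one way confusion error grows.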
PES2g commented 5 years ago

Thanks for your detailed explanation.

In my experiment, I used part of the ICSI data as training data for UIS-RNN.

But for the embeddings, the amount of audio in ICSI is small compared to the speaker embedding training dataset, so I fine-tuned the speaker embedding model on ICSI and used verification accuracy to judge its performance; I found the improvement was limited. So for a specific new scenario, a large number of similar speaker utterances is needed to train the speaker embedding model.
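For concreteness, one simple way to compute the verification accuracy mentioned above is to threshold the cosine similarity between embedding pairs. This is a hedged, self-contained sketch, not code from uis-rnn or the embedding model; the embeddings, trials, and threshold below are illustrative placeholders.

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def verification_accuracy(trials, threshold=0.5):
    """trials: list of (emb_a, emb_b, same_speaker) tuples.

    A pair is accepted as same-speaker if similarity >= threshold.
    """
    correct = 0
    for emb_a, emb_b, same_speaker in trials:
        predicted_same = cosine_similarity(emb_a, emb_b) >= threshold
        correct += predicted_same == same_speaker
    return correct / len(trials)

# Toy trials: two same-speaker pairs, one different-speaker pair.
trials = [
    ([1.0, 0.0], [0.9, 0.1], True),
    ([0.0, 1.0], [0.1, 0.9], True),
    ([1.0, 0.0], [0.0, 1.0], False),
]
print(verification_accuracy(trials))  # 1.0 on this toy set
```

In practice the threshold is chosen on a held-out trial list (e.g. at the equal error rate), so a small gain in verification accuracy after fine-tuning can indeed translate into limited diarization improvement.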

Finally, I'd like to know whether you have any intuition about the amount of data (how many hours) that is enough to train UIS-RNN.

wq2012 commented 5 years ago

For the five-fold experiments on SRE 2000 disk-8 (CALLHOME), each fold uses 400 utterances for training and 100 utterances for testing. Each utterance is about 1 minute long (some are longer).

So in this setup, the training data is about 400 minutes.
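The arithmetic behind that figure, using the numbers above (the 1-minute duration is approximate, since some utterances are longer):

```python
# Per-fold training data in the CALLHOME five-fold setup.
utterances_per_fold = 400
minutes_per_utterance = 1  # approximate; some utterances run longer

training_minutes = utterances_per_fold * minutes_per_utterance
training_hours = training_minutes / 60

print(training_minutes)           # 400 minutes
print(round(training_hours, 1))   # roughly 6.7 hours
```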