google / uis-rnn

This is the library for the Unbounded Interleaved-State Recurrent Neural Network (UIS-RNN) algorithm, corresponding to the paper Fully Supervised Speaker Diarization.
https://arxiv.org/abs/1810.04719
Apache License 2.0

Understanding the use of resize_sequence() and the batch creation for RNN training #53

Closed: vickianand closed this issue 5 years ago

vickianand commented 5 years ago

From the code, what I understand is that the resize_sequence() function is used to create a list of numpy arrays, with observation vectors from the same cluster collected into the same array. Optionally, it uses the sample_permuted_segments() function to generate num_permutations permutations of each of those arrays. I think that by doing this we lose the following two pieces of information about the data (a rough sketch of my understanding follows the list):

  1. The order of utterances is lost due to permutation, and also due to collecting same-labeled segments from different parts of the utterance.
  2. The order of entries within an utterance is also lost due to the use of the sample_permuted_segments() function.
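
To make sure I'm reading the code correctly, here is a rough sketch of my understanding of those two steps. The names group_by_cluster and sample_permutations below are my own paraphrase, not the library's actual implementation:

```python
import numpy as np

def group_by_cluster(sequence, cluster_id):
    """Sketch: collect observation vectors sharing a label into one array."""
    # sequence: (num_frames, dim) observations of one utterance
    # cluster_id: list of num_frames speaker labels
    grouped = []
    for label in sorted(set(cluster_id)):
        indices = [i for i, c in enumerate(cluster_id) if c == label]
        grouped.append(sequence[indices, :])  # one speaker's frames, original order kept
    return grouped

def sample_permutations(per_speaker_frames, num_permutations):
    """Sketch: draw several random orderings of the same speaker's frames."""
    return [per_speaker_frames[np.random.permutation(len(per_speaker_frames))]
            for _ in range(num_permutations)]
```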

Could someone please help me understand why we shuffle the sequences like this?

Thanks in advance!

wq2012 commented 5 years ago

Hi @vickianand, please see my responses below:

The order of utterances is lost due to permutation, and also due to collecting same-labeled segments from different parts of the utterance.

The order of utterances is not important, and we should NOT learn anything from it. Each utterance is completely independent, containing a full conversation from multiple speakers. Multiple utterances are just multiple examples for training, and training should not depend on the order in which the data is read.
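
To put it another way, the training set can be thought of as an unordered collection of independent examples, one per utterance, so shuffling that collection should not change anything the model learns. A hypothetical illustration (not the library's actual data-loading code; the shapes and labels below are made up):

```python
import random
import numpy as np

# Hypothetical training set: each utterance is an independent example,
# a (num_frames, dim) embedding sequence plus its per-frame speaker labels.
train_examples = [
    (np.random.rand(120, 256), ["A"] * 60 + ["B"] * 60),
    (np.random.rand(200, 256), ["X"] * 80 + ["Y"] * 70 + ["X"] * 50),
]

# The order of these examples carries no information, so reading them
# in a different (e.g. shuffled) order is equally valid for training.
random.shuffle(train_examples)
```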

The order of entries within an utterance is also lost due to the use of the sample_permuted_segments() function.

True! But partially.

The segment permutation is considered a data augmentation step: it sacrifices some of the ordering information, but adds more variation to the training data. If you call fit on the same input twice, the input will be permuted (and thus augmented) differently each time. This is important because diarization training sets are usually very small, since timestamped labels are expensive to obtain.
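
As a simplified illustration of the augmentation effect (a sketch only, using a plain permutation rather than the library's exact segment-level logic): each pass over the same per-speaker frames can see them in a different order, so the model gets extra sequence variety out of the same small dataset.

```python
import numpy as np

def augment_with_permutations(speaker_frames, num_permutations):
    """Sketch: turn one speaker's frame sequence into several permuted copies.

    Each call draws fresh random permutations, so repeated training passes
    see differently ordered (i.e. differently augmented) data.
    """
    return [speaker_frames[np.random.permutation(len(speaker_frames))]
            for _ in range(num_permutations)]

frames = np.random.rand(8, 256)  # toy data: 8 frames of one speaker
augmented = augment_with_permutations(frames, num_permutations=3)
# augmented now holds 3 differently ordered copies of the same 8 frames
```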

We admit that the permutation is not necessarily the best practice; it is simply what we found works best for us. We didn't really explore all variations of the algorithm. If you find a better alternative solution here, that would be a novel contribution. Feel free to share/publish it.

vickianand commented 5 years ago

Thank you for your quick response @wq2012. Thinking of the permutation as a data augmentation step makes sense if it helps improve performance on the validation set. So I'll try with and without it and see which works better for my use case. Thank you again!