PES2g closed this issue 5 years ago
We didn't run any evaluations on ICSI because we couldn't find any benchmark on this dataset, so there is no good baseline to compare against.
About the poor performance you are seeing on ICSI, here are a few possible reasons I have in mind:
Thanks for your detailed explanation.
In my experiment, I used part of the ICSI data as training data for UIS-RNN.
As for the embeddings, the amount of audio in ICSI is small compared to the training dataset of the speaker embedding, so I fine-tuned the speaker embedding on ICSI and used verification accuracy to judge the fine-tuned embedding. I found that the improvement was limited. So for a specific new scenario, a large number of utterances from similar speakers is needed to train the speaker embedding.
Finally, I want to know if you have some intuition about the amount of data (how many hours) that is enough for training UIS-RNN.
For the five-fold experiments on SRE 2000 disk-8 (CALLHOME), each fold uses 400 utterances for training and 100 utterances for testing. Each utterance is about 1 min long (some are longer).
So in this setup, the training data is about 400 min, i.e. a bit under 7 hours.
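To make the arithmetic above explicit, here is a small sketch of the per-fold data budget. It assumes every utterance is exactly 1 minute long, which is only an approximation (some are longer), so the numbers are a lower bound. The constant and function names are made up for illustration.

```python
# Rough per-fold data budget for the five-fold CALLHOME setup described above.
# Assumption: each utterance is exactly 1 minute; the thread says "about 1 min",
# with some longer, so treat these totals as a lower bound.
TRAIN_UTTERANCES_PER_FOLD = 400
TEST_UTTERANCES_PER_FOLD = 100
MINUTES_PER_UTTERANCE = 1.0  # assumed average length


def total_minutes(num_utterances, minutes_each=MINUTES_PER_UTTERANCE):
    """Total audio duration in minutes for a set of utterances."""
    return num_utterances * minutes_each


train_minutes = total_minutes(TRAIN_UTTERANCES_PER_FOLD)
test_minutes = total_minutes(TEST_UTTERANCES_PER_FOLD)
train_hours = train_minutes / 60.0

print(f"train: {train_minutes:.0f} min (about {train_hours:.1f} h)")
print(f"test:  {test_minutes:.0f} min")
```

So the answer to "how many hours" in this particular setup is roughly 6 to 7 hours of training audio per fold, under the stated assumption.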
In my experiments, the model performs fine on conversational telephone speech, but its performance degrades seriously in multi-person meeting scenarios such as ICSI. For ICSI, the confusion error can be as high as 30%. Only the DER for NIST SRE 2000 CALLHOME is provided in the paper. Since you used ICSI as part of the training set in your paper, did you test the model's performance on ICSI?
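For readers unfamiliar with the term, here is a minimal sketch of how a confusion error can be computed on frame-level labels. This is a simplification and not the scoring used in the paper or by the NIST tools: it assumes one speaker label per frame, no overlapped speech, no collar, and no missed or false-alarm speech. The function name `confusion_error` is hypothetical, and the brute-force mapping search is only practical for a handful of speakers.

```python
from itertools import permutations


def confusion_error(reference, hypothesis):
    """Fraction of frames labeled with the wrong speaker under the best
    one-to-one mapping between hypothesis and reference speakers.

    Simplifying assumptions: equal-length frame-level label sequences,
    exactly one speaker per frame, and labels that are never None.
    """
    if len(reference) != len(hypothesis):
        raise ValueError("reference and hypothesis must have the same length")
    if not reference:
        return 0.0

    ref_ids = sorted(set(reference))
    hyp_ids = sorted(set(hypothesis))
    # Pad with None so that surplus hypothesis speakers can stay unmapped.
    n = max(len(ref_ids), len(hyp_ids))
    ref_slots = ref_ids + [None] * (n - len(ref_ids))

    best_correct = 0
    # Brute-force search over one-to-one mappings (fine for few speakers).
    for perm in permutations(ref_slots, len(hyp_ids)):
        mapping = dict(zip(hyp_ids, perm))
        correct = sum(
            1 for r, h in zip(reference, hypothesis) if mapping[h] == r
        )
        best_correct = max(best_correct, correct)

    return 1.0 - best_correct / len(reference)


# Toy example: 10 frames, 3 speakers, one frame assigned to the wrong speaker.
reference = [0, 0, 0, 1, 1, 1, 1, 2, 2, 2]
hypothesis = ["a", "a", "b", "b", "b", "b", "b", "c", "c", "c"]
toy_error = confusion_error(reference, hypothesis)
print(f"confusion error: {toy_error:.2f}")
```

In the toy example one frame out of ten is confused, so the error is about 10%. A 30% confusion error on ICSI would mean roughly three in ten speech frames end up attributed to the wrong speaker after the best mapping.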