Closed: kukas closed this issue 5 years ago.
First, can you please share the papers you are talking about?
Note that I am now in the process of retraining everything from scratch and will then share the pre-trained models (no ETA, though) to make sure everybody can at least re-run the whole pipeline.
Hi, I could get down to 6.7% error for SAD on the AMI training data after 1000 epochs by changing the feature parameters as described in https://arxiv.org/pdf/1609.04301.pdf. For SCD, these changes do not really give better results. Additionally, I tested SCD on my own files, and this approach seems to have problems with generalization. It would be great if anyone had hints on choosing the Peak/Binarize parameters for better generalization. Has anyone tried training on other datasets?
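For concreteness, the thresholds in question are exposed through the `Binarize` and `Peak` classes; below is a minimal sketch assuming the pyannote.audio 1.x API, with placeholder values rather than tuned recommendations:

```python
# Hypothetical tuning sketch for pyannote.audio 1.x post-processing.
# All parameter values below are placeholders, not recommendations.
from pyannote.audio.signal import Binarize, Peak

# SAD: raw speech scores are thresholded with onset/offset hysteresis.
binarize = Binarize(
    onset=0.55,           # score above which a speech region starts
    offset=0.45,          # score below which a speech region ends
    min_duration_on=0.1,  # drop speech regions shorter than 100 ms
    min_duration_off=0.1, # fill non-speech gaps shorter than 100 ms
)

# SCD: local maxima of the change scores become segment boundaries.
peak = Peak(
    alpha=0.5,        # peak detection threshold
    min_duration=1.0, # minimum distance between two boundaries (seconds)
)

# sad_scores / scd_scores would be SlidingWindowFeature objects produced
# by the trained SAD / SCD models, e.g.:
# speech_regions = binarize.apply(sad_scores, dimension=1)
# boundaries = peak.apply(scd_scores, dimension=1)
```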
I meant your paper mentioned in the README: https://www.isca-speech.org/archive/Interspeech_2018/pdfs/1750.pdf. The DER reported there is around 25%, so my result seems too high to me.
Thank you! That would certainly help! I will keep an eye on your repository :-)
Thanks for sharing your results! Did the better SAD performance affect the final DER?
This 25% was obtained on a different (and easier) dataset, ETAPE, so the two numbers cannot really be compared. AMI contains recordings of meetings with spontaneous and overlapping speech; ETAPE is broadcast news with mostly prepared speech.
Is there a reference for SOTA (or even just competitive) DER on the AMI corpus? A preliminary search didn't yield much.
Anecdotally, I found that minor improvements to SAD translated into a small (but noticeable) improvement in overall DER. Using mostly default configurations, I achieve a test diarization error of ~46% and a dev error of ~44% on AMI. Below are a few general things that might be useful for improving model performance on AMI (though I should note I have very little experience in this domain).
Yin et al. (2018) use affinity propagation to cluster the learned speaker embeddings. It might be useful to also try agglomerative hierarchical clustering with a prior on the number of speakers (Garcia-Romero et al., 2017, report that a simple geometric decay on the number of speakers is broadly useful), or spectral clustering (e.g., Wang et al., 2017).

@hbredin, for data of this sort (i.e., noisy, overlapping, spontaneous speech with multiple speakers), are there any modeling strategies / approaches that seem promising? If so, I'd be happy to try to implement some of them and submit a PR!
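To make the comparison concrete, here is a rough sketch of trying those three clustering strategies on a matrix of segment embeddings with scikit-learn; the `embeddings` array and speaker count are placeholders, and this is not the pipeline's actual clustering interface:

```python
# Hypothetical comparison of clustering strategies on speaker embeddings.
# `embeddings` stands in for a (n_segments, dim) array produced by the
# embedding model; this is NOT pyannote's actual clustering interface.
import numpy as np
from sklearn.cluster import (AffinityPropagation, AgglomerativeClustering,
                             SpectralClustering)

rng = np.random.default_rng(0)
embeddings = rng.standard_normal((50, 128))  # placeholder embeddings

# 1. Affinity propagation (as in Yin et al. 2018): speaker count inferred.
ap_labels = AffinityPropagation().fit_predict(embeddings)

# 2. Agglomerative clustering with a prior on the number of speakers
#    (e.g., drawn from a geometric decay as in Garcia-Romero et al. 2017).
n_speakers = 4  # placeholder prior
ahc_labels = AgglomerativeClustering(
    n_clusters=n_speakers, linkage="average", affinity="cosine"
).fit_predict(embeddings)  # `affinity` was renamed `metric` in newer sklearn

# 3. Spectral clustering on a non-negative cosine-similarity matrix.
normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
similarity = (normed @ normed.T + 1.0) / 2.0  # shift cosine into [0, 1]
sc_labels = SpectralClustering(
    n_clusters=n_speakers, affinity="precomputed"
).fit_predict(similarity)
```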
@PiotrTa Same situation here... wondering if somebody can generalize this a little bit, especially on real speech (like phone conversations).
@ZhuoranLyu Using an embedding network on neighboring sliding windows is probably the way to go if you aim at SCD (see the sketch below). Dataset collection is somewhat easier that way. An important aspect is to have many (thousands of) different speakers in the dataset. Including noisy samples should also help with generalization.
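A minimal sketch of that idea, assuming per-window embeddings have already been computed (all names here are illustrative, not a real API):

```python
# Illustrative speaker-change detection from embeddings of neighboring
# sliding windows: a large cosine distance between adjacent windows
# suggests a speaker change at that point.
import numpy as np

def change_scores(embeddings: np.ndarray) -> np.ndarray:
    """embeddings: (n_windows, dim), one embedding per sliding window.
    Returns the cosine distance between each pair of adjacent windows."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    return 1.0 - np.sum(normed[:-1] * normed[1:], axis=1)

def detect_changes(scores: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Keep local maxima above a threshold as candidate change points."""
    is_peak = (scores[1:-1] > scores[:-2]) & (scores[1:-1] > scores[2:])
    return np.where(is_peak & (scores[1:-1] > threshold))[0] + 1

# usage: boundaries = detect_changes(change_scores(window_embeddings))
```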
@PiotrTa However, it's not easy to get enough data for the speaker diarization task, especially for a specific language. Hence, I tried a conventional approach like BIC, which even gives better results.
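For readers who haven't seen it, here is a minimal sketch of the ΔBIC criterion for deciding whether two adjacent feature segments come from different speakers; λ = 1 and the full-covariance Gaussian assumptions follow the classic Chen & Gopalakrishnan formulation:

```python
# Illustrative ΔBIC computation between two adjacent feature segments;
# a positive value suggests a speaker change between them.
import numpy as np

def delta_bic(x: np.ndarray, y: np.ndarray, lam: float = 1.0) -> float:
    """x, y: (n_frames, dim) feature matrices (e.g., MFCCs) of the two
    segments; n_frames should be well above dim for stable covariances."""
    z = np.vstack([x, y])
    n1, n2, n = len(x), len(y), len(x) + len(y)
    d = x.shape[1]

    def logdet(cov: np.ndarray) -> float:
        # log-determinant of a covariance matrix
        return np.linalg.slogdet(cov)[1]

    # model-complexity penalty: one mean + one full covariance per segment
    penalty = 0.5 * lam * (d + 0.5 * d * (d + 1)) * np.log(n)
    return (0.5 * n * logdet(np.cov(z.T))
            - 0.5 * n1 * logdet(np.cov(x.T))
            - 0.5 * n2 * logdet(np.cov(y.T))
            - penalty)
```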
FYI, it took me some time but I just pushed the pre-trained models. With those models, I get a DER (with no collar) of around 33% on the AMI test set, half of which is due to missed detection caused by overlapping speech.
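For anyone trying to reproduce that number: DER sums missed detection, false alarm, and speaker confusion over the total reference speech duration, and pyannote.metrics computes it directly. A toy example with hand-built annotations (the segments and labels are made up):

```python
# Sketch of computing DER with no collar using pyannote.metrics.
# The reference/hypothesis annotations here are toy placeholders; in
# practice they would be loaded from RTTM files.
from pyannote.core import Annotation, Segment
from pyannote.metrics.diarization import DiarizationErrorRate

reference = Annotation()
reference[Segment(0.0, 10.0)] = "spk_A"
reference[Segment(12.0, 20.0)] = "spk_B"

hypothesis = Annotation()
hypothesis[Segment(0.0, 11.0)] = "s1"
hypothesis[Segment(11.0, 20.0)] = "s2"

metric = DiarizationErrorRate(collar=0.0, skip_overlap=False)
der = metric(reference, hypothesis)
print(f"DER = {100 * der:.1f}%")
```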
I also would like to take this opportunity to answer @ddbourgin's suggestions (sorry for the delay).

The pyannote.pipeline library is a first step in this direction (and is now used by pyannote.audio).
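To illustrate what that library automates: hyper-parameter tuning treats the dev-set DER as a black-box objective over the pipeline's thresholds. A toy grid search conveying the idea (this is not pyannote.pipeline's actual API):

```python
# Toy grid search over two post-processing thresholds against a dev-set
# objective; pyannote.pipeline automates this idea, this is NOT its API.
import itertools

def dev_der(onset: float, offset: float) -> float:
    """Placeholder: would run the pipeline on the dev set and return DER."""
    return abs(onset - 0.6) + abs(offset - 0.4)  # dummy objective

best = min(
    itertools.product([0.3, 0.4, 0.5, 0.6, 0.7], repeat=2),
    key=lambda params: dev_der(*params),
)
print(f"best (onset, offset) = {best}, DER = {dev_der(*best):.3f}")
```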
[…pyannote.audio…] seems counter-intuitive, but it does make more sense for embeddings trained as internal layers of speaker classification networks (for which there is no reason cosine or euclidean distance is optimal).

@ddbourgin If you are still interested in contributing a few of those ideas, let me know. Here are a few things that would be helpful to try:
- other network architectures (pyannote-audio is mostly LSTM for now)

Closing as I believe the original issue has been addressed. To continue the discussion, please open a new issue.
Hi, I tried to follow the tutorial using the AMI dataset, but my pipeline reaches only 48% DER. According to the papers, I expected the performance to be 2-3x better, so I might have made some mistake along the way. Would anyone have a bit of time to compare my results with theirs? I will be grateful for any help.
Here is some information about the pipeline modules:

- SCD: trained for 714 iterations, 23% train EER, 0.24 train loss
- SAD: trained for 897 iterations, 7.9% train EER, 0.135 train loss
- EMB: trained for 3330 iterations, 0.0791 train loss
- pipeline: Best = 48.251% after 280 trials (on the development set)

I used all the default config files from the tutorials, and here are the commands I used to get these results (I stopped the training commands with a kill signal):

[…]
Kind regards