pyannote / pyannote-audio

Neural building blocks for speaker diarization: speech activity detection, speaker change detection, overlapped speech detection, speaker embedding
http://pyannote.github.io
MIT License

High DER on AMI dataset #136

Closed kukas closed 5 years ago

kukas commented 5 years ago

Hi, I tried to follow the tutorial using the AMI dataset, but my pipeline only reaches 48% DER. According to the papers, I expected the performance to be 2-3x better, so I may have made a mistake along the way. Would anyone have a bit of time to compare their results with mine? I would be grateful for any help.

Here is some information about the pipeline modules:

- SCD: trained for 714 iterations, 23% train EER, 0.24 train loss
- SAD: trained for 897 iterations, 7.9% train EER, 0.135 train loss
- EMB: trained for 3330 iterations, train loss = 0.0791
- pipeline: best = 48.251% DER after 280 trials (on the development set)
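For context, DER sums three error types (false alarm, missed detection, speaker confusion) over the total reference speech duration. Here is a minimal frame-level sketch on toy labels (hypothetical data; the real metric also computes an optimal mapping between hypothesis and reference speaker labels, which this omits):

```python
# Frame-level DER sketch on toy labels (not pyannote output).
# DER = (false alarm + missed detection + confusion) / total reference speech.

def frame_der(reference, hypothesis):
    """reference/hypothesis: per-frame speaker labels, None = non-speech."""
    total_speech = sum(1 for r in reference if r is not None)
    false_alarm = missed = confusion = 0
    for r, h in zip(reference, hypothesis):
        if r is None and h is not None:
            false_alarm += 1      # hypothesis speaks where reference is silent
        elif r is not None and h is None:
            missed += 1           # reference speaks but hypothesis is silent
        elif r is not None and h is not None and r != h:
            confusion += 1        # both speak, but the speaker label differs
    return (false_alarm + missed + confusion) / total_speech

ref = ['A', 'A', 'A', 'B', 'B', None, None, 'A']
hyp = ['A', 'A', 'B', 'B', 'B', None, 'A',  'A']
print(frame_der(ref, hyp))  # 1 confusion + 1 false alarm over 6 speech frames = 1/3
```

In practice, `pyannote.metrics` computes this on segment timelines rather than frames, and maps hypothesis labels to reference labels optimally before counting confusion.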

I used all the default config files from the tutorials. Here are the commands I used to get these results (I stopped the training commands with a kill signal):

```shell
iterations=1000
train_protocol="AMI.SpeakerDiarization.MixHeadset"

pyannote-speech-feature tutorials/feature-extraction ${train_protocol}
pyannote-speech-detection train --to=$iterations tutorials/speech-activity-detection ${train_protocol}
pyannote-change-detection train --to=$iterations tutorials/change-detection ${train_protocol}
pyannote-speaker-embedding train --to=$iterations tutorials/speaker-embedding ${train_protocol}

pyannote-speech-detection apply tutorials/speech-activity-detection/train/${train_protocol}.train/weights/0897.pt ${train_protocol} tutorials/pipeline/sad
pyannote-change-detection apply tutorials/change-detection/train/${train_protocol}.train/weights/0714.pt ${train_protocol} tutorials/pipeline/scd
pyannote-speaker-embedding apply tutorials/speaker-embedding/train/${train_protocol}.train/weights/3330.pt ${train_protocol} tutorials/pipeline/emb

pyannote-pipeline train --trials=$iterations tutorials/pipeline ${train_protocol}
```

Kind regards

hbredin commented 5 years ago

First, can you please share the papers you are talking about?

Note that I am now in the process of retraining everything from scratch and will then share the pre-trained models (no ETA, though), to make sure everybody can at least re-run the whole pipeline.

PiotrTa commented 5 years ago

Hi, I could get down to 6.7% on SAD with AMI training data after 1000 epochs by changing the feature parameters as described in https://arxiv.org/pdf/1609.04301.pdf. For SCD, they did not really give better results. Additionally, I tested SCD on my own files, and this approach seems to have problems with generalization. It would be great if anyone had hints on choosing the Peak/Binarize parameters for better generalization. Has anyone tried training on other datasets?
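On the Binarize side: SAD scores are typically turned into speech regions with two-threshold hysteresis, where the onset threshold controls when a region opens and the offset threshold controls when it closes. A toy sketch (the `onset`/`offset` names mirror pyannote's Binarize parameters, but the implementation here is illustrative only):

```python
def binarize(scores, onset=0.7, offset=0.7):
    """Two-threshold hysteresis: open a speech region when the score
    rises above `onset`, close it when it falls below `offset`."""
    regions, active, start = [], False, None
    for i, s in enumerate(scores):
        if not active and s > onset:
            active, start = True, i
        elif active and s < offset:
            regions.append((start, i))
            active = False
    if active:  # still speaking at the end of the sequence
        regions.append((start, len(scores)))
    return regions

scores = [0.1, 0.8, 0.9, 0.6, 0.4, 0.2, 0.8, 0.3]
print(binarize(scores, onset=0.7, offset=0.5))  # [(1, 4), (6, 7)]
```

Raising `onset` trades missed speech for fewer false alarms; setting `offset` below `onset` makes regions "sticky" and less fragmented, which may help on data that is noisier than the training set.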

kukas commented 5 years ago

> First, can you please share the papers you are talking about?

I meant your paper mentioned in the README: https://www.isca-speech.org/archive/Interspeech_2018/pdfs/1750.pdf. The reported DER there is around 25%, so my result seems too high to me.

> Note that I am now in the process of retraining everything from scratch and then share the pre-trained models (no ETA, though) to make sure everybody can at least re-run the whole pipeline.

Thank you! That would certainly help! I will keep an eye on your repository :-)

> Hi, I could get down to 6.7% on SAD with AMI training data after 1000 epochs by changing the feature parameters as described in https://arxiv.org/pdf/1609.04301.pdf. [...] Has anyone tried training on other datasets?

Thanks for sharing your results! Did the better SAD performance affect the final DER?

hbredin commented 5 years ago

> > First, can you please share the papers you are talking about?
>
> I meant your paper mentioned in the README: https://www.isca-speech.org/archive/Interspeech_2018/pdfs/1750.pdf. The reported DER there is around 25%, so my result seems too high to me.

This 25% was obtained on a different, easier dataset (ETAPE), so the numbers cannot really be compared. AMI contains recordings of meetings with spontaneous and overlapping speech; ETAPE is broadcast news with mostly prepared speech.

ddbourgin commented 5 years ago

Is there a reference for SOTA (or, even just competitive) DER on the AMI corpus? A preliminary search didn't yield much.

Anecdotally I found that minor improvements to SAD translated into a small (but noticeable) impact on overall DER. Using mostly default configurations I achieve a test diarization error of ~46% and a dev error of ~44% on AMI. Below are a few general things that might be useful for improving model performance on AMI (though I should note I have very little experience in this domain).

@hbredin, for data of this sort (i.e., noisy + overlapping spontaneous speech with multiple speakers), are there any modeling strategies / approaches that seem promising? If so, I'd be happy to try to implement some of them and submit a PR!

ZhuoranLyu commented 5 years ago

> Hi, I could get down to 6.7% on SAD with AMI training data after 1000 epochs by changing the feature parameters as described in https://arxiv.org/pdf/1609.04301.pdf. [...] Has anyone tried training on other datasets?

@PiotrTa Same situation here... wondering if somebody can generalize this a little bit, especially on real speech (like phone conversations).

PiotrTa commented 5 years ago

> @PiotrTa Same situation here... wondering if somebody can generalize this a little bit, especially on real speech (like phone conversations).

@ZhuoranLyu Using an embedding network and neighboring sliding windows is probably the way to go if you aim at SCD. Dataset collection is somewhat easier for that. An important aspect is having many (thousands of) different speakers in the dataset. Including noisy samples should also help with generalization.
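The neighboring-sliding-window idea can be sketched as follows: embed consecutive windows, score each boundary by the cosine distance between adjacent embeddings, and treat peaks as candidate change points (toy 2-D vectors here, not real speaker embeddings):

```python
import math

def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def change_scores(embeddings):
    # Distance between each window and the next; peaks suggest a
    # speaker change at that boundary.
    return [cosine_distance(embeddings[i], embeddings[i + 1])
            for i in range(len(embeddings) - 1)]

# Toy 2-D "embeddings": three windows of speaker 1, then three of speaker 2.
embs = [[1.0, 0.1], [0.9, 0.2], [1.0, 0.0],
        [0.1, 1.0], [0.0, 0.9], [0.2, 1.0]]
scores = change_scores(embs)
peak = max(range(len(scores)), key=scores.__getitem__)
print(peak)  # 2: the boundary between the two speakers
```

In a real system, the peak-picking threshold on these scores plays the same role as the Peak parameters discussed above, and tuning it on held-out data from the target domain is what decides generalization.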

ZhuoranLyu commented 5 years ago

@PiotrTa However, it is not easy to get enough data for the speaker diarization task, especially for a specific language. Hence I tried some conventional approaches like BIC, which even gave better results.
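For the BIC approach mentioned above: a change point between two segments is accepted when modelling them as two Gaussians beats a single Gaussian by more than a complexity penalty. A 1-D sketch (real systems use full-covariance multivariate Gaussians over acoustic features; `lam` is the usual penalty weight λ):

```python
import math

def delta_bic(x, split, lam=1.0):
    """BIC criterion for splitting the 1-D sequence x at index `split`
    into two Gaussians; positive values favour a change point."""
    def log_var(seg):
        m = sum(seg) / len(seg)
        return math.log(sum((v - m) ** 2 for v in seg) / len(seg))
    n, n1, n2 = len(x), split, len(x) - split
    # d=1: one extra mean + one extra variance -> 2 extra parameters
    penalty = lam * 0.5 * 2 * math.log(n)
    return 0.5 * (n * log_var(x)
                  - n1 * log_var(x[:split])
                  - n2 * log_var(x[split:])) - penalty

print(delta_bic([0.0, 0.2, -0.2, 5.0, 5.2, 4.8], 3))   # positive: accept the split
print(delta_bic([0.0, 0.2, -0.2, 0.1, -0.1, 0.0], 3))  # negative: no change point
```

One appeal of BIC over embedding-based SCD is exactly the point raised here: it needs no training data at all, only the test recording itself, so it cannot suffer a domain mismatch.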

hbredin commented 5 years ago

FYI, it took me some time but I just pushed:

With those models, I get a DER (with no collar) of around 33% on the AMI test set, half of which is missed detection due to overlapping speech.

I would also like to take this opportunity to answer @ddbourgin's suggestions (sorry for the delay).

@ddbourgin If you are still interested in contributing a few of those ideas, let me know. Here are a few things that would be helpful to try:

hbredin commented 5 years ago

Closing as I believe the original issue has been addressed. To continue the discussion, please open a new issue.