srvk / DiViMe

ACLEW Diarization Virtual Machine
Apache License 2.0

which dataset Yunitator is trained on #62

Closed: jihopark closed this issue 6 years ago

jihopark commented 6 years ago

It would be important to specify which dataset Yunitator is trained on (or am I the only one who is missing this info?), because users may try out samples that are included in the training dataset.

I am testing the library out with samples from HomeBank (VanDam-5minutes) and found that Yunitator works better than DiarTK. I was delighted but got suspicious about whether the sample was drawn from the training set. I hope the performance comes from the fact that it is trained on other datasets with a distribution similar to HomeBank.

I would appreciate it if you can clarify here!

alecristia commented 6 years ago

Thanks, this is indeed on the to-do list! #39

riebling commented 6 years ago

In one word, the Yunitator was trained on the "noiseme" corpus:

S. Burger, Q. Jin, P. F. Schulam, and F. Metze, “Noisemes: manual annotation of environmental noise in audio streams”, technical report CMU-LTI-12-07, Carnegie Mellon University, 2012. 

A description of how this network was trained can be found in Section 3 of this paper: http://www.cs.cmu.edu/~yunwang/papers/icassp16.pdf

And improvements to it can be found in Section 3.2 of this paper: http://www.cs.cmu.edu/~yunwang/papers/icassp17.pdf

fmetze commented 6 years ago

Are we sure? I thought this question refers to the tool that does 4-way male/female/child/silence segmentation, and that tool was not trained on the “noiseme” corpus. I'm not sure, though, which tool is the rightful “Yunitator”...

jihopark commented 6 years ago

Yes, I am talking about the model @florian is referring to. I read the noisemes paper and it classifies different kinds of noise. Correct me if I am wrong!

alecristia commented 6 years ago

I'm working on the docs, here's a preview:

Yunitator

There is no reference for this tool.

General intro

Given that there is no reference for this tool, we provide a more extensive introduction based on a presentation Florian Metze gave on 2018-08-13 at an ACLEW meeting.

The data used for training were:

Talker identity annotations were collapsed into the following 4 types: male adult, female adult, child, and silence (the 4-way segmentation mentioned above).

The features were MED (multimedia event detection) features, extracted with OpenSMILE. They were extracted over 2 s windows moving in 100 ms steps. There were 6,669 dimensions at first, reduced to 50 dimensions with PCA.
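
To make those shapes concrete, here is a minimal sketch (not the actual Yunitator extraction code), assuming the per-window OpenSMILE/MED vectors arrive as an (n_windows, 6669) NumPy array; the function and constant names are ours, and only the window, hop, and dimensionality figures come from the description above.

```python
import numpy as np
from sklearn.decomposition import PCA

WIN_S, HOP_S, RAW_DIM, PCA_DIM = 2.0, 0.1, 6669, 50

def n_windows(duration_s: float) -> int:
    """Number of 2 s analysis windows at a 100 ms hop for one clip."""
    return max(0, int(round((duration_s - WIN_S) / HOP_S)) + 1)

def reduce_dims(raw_feats: np.ndarray) -> np.ndarray:
    """Project (n_windows, 6669) MED features down to 50 dims with PCA.
    In the real pipeline the PCA is fit once on all training frames and
    re-used (see the PCA discussion further down); here we fit on the
    given array purely to illustrate the shapes."""
    assert raw_feats.shape[1] == RAW_DIM
    return PCA(n_components=PCA_DIM).fit_transform(raw_feats)

# e.g. a 60 s clip yields n_windows(60.0) == 581 feature frames of 50 dims
```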

The model was an RNN with 1 bidirectional GRU layer and 200 units in each direction. There was a softmax output layer, which therefore does not predict overlaps.
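
For illustration, a sketch of that architecture in PyTorch, under the assumption of 50-dim input features and 4 output classes; the class and variable names are invented here and do not come from the Yunitator repository.

```python
import torch
import torch.nn as nn

class TalkerTypeRNN(nn.Module):
    def __init__(self, in_dim: int = 50, hidden: int = 200, n_classes: int = 4):
        super().__init__()
        self.gru = nn.GRU(in_dim, hidden, num_layers=1,
                          batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_classes)  # 2x: both GRU directions

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, 50) PCA-reduced features
        h, _ = self.gru(x)
        # per-frame class scores; a softmax over them (applied in the loss)
        # assigns exactly one label per frame, hence no overlap prediction
        return self.out(h)
```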

The training regime used 5-fold cross-validation, with 5 models trained on 4/5 of the data and tested on the remainder. The outputs are pooled together to measure performance. The final model was trained on all the data.
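
A rough sketch of that regime, with `train_model`, `predict`, and `score_f1` as hypothetical placeholders for whatever training and scoring routines are actually used:

```python
from sklearn.model_selection import KFold

def cross_validated_score(clips, labels, train_model, predict, score_f1):
    """Train 5 models on 4/5 of the clips each and score the pooled
    out-of-fold predictions, as described above."""
    pooled_pred, pooled_true = [], []
    for train_idx, test_idx in KFold(n_splits=5, shuffle=True).split(clips):
        model = train_model([clips[i] for i in train_idx],
                            [labels[i] for i in train_idx])
        pooled_pred += [predict(model, clips[i]) for i in test_idx]
        pooled_true += [labels[i] for i in test_idx]
    # performance is measured on the pooled outputs of all 5 folds
    return score_f1(pooled_true, pooled_pred)
```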

The loss function was cross-entropy with classes weighted by 1/prior. The batch size was 5 sequences of 625 frames (to accommodate the fact that many of the clips were 1 minute long). The optimizer was Adam, the initial learning rate was 0.001, and the schedule multiplied the learning rate by 0.999 every epoch.
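
Put together, those hyperparameters could look roughly like this in PyTorch (again a sketch, not the actual training script): the model is the `TalkerTypeRNN` sketch from the previous block, and `loader`, `priors`, and `n_epochs` are placeholders, not values taken from the Yunitator repo.

```python
import torch
import torch.nn as nn

def train(model: nn.Module, loader, priors: torch.Tensor, n_epochs: int = 50):
    # cross entropy with classes weighted by 1/prior
    criterion = nn.CrossEntropyLoss(weight=1.0 / priors)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # initial LR .001
    # multiply the learning rate by 0.999 after every epoch
    scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.999)

    for _ in range(n_epochs):
        # each batch: 5 sequences of 625 frames of 50-dim features,
        # with one label in {0, 1, 2, 3} per frame
        for feats, labels in loader:          # feats: (5, 625, 50)
            optimizer.zero_grad()
            logits = model(feats)             # (5, 625, 4)
            loss = criterion(logits.reshape(-1, 4), labels.reshape(-1))
            loss.backward()
            optimizer.step()
        scheduler.step()
    return model
```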

The resulting F1 scores for the key classes were:

jihopark commented 6 years ago

Thank you so much @alecristia. As I guessed, the sample I used for testing was in the training data, so it was performing very well.

alecristia commented 6 years ago

The new version of the docs is up.

riebling commented 6 years ago

I realized as soon as I posted that the training for "Yunitator" was different from that for "Noisemes". Even though both were created by Yun, they are different models and predict different classes, and Yun's answer, which I copy-pasted from elsewhere, was for the wrong flavour of tool.

jihopark commented 6 years ago

@riebling sorry if I ask too many questions.

I found out that you are doing a PCA transformation after extracting features with OpenSMILE. The weights are saved in https://github.com/srvk/Yunitator/pca-self.pkl. Can you give me more details about how these weights were trained? On which dataset? Were they trained on an aggregate of all the vectors (one vector per frame) from the training audios?

alecristia commented 6 years ago

Here's Yun's answer:

"The PCA matrix was computed from all the training frames. ... Yun Wang and Florian Metze, "A first attempt at polyphonic sound event detection using connectionist temporal classification", in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2986-2990, Mar. 2017. "