Closed MarvinLvn closed 5 years ago
final decision on this was to remove all references to yunisegs from the doc and remove the tool, as it is inappropriate
so we have no need to run Yunitator as diarization only, given pre-computed SAD? (And maybe it should have been called YuniSad ... oh wait no that's worse. YuniDiar?) ok by me, it was worth trying out, experimentally!
I've run some tests to evaluate Yunitator on the Tsimane LENA dataset. Below is the confusion matrix I got from the analysis:
To get it, I "framerized" the output of each model with a frame length of 10 ms.
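For what it's worth, here is a minimal sketch of what I mean by "framerizing": turning segment annotations (onset, duration, label) into a sequence of 10 ms frame labels. The function name, the `SIL` fallback label, and the "last segment wins" overlap policy are my own illustrative assumptions, not the actual evaluation code.

```python
FRAME_LEN = 0.010  # frame length in seconds (10 ms)

def framerize(segments, total_duration, frame_len=FRAME_LEN):
    """Convert (onset_sec, duration_sec, label) segments into frame labels.

    Frames not covered by any segment are labelled SIL; when segments
    overlap, the last one in the list wins (illustrative choice).
    """
    n_frames = int(round(total_duration / frame_len))
    frames = ["SIL"] * n_frames
    for onset, duration, label in segments:
        start = int(round(onset / frame_len))
        end = int(round((onset + duration) / frame_len))
        for i in range(start, min(end, n_frames)):
            frames[i] = label
    return frames

frames = framerize([(0.0, 0.05, "FEM"), (0.05, 0.03, "CHI")], 0.1)
# → 5 FEM frames, then 3 CHI frames, then 2 SIL frames
```

Once both the gold and the system output are framerized this way, the confusion matrix is just a count over aligned frame pairs.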
A reminder to guide you:

- Cell D3 reads as "the raw flavor of Yunitator classified female speech as child speech 11803 times".
- Cell G3 reads as "54% of the frames classified as child speech by the raw flavor are correct" (precision).
- Cell E7 reads as "20% of the gold MAL frames are correctly classified by the raw flavor" (recall).
- Cell H3 reads as "the raw flavor classified a frame as child speech 93386 times in total".
- Cell E8 reads as "31147 frames correspond to male adult speech in the gold rttm".
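To make the G3/E7-style figures concrete, here is how precision and recall fall out of a confusion matrix with rows = gold and columns = predicted. The matrix values below are made up for illustration; they are not the numbers from the spreadsheet.

```python
import numpy as np

labels = ["CHI", "FEM", "MAL", "SIL"]

# Toy confusion matrix of frame counts: conf[i, j] = number of frames
# whose gold class is labels[i] and predicted class is labels[j].
# Purely illustrative values.
conf = np.array([
    [50, 10,  5,  5],
    [12, 60,  8, 10],
    [ 6,  9, 40,  5],
    [ 4,  6,  2, 88],
])

# Precision: correct predictions over total predictions per class
# (diagonal over column sums) -- the G3-style percentage.
precision = conf.diagonal() / conf.sum(axis=0)

# Recall: correct predictions over gold totals per class
# (diagonal over row sums) -- the E7-style percentage.
recall = conf.diagonal() / conf.sum(axis=1)
```

The column sums play the role of cell H3 (total frames predicted as a class) and the row sums the role of cell E8 (total gold frames of a class).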
What is striking to me is that the FEM class is over-represented when running Yunitator on segments already classified as speech (by the gold, or by toCombo). I did some maths to illustrate my point (leaving the SIL class out of the computation):
We observed that the FEM class represents 27% of the gold rttm file, which is not too far from the results we got with Yunitator raw.
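The 27% figure comes from comparing per-class durations in the gold rttm, silence excluded. A small sketch of that computation (the field positions follow the standard RTTM layout; the sample lines and label names are invented for illustration):

```python
from collections import defaultdict

def class_proportions(rttm_lines, ignore=("SIL",)):
    """Share of total duration per label in an RTTM, ignoring some labels.

    Standard RTTM fields: type, file, channel, onset, duration,
    <NA>, <NA>, label, <NA>, <NA>.
    """
    durations = defaultdict(float)
    for line in rttm_lines:
        fields = line.split()
        if not fields or fields[0] != "SPEAKER":
            continue
        duration, label = float(fields[4]), fields[7]
        if label not in ignore:
            durations[label] += duration
    total = sum(durations.values())
    return {label: d / total for label, d in durations.items()}

# Invented two-segment example: FEM covers 2 s out of 8 s of speech.
props = class_proportions([
    "SPEAKER file 1 0.00 2.00 <NA> <NA> FEM <NA> <NA>",
    "SPEAKER file 1 2.00 6.00 <NA> <NA> CHI <NA> <NA>",
])
# → {'FEM': 0.25, 'CHI': 0.75}
```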
An immediate explanation for why female speech is over-represented when Yunitator is fed speech-only segments could be that some normalization/scaling step is applied to the input, with coefficients computed during the training phase (on "natural" recordings containing several speakers, some silence, etc.). Or maybe the PCA coefficients?
In light of this analysis, I'm not sure it's relevant to let our users apply the same model to different kinds of data: natural long-form recordings and short speech-only recordings. If our model performs well on one, it will perform poorly on the other.