Example of diarization evaluation output:
All metrics that have been integrated so far (a quick usage sketch follows the list):
Diarization (clustering of speaker roles):
Detection (speech activity detection):
Identification (identification of speaker roles):
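Just to make this concrete, here is a minimal sketch of how the three metrics are computed with pyannote.metrics. The toy reference/hypothesis annotations and the CHI/FEM labels are made-up placeholders, not our actual files:

```python
from pyannote.core import Annotation, Segment
from pyannote.metrics.diarization import DiarizationErrorRate
from pyannote.metrics.detection import DetectionErrorRate
from pyannote.metrics.identification import IdentificationErrorRate

# Toy ground truth: two speakers (labels are placeholders).
reference = Annotation()
reference[Segment(0.0, 10.0)] = 'CHI'
reference[Segment(12.0, 20.0)] = 'FEM'

# Toy system output, with its own (anonymous) cluster labels.
hypothesis = Annotation()
hypothesis[Segment(0.0, 9.0)] = 'A'
hypothesis[Segment(11.0, 20.0)] = 'B'

# Diarization: clusters are mapped to reference speakers before scoring.
print('DER =', DiarizationErrorRate()(reference, hypothesis))

# Detection: only speech vs. non-speech matters, labels are ignored.
print('DetER =', DetectionErrorRate()(reference, hypothesis))

# Identification: labels must match exactly, no mapping is computed.
print('IER =', IdentificationErrorRate()(reference, hypothesis))
```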
A visualization module that extracts the 1-minute chunk with the highest volubility:
--> Still needs to be improved (I spent an hour trying to set the segments' transparency, without success! T_T)
Well, I think it's done! I'm closing this issue.
Following Alex's advice, I just read the following paper:
https://github.com/pyannote/pyannote-metrics/blob/master/docs/pyannote-metrics.pdf
which presents pyannote.metrics, a toolkit for evaluating speaker diarization systems. It basically gets rid of a lot of the things I don't like in the current pipeline:
The fact that the diarization evaluation relies only on the DER. The first step of this metric consists of computing an optimal one-to-one mapping between reference and hypothesis speakers. But this raises some questions: what happens when the number of classes differs between the reference and the hypothesis file? What is the impact on the metric when the one-to-one mapping fails? Moreover, in some cases we are throwing away information we already have: when we know the mapping between the reference and hypothesis classes, we should use it rather than recompute it (see the first sketch after this list).
Diagnostic capabilities: currently, there are none. What kinds of errors does our model make? Does it over-segment the audio (many short clusters), as LENA seems to do (even though we don't have any metric to prove it)? Does it under-segment it (few long clusters)? On which specific speakers does our model fail? These are interesting questions that can help solve a specific problem (when improving or assessing a model); see the second sketch after this list.
The wall between diarization and speech activity detection evaluation. SAD and diarization are similar tasks, and the difference in the evaluation pipeline should only appear at the very end (in the choice of metrics).
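On the mapping point: here is a minimal sketch (toy annotations again, assuming pyannote.metrics is installed) of how the optimal one-to-one mapping can be inspected, and of how the case where the mapping is already known reduces to identification scoring instead of diarization scoring:

```python
from pyannote.core import Annotation, Segment
from pyannote.metrics.diarization import DiarizationErrorRate
from pyannote.metrics.identification import IdentificationErrorRate

reference = Annotation()
reference[Segment(0.0, 10.0)] = 'CHI'
reference[Segment(12.0, 20.0)] = 'FEM'

# Hypothesis with more clusters than reference speakers: the
# one-to-one mapping leaves one cluster unmapped, and the speech
# it covers is counted as confusion.
hypothesis = Annotation()
hypothesis[Segment(0.0, 5.0)] = 'A'
hypothesis[Segment(5.0, 10.0)] = 'C'
hypothesis[Segment(12.0, 20.0)] = 'B'

der = DiarizationErrorRate()
print('optimal mapping:', der.optimal_mapping(reference, hypothesis))
print('DER =', der(reference, hypothesis))

# When the reference <-> hypothesis mapping is already known,
# apply it ourselves and score with an identification metric:
known_mapping = {'A': 'CHI', 'C': 'CHI', 'B': 'FEM'}
mapped = hypothesis.rename_labels(mapping=known_mapping)
print('IER =', IdentificationErrorRate()(reference, mapped))
```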
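And on the diagnostics point: pyannote.metrics can return the individual error components rather than a single number, which is exactly the kind of breakdown we are missing. A minimal sketch (the toy annotations and the 'toy_file' uri are made up):

```python
from pyannote.core import Annotation, Segment
from pyannote.metrics.diarization import DiarizationErrorRate

# Give the toy annotations a uri so that the report is indexed by file.
reference = Annotation(uri='toy_file')
reference[Segment(0.0, 10.0)] = 'CHI'
reference[Segment(12.0, 20.0)] = 'FEM'

hypothesis = Annotation(uri='toy_file')
hypothesis[Segment(0.0, 9.0)] = 'A'
hypothesis[Segment(11.0, 20.0)] = 'B'

metric = DiarizationErrorRate()

# detailed=True returns the individual error components
# (false alarm, missed detection, confusion, ...) instead of a
# single rate: we see *how* the system fails, not just how much.
print(metric(reference, hypothesis, detailed=True))

# Results accumulate inside the metric object; report() aggregates
# them into a pandas DataFrame, one row per evaluated file.
print(metric.report(display=False))
```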