Example of diarization evaluation output:
All metrics that have been integrated so far (a quick usage sketch follows the list):
Diarization (clustering of speaker roles):
Detection (speech activity detection):
Identification (identification of speaker roles):
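Just to make this concrete, here is a minimal sketch of how the three metrics are computed with pyannote.metrics. The toy reference/hypothesis annotations and the CHI/FEM labels are made-up placeholders, not our actual files:

```python
from pyannote.core import Annotation, Segment
from pyannote.metrics.diarization import DiarizationErrorRate
from pyannote.metrics.detection import DetectionErrorRate
from pyannote.metrics.identification import IdentificationErrorRate

# Toy ground truth: two speakers (labels are placeholders).
reference = Annotation()
reference[Segment(0.0, 10.0)] = 'CHI'
reference[Segment(12.0, 20.0)] = 'FEM'

# Toy system output, with its own (anonymous) cluster labels.
hypothesis = Annotation()
hypothesis[Segment(0.0, 9.0)] = 'A'
hypothesis[Segment(11.0, 20.0)] = 'B'

# Diarization: clusters are mapped to reference speakers before scoring.
print('DER =', DiarizationErrorRate()(reference, hypothesis))

# Detection: only speech vs. non-speech matters, labels are ignored.
print('DetER =', DetectionErrorRate()(reference, hypothesis))

# Identification: labels must match exactly, no mapping is computed.
print('IER =', IdentificationErrorRate()(reference, hypothesis))
```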
A visualization module that extracts the 1-minute chunk with the highest volubility:
--> Still needs to be improved (I spent an hour trying to set the segments' transparency, without success! T_T)
Well, I think it's done! I'm closing this issue.
Following Alex's advice, I just read the following paper:
https://github.com/pyannote/pyannote-metrics/blob/master/docs/pyannote-metrics.pdf
which presents pyannote.metrics, a toolkit for evaluating speaker diarization systems. It basically gets rid of a lot of the things I don't like in the current pipeline:
The fact that the diarization evaluation relies only on the DER. The first step of this metric consists of computing an optimal one-to-one mapping between reference and hypothesis speakers. But this raises some questions: what happens when the number of classes differs between the reference and the hypothesis file? What is the impact on the metric when the one-to-one mapping fails? Moreover, in some cases we are throwing away information we already have: when we know the mapping between the reference and hypothesis classes, we should use it rather than recompute it (see the first sketch after this list).
Diagnostic capabilities: currently, there are none. What kinds of errors does our model make? Does it over-segment the audio (many short clusters), as LENA seems to do (even though we don't have any metric to prove it)? Does it under-segment it (few long clusters)? On which specific speakers does our model fail? These are interesting questions that can help solve a specific problem (when improving or assessing a model); see the second sketch after this list.
The wall between diarization and speech activity detection evaluation. SAD and diarization are similar tasks, and the difference in the evaluation pipeline should only appear at the very end (in the choice of metrics).
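On the mapping point: here is a minimal sketch (toy annotations again, assuming pyannote.metrics is installed) of how the optimal one-to-one mapping can be inspected, and of how the case where the mapping is already known reduces to identification scoring instead of diarization scoring:

```python
from pyannote.core import Annotation, Segment
from pyannote.metrics.diarization import DiarizationErrorRate
from pyannote.metrics.identification import IdentificationErrorRate

reference = Annotation()
reference[Segment(0.0, 10.0)] = 'CHI'
reference[Segment(12.0, 20.0)] = 'FEM'

# Hypothesis with more clusters than reference speakers: the
# one-to-one mapping leaves one cluster unmapped, and the speech
# it covers is counted as confusion.
hypothesis = Annotation()
hypothesis[Segment(0.0, 5.0)] = 'A'
hypothesis[Segment(5.0, 10.0)] = 'C'
hypothesis[Segment(12.0, 20.0)] = 'B'

der = DiarizationErrorRate()
print('optimal mapping:', der.optimal_mapping(reference, hypothesis))
print('DER =', der(reference, hypothesis))

# When the reference <-> hypothesis mapping is already known,
# apply it ourselves and score with an identification metric:
known_mapping = {'A': 'CHI', 'C': 'CHI', 'B': 'FEM'}
mapped = hypothesis.rename_labels(mapping=known_mapping)
print('IER =', IdentificationErrorRate()(reference, mapped))
```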
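And on the diagnostics point: pyannote.metrics can return the individual error components rather than a single number, which is exactly the kind of breakdown we are missing. A minimal sketch (the toy annotations and the 'toy_file' uri are made up):

```python
from pyannote.core import Annotation, Segment
from pyannote.metrics.diarization import DiarizationErrorRate

# Give the toy annotations a uri so that the report is indexed by file.
reference = Annotation(uri='toy_file')
reference[Segment(0.0, 10.0)] = 'CHI'
reference[Segment(12.0, 20.0)] = 'FEM'

hypothesis = Annotation(uri='toy_file')
hypothesis[Segment(0.0, 9.0)] = 'A'
hypothesis[Segment(11.0, 20.0)] = 'B'

metric = DiarizationErrorRate()

# detailed=True returns the individual error components
# (false alarm, missed detection, confusion, ...) instead of a
# single rate: we see *how* the system fails, not just how much.
print(metric(reference, hypothesis, detailed=True))

# Results accumulate inside the metric object; report() aggregates
# them into a pandas DataFrame, one row per evaluated file.
print(metric.report(display=False))
```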