pyannote / pyannote-metrics

A toolkit for reproducible evaluation, diagnostic, and error analysis of speaker diarization systems
http://pyannote.github.io/pyannote-metrics
MIT License
186 stars, 33 forks

Add summation #67

Open nryant opened 1 year ago

nryant commented 1 year ago

Overview

Adds the following methods to BaseMetric to support summation:

Motivation

The motivation is two-fold:

  1. multiprocessing

Suppose we want to compute metrics for a large volume of data -- sufficiently large that we would like to parallelize the collection of sufficient statistics. Computing the sufficient statistics using multiprocessing or joblib is straightforward; e.g., using joblib:

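from joblib import Parallel, delayed
from pyannote.metrics.diarization import DiarizationErrorRate

def process_one(reference, hypothesis, uem):
    """Score one file and return the metric holding its sufficient statistics."""
    der = DiarizationErrorRate()
    # Calling the metric accumulates this file's statistics inside `der`.
    der(reference, hypothesis, uem=uem)
    return der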

def main():
    # Mapping from URIs to reference annotations.
    reference_anns = ...

    # Mapping from URIs to hypothesis annotations.
    hypothesis_anns = ...

    # Mapping from URIs to scoring regions.
    uems = ...

    # Number of parallel processes to use.
    n_jobs = 10

    # Actual joblib call.
    f = delayed(process_one)
    file_ders = Parallel(n_jobs)(f(reference_anns[uri], hypothesis_anns[uri], uems[uri])
                                 for uri in reference_anns)

However, the sufficient statistics for DER computation are now spread across the file-level metrics. Combining these into a single instance reduces to:

combined_der = sum(file_ders)
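
As a rough illustration of why `sum()` can work on metric instances, here is a minimal sketch using a toy class. `ToyErrorRate`, its attributes, and its `rate()` method are hypothetical stand-ins, not the BaseMetric implementation from this pull request. The sketch assumes the sufficient statistics are additive, and it relies on the fact that the built-in `sum()` starts from the integer 0, so the class has to handle `0 + metric` as well as `metric + metric`.

```python
class ToyErrorRate:
    """Toy metric that accumulates sufficient statistics (hypothetical)."""

    def __init__(self, errors=0.0, total=0.0):
        self.errors = errors  # accumulated error duration
        self.total = total    # accumulated reference duration

    def __call__(self, errors, total):
        """Accumulate one file's statistics and return its file-level rate."""
        self.errors += errors
        self.total += total
        return errors / total

    def __add__(self, other):
        # Adding two metrics adds their sufficient statistics component-wise.
        if not isinstance(other, ToyErrorRate):
            return NotImplemented
        return ToyErrorRate(self.errors + other.errors, self.total + other.total)

    def __radd__(self, other):
        # sum() starts from the integer 0, so 0 + metric returns the metric.
        if isinstance(other, int) and other == 0:
            return self
        return NotImplemented

    def rate(self):
        """Overall rate computed from the accumulated statistics."""
        return self.errors / self.total


# One metric instance per file, as in the joblib example.
file_metrics = []
for errors, total in [(2.0, 10.0), (1.0, 30.0)]:
    metric = ToyErrorRate()
    metric(errors, total)
    file_metrics.append(metric)

combined = sum(file_metrics)
print(combined.rate())  # ratio of summed statistics: 3.0 / 40.0
```

Note that the combined rate is the ratio of the summed statistics, not the mean of the per-file rates, so long files weigh more than short ones.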

  2. aggregation

Suppose we want to compute metrics not just overall and at the file level, but also by various logical subdivisions; e.g., DIHARD III domains. This is now trivial using pandas dataframes. For example, suppose data contains the following columns:

then:

data.groupby('domain', as_index=False).sum()

will contain domain-level metrics, from which suitable reports may be generated.
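
To make the aggregation step concrete, here is a small sketch with a made-up table. The column names (`uri`, `domain`, `error`, `total`), the domain labels, and all values are assumptions for illustration only; they are not the columns from the original description, and the statistics are stored as plain numbers rather than metric instances.

```python
import pandas as pd

# Hypothetical file-level table: column names and values are made up.
data = pd.DataFrame({
    "uri": ["file1", "file2", "file3"],
    "domain": ["meeting", "meeting", "clinical"],
    "error": [2.0, 1.0, 4.0],     # error duration in seconds
    "total": [10.0, 30.0, 20.0],  # reference speech duration in seconds
})

# Sum only the numeric sufficient statistics within each domain.
by_domain = data.groupby("domain", as_index=False)[["error", "total"]].sum()

# Derive the rate after summation, from the domain-level totals.
by_domain["rate"] = by_domain["error"] / by_domain["total"]
print(by_domain)
```

Selecting the statistic columns before calling `sum()` keeps the string-valued `uri` column out of the aggregation. Computing the rate only after summing gives a duration-weighted domain-level figure, consistent with summing the metrics themselves.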

hbredin commented 1 year ago

I have struggled in the past to make pyannote.metrics work well in a distributed way.

Your proposed solution is great! Thanks @nryant.

hbredin commented 1 year ago

Adding back a "parallel processing" entry in the documentation (mentioning your joblib example, for instance) would be nice to have as well!

https://github.com/pyannote/pyannote-metrics/commit/4c1be0115801297533f258c94266db2d8cff17cb