Open nryant opened 1 year ago
Adding back a "parallel processing" entry in the documentation (mentioning your joblib example, for instance) would be nice to have as well!
https://github.com/pyannote/pyannote-metrics/commit/4c1be0115801297533f258c94266db2d8cff17cb
Overview
Adds the following methods to `BaseMetric` to support summation:
- `__add__`
- `__radd__`
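To illustrate why both methods are needed, here is a minimal sketch using a hypothetical `ToyErrorRate` class. It is a stand-in, not the actual `BaseMetric` implementation, and the `errors` and `total` fields are assumptions made for the example. `__add__` merges two sets of accumulated statistics, while `__radd__` handles the integer `0` that the built-in `sum()` starts from.

```python
class ToyErrorRate:
    """Hypothetical stand-in for a metric that accumulates sufficient statistics."""

    def __init__(self, errors=0.0, total=0.0):
        self.errors = errors  # accumulated error duration (assumed field)
        self.total = total    # accumulated reference duration (assumed field)

    def __add__(self, other):
        # metric + metric: add the accumulated statistics component-wise.
        if not isinstance(other, ToyErrorRate):
            return NotImplemented
        return ToyErrorRate(self.errors + other.errors, self.total + other.total)

    def __radd__(self, other):
        # sum() starts from the integer 0, so `0 + metric` is dispatched here.
        if other == 0:
            return self
        return NotImplemented

    def rate(self):
        # Error rate computed from the pooled statistics.
        return self.errors / self.total if self.total else 0.0


a = ToyErrorRate(errors=2.0, total=10.0)
b = ToyErrorRate(errors=1.0, total=30.0)
combined = sum([a, b])  # works only because __radd__ accepts the initial 0
print(combined.rate())
```

Note that the combined rate is computed from the pooled statistics, which is generally not the same as averaging the per-file rates.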
Motivation
The motivation is two-fold:
Suppose we want to compute metrics for a large volume of data, large enough that we would like to parallelize the collection of sufficient statistics. Computing the sufficient statistics using `multiprocessing` or `joblib` is straightforward. However, the sufficient statistics for DER computation are then spread across the file-level metrics. With `__add__` and `__radd__` in place, combining these into a single instance reduces to summing the file-level metric instances.
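The pattern can be sketched as follows. This is an illustration under stated assumptions rather than the code from this PR: it uses the standard library's `ThreadPoolExecutor` in place of `joblib`, and `FileStats`, `score_file`, and the numbers in `files` are hypothetical placeholders for a real metric and real scoring.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass
class FileStats:
    """Hypothetical per-file sufficient statistics (not the pyannote.metrics API)."""
    errors: float = 0.0
    total: float = 0.0

    def __add__(self, other):
        return FileStats(self.errors + other.errors, self.total + other.total)

    def __radd__(self, other):
        # Lets sum() start from its default initial value of 0.
        return self if other == 0 else NotImplemented


def score_file(item):
    # Placeholder for the expensive per-file scoring step.
    errors, total = item
    return FileStats(errors, total)


# Made-up (error duration, reference duration) pairs, one per file.
files = [(2.0, 10.0), (1.0, 30.0), (5.0, 60.0)]

# Map: collect per-file statistics in parallel (results keep input order).
with ThreadPoolExecutor(max_workers=2) as pool:
    per_file = list(pool.map(score_file, files))

# Reduce: a single sum() pools the statistics into one instance.
overall = sum(per_file)
overall_rate = overall.errors / overall.total
```

The per-file instances remain available in `per_file` for file-level reporting, while `overall` carries the pooled statistics.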
Suppose we want to compute metrics not just overall and at the file level, but by various logical subdivisions, e.g., DIHARD III domains. This is now trivial using `pandas` dataframes. E.g., suppose `data` contains the following columns:
- `file_id`: file id
- `domain`: domain the file is from
- `der`: instance of `DiarizationErrorRate`

Then grouping `data` by `domain` and summing the `der` column gives domain-level metrics, from which suitable reports may be generated.
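The grouping step can be sketched without `pandas` to show what the per-domain summation amounts to. The `Stats` class, the domain labels, and the numbers below are all hypothetical; a list of dictionaries stands in for the rows of `data`.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Stats:
    """Hypothetical summable metric standing in for a DiarizationErrorRate instance."""
    errors: float = 0.0
    total: float = 0.0

    def __add__(self, other):
        return Stats(self.errors + other.errors, self.total + other.total)

    def __radd__(self, other):
        # Lets sum() start from its default initial value of 0.
        return self if other == 0 else NotImplemented


# Made-up rows mirroring the file_id / domain / der columns.
rows = [
    {"file_id": "f1", "domain": "domain_a", "der": Stats(2.0, 10.0)},
    {"file_id": "f2", "domain": "domain_a", "der": Stats(1.0, 30.0)},
    {"file_id": "f3", "domain": "domain_b", "der": Stats(5.0, 60.0)},
]

# Group the per-file metrics by domain.
by_domain = defaultdict(list)
for row in rows:
    by_domain[row["domain"]].append(row["der"])

# Sum within each group to obtain one pooled metric per domain.
domain_stats = {domain: sum(metrics) for domain, metrics in by_domain.items()}
```

The same grouping key could be swapped for any other column to produce reports along a different subdivision.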