m-bain / whisperX

WhisperX: Automatic Speech Recognition with Word-level Timestamps (& Diarization)
BSD 2-Clause "Simplified" License

Details about how to compute Recall/Precision exactly as in the paper? #329

Open Jeronymous opened 1 year ago

Jeronymous commented 1 year ago

Is it possible to see the code used to compute the evaluation metrics as in the paper ("WhisperX: Time-Accurate Speech Transcription of Long-Form Audio")? (as the devil's in the details...)

Jeronymous commented 1 year ago

As I wrote in #125:

The paper only mentions how the "true positives" are counted, which is kind of the easy part. But what is done when a target word corresponds to several predicted words (for instance)? How are silence segments taken into account exactly? I'm discussing this with a colleague and we are imagining several implementations that make sense. Clarifying this would be super helpful for reproducing the benchmark.

It would be awesome to have an update of the paper regarding this.

prevotlaurent commented 1 year ago

Interested to hear about this issue as well. As it is currently explained, the metric seems interesting, but there are several ways (with different results) to implement it.

Jeronymous commented 1 year ago

@m-bain Any chance to have an answer on this?

Even an isolated, non-working piece of code would help us understand how Recall/Precision are exactly computed.

Thanks

m-bain commented 1 year ago

Hi,

Sorry for the delay.

It is simply calculated as follows:

Precision:

For each predicted word, check if an exact string match (normalized) occurs in the list of ground truth words -- within collar margin. If yes, count as a TP.

Recall:

For each ground truth word, do the same as above, but check against the list of predicted words.
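
A minimal sketch of that procedure in Python, for illustration only (not the authors' evaluation code). It assumes each word is a dict with `word`, `start`, and `end` keys, that "within collar margin" means the predicted and reference timestamps differ by at most a fixed `collar` in seconds (the 0.5 s default here is an assumption), and that normalization is simple lowercasing plus punctuation stripping:

```python
import string


def normalize(text: str) -> str:
    # Simple normalization: lowercase and strip punctuation (assumption).
    return text.lower().strip().translate(str.maketrans("", "", string.punctuation))


def matches(pred: dict, ref: dict, collar: float) -> bool:
    # Exact (normalized) string match, with start/end times within the collar.
    return (
        normalize(pred["word"]) == normalize(ref["word"])
        and abs(pred["start"] - ref["start"]) <= collar
        and abs(pred["end"] - ref["end"]) <= collar
    )


def precision_recall(pred_words: list, ref_words: list, collar: float = 0.5):
    # Precision: fraction of predicted words that match some ground truth word.
    tp_pred = sum(any(matches(p, r, collar) for r in ref_words) for p in pred_words)
    # Recall: fraction of ground truth words matched by some predicted word.
    tp_ref = sum(any(matches(p, r, collar) for p in pred_words) for r in ref_words)
    precision = tp_pred / len(pred_words) if pred_words else 0.0
    recall = tp_ref / len(ref_words) if ref_words else 0.0
    return precision, recall
```

Note this sketch counts a match for every word independently, so one reference word could satisfy several predicted words (and vice versa); whether matches should instead be one-to-one is exactly the ambiguity raised above.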